Nash
May 1, 2020, 8:53pm
1
I am trying to do the following:
Open a .txt file placed at a specific directory
Find the location(s) of all occurrences of a specific string
An example of the .txt file that I am working with is found at this link (link to .txt).
I have tried the following:
f = open("C:/Users/me/data/1.122430932.txt")
pattern = r"ltp"
target = f
m = match(pattern, target)
However, I get the error "MethodError: no method matching match(::Regex, ::IOStream)".
What is the correct procedure?
Use read to read the whole file as a string, if this is not a problem for memory. Also, prefer the version of open that takes a block, to avoid keeping the file open.
f = open("C:/Users/me/data/1.122430932.txt")
pattern = r"ltp"
target = read(f, String)
m = match(pattern, target)
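For reference, here is a minimal sketch of the block form of open mentioned above. The helper name read_text and the file contents are made up for illustration, and a temporary file stands in for the real path so the snippet runs anywhere.

```julia
# Hypothetical helper: read a whole file into a String using the block form
# of `open`, so the file is closed automatically when the block ends.
function read_text(path::AbstractString)
    open(path, "r") do io
        read(io, String)
    end
end

# Demo on a temporary file (the contents are invented, not the real data).
path, io = mktemp()
write(io, "{\"op\":\"mcm\",\"ltp\":1.5}")
close(io)

target = read_text(path)      # read(path, String) gives the same result
m = match(r"ltp", target)
@show m.match m.offset
```

If you do not need the stream for anything else, read(path, String) does the open, read, and close in one call.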
Please use triple backticks to paste code or terminal output in your posts.
```
Example.
```
becomes
Example.
Nash
May 2, 2020, 8:28am
3
When I run the suggested code, I do not get any information about the positions of the target string. But the code does not throw an error. I changed the search criteria to “op” because, as can be seen in the image below, “op” is clearly part of the string.
I was expecting to get a list of positions.
Using eachmatch is the way to go here. Consider:
s = """Turnip greens yarrow ricebean rutabaga endive cauliflower sea lettuce kohlrabi amaranth water spinach avocado daikon napa cabbage asparagus winter purslane kale. Celery potato scallion desert raisin horseradish spinach carrot soko. Lotus root water spinach fennel kombu maize bamboo shoot green bean swiss chard seakale pumpkin onion chickpea gram corn pea. Brussels sprout coriander water chestnut gourd swiss chard wakame kohlrabi beetroot carrot watercress. Corn amaranth salsify bunya nuts nori azuki bean chickweed potato bell pepper artichoke. Turnip greens yarrow ricebean rutabaga endive cauliflower sea lettuce kohlrabi amaranth water spinach avocado daikon napa cabbage asparagus winter purslane kale. Celery potato scallion desert raisin horseradish spinach carrot soko. Lotus root water spinach fennel kombu maize bamboo shoot green bean swiss chard seakale pumpkin onion chickpea gram corn pea. Brussels sprout coriander water chestnut gourd swiss chard wakame kohlrabi beetroot carrot watercress. Corn amaranth salsify bunya nuts nori azuki bean chickweed potato bell pepper artichoke. """
Which is the same text twice:
for m in eachmatch(r"bamboo", s)
@show m
end
Right, it finds two of those. But where are they? You can retrieve that from a RegexMatch object:
for m in eachmatch(r"bamboo", s)
@show m.offset
end
Does that work?
len = length("bamboo")
# Ranges are inclusive, so the last index is start + len - 1.
@show s[277:277+len-1]
@show s[827:827+len-1]
More generally, let’s consider a single regex match object:
m = match(r"bamboo", s)
The key fields are:
captures (here empty, no capturing group in the regex)
match (the whole match)
offset (where it is)
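To see the fields with a non-empty captures, here is a small sketch using a regex with one capturing group. The input string and the pattern are just an illustration, not taken from the original file.

```julia
# A short made-up input with one capturing group in the pattern.
text = "bamboo shoot green bean swiss chard"
m = match(r"(\w+) bean", text)

@show m.match      # the whole match: "green bean"
@show m.captures   # one entry per capturing group: "green"
@show m.offset     # index in `text` where the whole match starts: 14
@show m.offsets    # start index of each captured group: [14]
```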
You could do something like the following to get ranges:
for m in eachmatch(r"bamboo", s)
a = m.offset
b = prevind(s, m.offset + lastindex(m.match))
@show s[a:b]
end
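As an alternative sketch, findall with a pattern returns the index ranges directly. This method needs Julia 1.3 or newer, as far as I know. The short string below is made up so the ranges are easy to check by hand.

```julia
# findall(pattern, string) returns a vector of index ranges, one per match.
text = "bamboo shoot, bamboo"
ranges = findall(r"bamboo", text)

for r in ranges
    @show r text[r]
end

# The start positions alone, if that is all you need.
starts = first.(ranges)
@show starts
```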
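The prevind and lastindex calls help avoid issues if there are Unicode characters like α, because string indices in Julia are byte offsets and not every byte index is the start of a character.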
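A small sketch of what this means in practice, using a made-up string with a few two-byte characters in front of the match: m.offset is a byte index, so it can differ from the character count.

```julia
# α, β and γ each take 2 bytes in UTF-8, so byte indices run ahead of
# character positions after them.
text = "αβγ bamboo"
m = match(r"bamboo", text)

byte_start = m.offset                  # byte index of the match start
char_start = length(text, 1, m.offset) # position counted in characters

a = m.offset
b = prevind(text, m.offset + lastindex(m.match))
@show byte_start char_start text[a:b]
```

Here byte_start is 8 while char_start is 5. Slicing with a:b works because both ends are valid byte indices.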
PS : maybe to actually answer your question:
[m.offset for m in eachmatch(r"bamboo", s)]
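If the file is too large to read in one go, a line-by-line variant is possible. This is only a sketch: match_positions is a hypothetical helper name, the temporary file contents are invented, and a match that spans a line break would not be found this way.

```julia
# Hypothetical helper: collect (line number, byte offset within the line)
# for every match, reading one line at a time.
function match_positions(path::AbstractString, pattern::Regex)
    positions = Tuple{Int,Int}[]
    for (lineno, line) in enumerate(eachline(path))
        for m in eachmatch(pattern, line)
            push!(positions, (lineno, m.offset))
        end
    end
    return positions
end

# Demo on a temporary file with made-up contents.
path, io = mktemp()
write(io, "bamboo shoot\ngreen bean\nmore bamboo, bamboo\n")
close(io)

positions = match_positions(path, r"bamboo")
@show positions
```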