Find the location(s) of all occurrences of a specific string in .txt

I am trying to do the following:

  1. Open a .txt file placed at a specific directory
  2. Find the location(s) of all occurrences of a specific string

An example of the .txt file that I am working with can be found at this link: link to .txt (https://www.dropbox.com/s/l39wnahi40137ib/1.122430932.txt?dl=0)

I have tried the following:

f = open("C:/Users/me/data/1.122430932.txt")
pattern = r"ltp"
target = f
m = match(pattern, target)

However, I get the error "MethodError: no method matching match(::Regex, ::IOStream)".

What is the correct procedure?

match expects a string, not an open file handle. Use read to read the whole file into a string, if memory is not a concern. Also, prefer the form of open that takes a do block, so that the file is closed automatically.

f = open("C:/Users/me/data/1.122430932.txt")
pattern = r"ltp"
target = read(f, String)
m = match(pattern, target)
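A minimal sketch of the block form mentioned above, assuming the file fits in memory. The helper name `first_match_in_file` is made up for illustration; it is not part of Julia.

```julia
# Hypothetical helper: read the file at `path` and return the first match
# of `pattern`, or `nothing` if the pattern does not occur.
function first_match_in_file(pattern::Regex, path::AbstractString)
    # The do-block form closes the file automatically, even if an error is thrown.
    text = open(path) do f
        read(f, String)
    end
    return match(pattern, text)
end
```

It could be called as `first_match_in_file(r"ltp", "C:/Users/me/data/1.122430932.txt")`. Note that `match` only reports the first occurrence, and `read(path, String)` is a shorter way to get the same string.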

Please use triple backticks to paste code or terminal output in your posts.

```
Example.
```

becomes

Example.

When I run the suggested code, I do not get any information about the positions of the target string. But the code does not throw an error. I changed the search criteria to “op” because, as can be seen in the image below, “op” is clearly part of the string.

I was expecting to get a list of positions.

Using eachmatch is the way to go here, since match only returns the first occurrence. Consider:

s = """Turnip greens yarrow ricebean rutabaga endive cauliflower sea lettuce kohlrabi amaranth water spinach avocado daikon napa cabbage asparagus winter purslane kale. Celery potato scallion desert raisin horseradish spinach carrot soko. Lotus root water spinach fennel kombu maize bamboo shoot green bean swiss chard seakale pumpkin onion chickpea gram corn pea. Brussels sprout coriander water chestnut gourd swiss chard wakame kohlrabi beetroot carrot watercress. Corn amaranth salsify bunya nuts nori azuki bean chickweed potato bell pepper artichoke. Turnip greens yarrow ricebean rutabaga endive cauliflower sea lettuce kohlrabi amaranth water spinach avocado daikon napa cabbage asparagus winter purslane kale. Celery potato scallion desert raisin horseradish spinach carrot soko. Lotus root water spinach fennel kombu maize bamboo shoot green bean swiss chard seakale pumpkin onion chickpea gram corn pea. Brussels sprout coriander water chestnut gourd swiss chard wakame kohlrabi beetroot carrot watercress. Corn amaranth salsify bunya nuts nori azuki bean chickweed potato bell pepper artichoke. """

Which is the same text twice:

for m in eachmatch(r"bamboo", s)
  @show m
end

Right, it finds two of those. But where are they? You can retrieve that from the RegexMatch object:

for m in eachmatch(r"bamboo", s)
  @show m.offset
end

Does that work?

len = length("bamboo")
# Ranges are inclusive, so the end index is start + len - 1
@show s[277:277+len-1]
@show s[827:827+len-1]

More generally, let’s consider a single regex match object:

m = match(r"bamboo", s)

The key fields are

  • captures (here empty, no capturing group in the regex)
  • match (the whole match)
  • offset (where it is)
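To see these fields with a non-empty `captures`, here is a small sketch using a regex with one capturing group. The sample string and the pattern are made up for illustration.

```julia
text = "bamboo shoot green bean"

# One capturing group: the word right before " bean".
m = match(r"(\w+) bean", text)

whole   = m.match      # the whole match: "green bean"
groups  = m.captures   # one entry per capturing group: ["green"]
start   = m.offset     # index in `text` where the whole match starts: 14
gstarts = m.offsets    # start index of each capturing group: [14]
```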

You could do something like the following to get ranges:

for m in eachmatch(r"bamboo", s)
  a = m.offset
  b = prevind(s, m.offset + lastindex(m.match))
  @show s[a:b]
end

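prevind and lastindex keep the end index valid when the string contains multi-byte Unicode characters such as α, because the offsets are byte indices rather than character counts.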
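As a rough illustration of why this matters, the sketch below uses a made-up string in which α and β each take two bytes. The offsets reported by `eachmatch` are byte indices; `length(s, i, j)` can turn them into character positions if that is what you need.

```julia
s2 = "αβ bamboo α"

# Byte indices where each match starts (α occupies two bytes in UTF-8).
byte_offsets = [m.offset for m in eachmatch(r"α", s2)]

# Character positions: count the characters from index 1 up to each offset.
char_positions = [length(s2, 1, o) for o in byte_offsets]

@show byte_offsets     # [1, 13]
@show char_positions   # [1, 11]
```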

PS: maybe to actually answer your question:

[m.offset for m in eachmatch(r"bamboo", s)]
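If you want the full index ranges rather than only the start positions, `findall` with a string or regex pattern is another option (available in recent Julia versions, 1.3 and later as far as I know). A short sketch on a made-up string:

```julia
text2 = "bamboo shoot, bamboo leaf"

# Each element is the index range of one occurrence.
ranges = findall("bamboo", text2)

# Start positions only, comparable to the offsets from eachmatch.
starts = first.(ranges)

@show ranges   # [1:6, 15:20]
@show starts   # [1, 15]
```

Each range can be used directly for slicing, for example `text2[ranges[2]]`.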