Best way to get all substrings or numbers matching a regex

Hello,

What is the best way to get a Vector of Strings or Numbers from a regex (as with the old matchall method)?

pat = r"[+-]?\d+\.?\d*"
txt = "aaa -1 bbb +2.2 ccc 123.456 ddd"

# findall?
ranges = findall(pat, txt)
words = map(range->txt[range], ranges) # could be shorter?
numbers = parse.(Float64, words)
# =>
# 3-element Array{Float64,1}:
#   -1.0
#    2.2
#  123.456


# eachmatch?
matches = eachmatch(pat, txt)
words = getfield.(matches, :match)
numbers = parse.(Float64, words)
# => idem (ok)

# versus matchall or getall or... (;-)
numbers = parse.(Float64, matchall(pat, txt))

– Maurice

What about numbers = parse.(Float64, [ match.match for match in eachmatch(pat, txt)])

Yes thank you, but I feel that:

words = [ match.match for match in eachmatch(pat, txt)]

is not really better than:

words = getfield.(eachmatch(pat, txt), :match)

(neither is the collect() version)

Ruby has a scan method which allows you to write:

txt = "aaa -1 bbb +2.2 ccc 123.456 ddd"
words = txt.scan(/[+-]?\d+\.?\d*/)

which would allow, in Julia:

numbers = parse.(Float64, scan(pat, txt))

Those all seem like good ways. What are your criteria for “best”? If it's performance, you can check with BenchmarkTools. If you mean stylistically, that’s more about your own preference. In my own code, I had a similar problem and did

numbers = map(eachmatch(pat, txt)) do m
    parse(Float64, m.match)
end

It does it in one step without intermediate variables, and to me it’s clearer than the broadcast or anonymous function versions. I don’t think any of these is more or less idiomatic Julia though.

Side note: I had to account for numbers like “1.233e5” in my code, which wouldn’t be matched by your regex. Maybe this doesn’t occur in your input, but I thought I’d mention it.
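To illustrate the side note, here is a minimal sketch of a pattern that also accepts an optional exponent. The name sci_pat and the sample text are made up for this example, and the pattern is still not a complete float grammar (it ignores forms like ".5", "Inf" or "NaN").

```julia
# Hypothetical extended pattern: the original regex plus an optional exponent part.
sci_pat = r"[+-]?\d+\.?\d*(?:[eE][+-]?\d+)?"
sci_txt = "aaa -1 bbb +2.2 ccc 1.233e5 ddd"

# Same one-step map/do form as above, applied to the extended pattern.
sci_numbers = map(eachmatch(sci_pat, sci_txt)) do m
    parse(Float64, m.match)
end
```

With the original pattern, "1.233e5" would be split into "1.233" and "5", so the exponent group matters as soon as such values can appear in the input.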

My question was not about performance, but rather about brevity, readability and style. I agree that your solution is a good candidate, but I would have liked another way of getting all the words to be possible.
The incomplete regex for Float was just an example; the question was mainly about getting all words in some text from a regex.

– Maurice

I think the problem here is that eachmatch can’t be combined well with broadcast and parse, because eachmatch returns a Base.RegexMatchIterator, which iterates over RegexMatch objects instead of directly returning their contents, so its elements can’t be passed to parse as they are. Maybe you can define a new iterator which wraps the RegexMatch iterator and returns what you want, or you can look at other regex libraries and implement what you want in Julia?
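A wrapping iterator along those lines could look roughly like this. It is only a sketch: MatchStrings is a hypothetical name, not something from Base, and it assumes the wrapped object yields RegexMatch values, as eachmatch does.

```julia
# Hypothetical wrapper type: iterates the matched substrings of an inner
# iterator of RegexMatch objects (for example the result of eachmatch).
struct MatchStrings{I}
    inner::I
end

# The number of matches is not known in advance, and we let collect infer the eltype.
Base.IteratorSize(::Type{<:MatchStrings}) = Base.SizeUnknown()
Base.IteratorEltype(::Type{<:MatchStrings}) = Base.EltypeUnknown()

# Forward iteration to the inner iterator and return only the .match field.
function Base.iterate(it::MatchStrings, state...)
    next = iterate(it.inner, state...)
    next === nothing && return nothing
    m, newstate = next
    return (m.match, newstate)
end

pat = r"[+-]?\d+\.?\d*"
txt = "aaa -1 bbb +2.2 ccc 123.456 ddd"

words = collect(MatchStrings(eachmatch(pat, txt)))
numbers = parse.(Float64, MatchStrings(eachmatch(pat, txt)))
```

Broadcasting collects the wrapper first, so this is mostly about how the call site reads, not about avoiding allocations.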

If broadcasting array element access were allowed

ranges = findall(pat, txt)
word1 = txt[ranges[1]]
# => ok
words = txt.[ranges]
# => error ; no broadcast with .[]

We could write

numbers = parse.(Float64, txt.[findall(pat, txt)])
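For what it's worth, indexing is an ordinary function call to getindex, so it can already be broadcast with the usual dot syntax, even without a .[] form. A small sketch, assuming Julia 1.3+ for findall with a regex; Ref(txt) just makes it explicit that the string is treated as a single value and only the ranges are iterated.

```julia
pat = r"[+-]?\d+\.?\d*"
txt = "aaa -1 bbb +2.2 ccc 123.456 ddd"

# txt[r] is sugar for getindex(txt, r), so this broadcasts over the ranges only.
ranges = findall(pat, txt)
words = getindex.(Ref(txt), ranges)

# The same thing in one expression.
numbers = parse.(Float64, getindex.(Ref(txt), findall(pat, txt)))
```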

If broadcasting field access were allowed

matches = collect( eachmatch(pat, txt) )
word1 = matches[1].match
# => ok
words = getfield.(matches, :match)
# => ok
words = matches..match
# => error ; no broadcast with ..

We could write

numbers = parse.(Float64, eachmatch(pat, txt)..match )

You could just roll your own:

julia> rubyscan(pat, str) = (m.match for m in eachmatch(pat, str)) # wrapping in parentheses instead of brackets so it doesn't allocate
rubyscan (generic function with 1 method)

julia> pat = r"[+-]?\d+\.?\d*"
r"[+-]?\d+\.?\d*"

julia> txt = "aaa -1 bbb +2.2 ccc 123.456 ddd"
"aaa -1 bbb +2.2 ccc 123.456 ddd"

julia> parse.(Float64, rubyscan(pat, txt))
3-element Array{Float64,1}:
  -1.0
   2.2
 123.456

Yes, but it’s not shorter since you have to define the rubyscan method first :) (and thank you for the parentheses tip)

If one has to explain the code to somebody else, your three-line solution (post 4) is still the best one (with eachmatch replaced by findall, although findall with a regex requires Julia 1.3+).

numbers = map(findall(pat, txt)) do range
    parse(Float64, txt[range])
end
numbers = [parse(Float64, m.match) for m in eachmatch(pat, txt)]

?
