Questions about string

I want to extract the number in front of each S in the string c = "9S8M13S", so that the result is 9 and 13. I tried the following:

julia> c="9S8M13S"
"9S8M13S"
julia> fs=findall(r"\d*S",c)
2-element Vector{UnitRange{Int64}}:
 1:2
 5:7
julia> first.(fs)
2-element Vector{Int64}:
 1
 5
julia> last.(fs).-1
2-element Vector{Int64}:
 1
 6
julia> z=zip(first.(fs),last.(fs).-1)
zip([1, 5], [1, 6])

a=Int[]

for (i,j) in z
    push!(a,parse(Int,c[i:j]))
end

julia> a
2-element Vector{Int64}:
  9
 13

Is there a more direct way to do this? My approach feels a bit indirect.

You can use lookaround operators

julia> c = "9S45M11S"
"9S45M11S"

julia> eachmatch(r"\d+(?=S)", c)
Base.RegexMatchIterator(r"\d+(?=S)", "9S45M11S", false)

They actually match only the numbers:

julia> collect(eachmatch(r"\d+(?=S)", c))
2-element Vector{RegexMatch}:
 RegexMatch("9")
 RegexMatch("11")

Now, I would have expected this to work, but sadly, and confusingly, it doesn’t:

julia> parse.(Int, eachmatch(r"\d+(?=S)", c))
ERROR: MethodError: no method matching parse(::Type{Int64}, ::RegexMatch)

Instead, it seems I must dig around in the internals of the match object, for example like this:

julia> [parse(Int, m.match) for m in eachmatch(r"\d+(?=S)", c)]
2-element Vector{Int64}:
  9
 11

This should be very efficient, but it’s not nice to have to dig into the internal match field of the object, and it always seemed to me like an outlier in the language.


You may use capture groups instead of lookarounds:

julia> [parse(Int, m[1]) for m in eachmatch(r"(\d+)S", c)]
2-element Vector{Int64}:
  9
 13

This then permits broadcasting:

julia> parse.(Int, first.(eachmatch(r"(\d+)S", c)))
2-element Vector{Int64}:
  9
 11

I do wonder, though, why match and eachmatch don’t simply return the actual matches. There could be a separate captures/eachcapture, no? When I ask for the match, that’s what I want.

OK, I will try that. Thanks for the response.

Using lookahead seems to be slightly faster, with fewer allocations, though:

julia> @btime [parse(Int, m[1]) for m in eachmatch(r"(\d+)S", $c)]
  788.506 ns (11 allocations: 656 bytes)
2-element Vector{Int64}:
  9
 11

julia> @btime [parse(Int, m.match) for m in eachmatch(r"\d+(?=S)", $c)]
  699.291 ns (9 allocations: 560 bytes)
2-element Vector{Int64}:
  9
 11

What do you want it to return? Isn’t this a match?

julia> eachmatch(r"(\d)", "a1b2")|>first
RegexMatch("1", 1="1")

I’d like it to return an AbstractString, like SubString, which is what m.match is. I don’t like that I need to reach into a field to get access to the actual string match. It seems un-idiomatic.


You’re right, struct fields are usually “implementation details”. But in this case, note that the docs tell us that this is the intended way for users to access the matching substring.

match(r::Regex, s::AbstractString[, idx::Integer[, addopts]])

Search for the first match of the regular expression r in s and return a RegexMatch object containing the match, or nothing if the match failed. The matching substring can be retrieved by accessing m.match and the captured sequences can be retrieved by accessing m.captures. The optional idx argument specifies an index at which to start the search.
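
To make the docstring concrete, here is a small sketch of what it describes, using the string from the original question. The variable names m, m2 and m3 are only for illustration.

```julia
s = "9S8M13S"

# First match of the pattern: the whole match includes the trailing S,
# while the capture group holds only the digits.
m = match(r"(\d+)S", s)
m.match        # "9S"
m.captures     # one captured sequence: "9"

# The optional idx argument starts the search at that index,
# so the first match found from index 3 onwards is "13S".
m2 = match(r"(\d+)S", s, 3)
m2.match       # "13S"
m2.captures[1] # "13"

# When nothing matches, match returns nothing instead of a RegexMatch.
m3 = match(r"(\d+)X", s)
```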

Yeah, I know. I just don’t like that it is different from what I think of as “idiomatic”.

And it makes broadcasting difficult.
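
For what it's worth, broadcasting is still possible if you are willing to broadcast getproperty over the matches. This is only a sketch, and it collects the matches explicitly first, which may not be what you want for very long strings.

```julia
c = "9S45M11S"

# Collect the RegexMatch objects, then read the documented `match` field of each.
ms = collect(eachmatch(r"\d+(?=S)", c))
strs = getproperty.(ms, :match)   # vector of substrings: "9" and "11"
nums = parse.(Int, strs)          # [9, 11]
```

It still reaches into the field, of course, just in broadcast form rather than in a comprehension.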

Here is a possible alternative to regex.
It seems to be efficient from the example tested, but there might be a catch.

numberstrip1(c) = [parse(Int, m.match) for m in eachmatch(r"\d+(?=S)", c)]

function numberstrip2(c)
    v = split(c, 'S')[1:end-1]
    i = findlast.(!isdigit, v)
    return parse.(Int, [isnothing(j) ? u : u[j+1:end] for (j,u) in zip(i,v)])
end

c = repeat("9S8M13S1",1000)

@btime numberstrip1($c)                     # 503 μs (6008 allocs: 398 KiB)
@btime numberstrip2($c)                     # 232 μs (  17 allocs: 288 KiB)
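
One catch I can see, assuming the input may contain an S that is not directly preceded by digits (for example "9SXS"): numberstrip2 then ends up calling parse on an empty string and throws, whereas the regex version simply skips that S. Below is a sketch of a more forgiving variant. The name numberstrip3 is made up here, and I have not benchmarked it.

```julia
# Hypothetical variant of numberstrip2 that skips pieces without trailing digits.
function numberstrip3(c)
    out = Int[]
    pieces = split(c, 'S')
    for u in pieces[1:end-1]            # the last piece is not followed by an S
        j = findlast(!isdigit, u)       # index of the last non-digit, or nothing
        s = isnothing(j) ? u : u[nextind(u, j):end]
        n = tryparse(Int, s)            # nothing for an empty string
        isnothing(n) || push!(out, n)
    end
    return out
end

numberstrip3("9S8M13S")   # [9, 13]
numberstrip3("9SXS")      # [9], same as the regex version
```

Using nextind instead of j+1 also keeps the indexing valid when the character before the digits is not ASCII.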