Questions about string

I want to extract the numbers that come directly before each S in the string c = "9S8M13S", so the result should be 9 and 13. I tried the following.

julia> c="9S8M13S"
"9S8M13S"
julia> fs=findall(r"\d*S",c)
2-element Vector{UnitRange{Int64}}:
 1:2
 5:7
julia> first.(fs)
2-element Vector{Int64}:
 1
 5
julia> last.(fs).-1
2-element Vector{Int64}:
 1
 6
julia> z=zip(first.(fs),last.(fs).-1)
zip([1, 5], [1, 6])

a=Int[]

for (i,j) in z
    push!(a,parse(Int,c[i:j]))
end

julia> a
2-element Vector{Int64}:
  9
 13

Is there a more direct way to do it? This method seems a bit indirect to me.

You can use lookaround operators

julia> c = "9S45M11S"
"9S45M11S"

julia> eachmatch(r"\d+(?=S)", c)
Base.RegexMatchIterator(r"\d+(?=S)", "9S45M11S", false)

They actually match only the numbers:

julia> collect(eachmatch(r"\d+(?=S)", c))
2-element Vector{RegexMatch}:
 RegexMatch("9")
 RegexMatch("11")

Now, I would have expected this to work, but sadly, and confusingly, it doesn’t:

julia> parse.(Int, eachmatch(r"\d+(?=S)", c))
ERROR: MethodError: no method matching parse(::Type{Int64}, ::RegexMatch)

Instead, it seems I must dig around in the internals of the match object, for example like this:

julia> [parse(Int, m.match) for m in eachmatch(r"\d+(?=S)", c)]
2-element Vector{Int64}:
  9
 11

This should be very efficient, but it’s not nice to have to dig into the internal match field of the object, and it always seemed to me like an outlier in the language.

You may use capture groups instead of lookarounds:

julia> [parse(Int, m[1]) for m in eachmatch(r"(\d+)S", c)]
2-element Vector{Int64}:
  9
 13

This then permits broadcasting:

julia> parse.(Int, first.(eachmatch(r"(\d+)S", c)))
2-element Vector{Int64}:
  9
 11

I do wonder, though, why match and eachmatch don’t simply return the actual matches. There could be a separate captures/eachcapture, no? When I ask for the match, that’s what I want.

OK, I will try that. Thanks for the response.

Using lookahead seems to be slightly faster, with fewer allocations, though:

julia> @btime [parse(Int, m[1]) for m in eachmatch(r"(\d+)S", $c)]
  788.506 ns (11 allocations: 656 bytes)
2-element Vector{Int64}:
  9
 11

julia> @btime [parse(Int, m.match) for m in eachmatch(r"\d+(?=S)", $c)]
  699.291 ns (9 allocations: 560 bytes)
2-element Vector{Int64}:
  9
 11

What do you want it to return? Isn’t this a match?

julia> eachmatch(r"(\d)", "a1b2")|>first
RegexMatch("1", 1="1")

I’d like it to return an AbstractString, like SubString, which is what m.match is. I don’t like that I need to reach into a field to get access to the actual string match. It seems un-idiomatic.

You’re right, usually struct fields are “implementation details”. But in this case, note that the docs tell us that this is the intended way for users to access the matching substring:

match(r::Regex, s::AbstractString[, idx::Integer[, addopts]])

Search for the first match of the regular expression r in s and return a RegexMatch object containing the match, or nothing if the match failed. The matching substring can be retrieved by accessing m.match and the captured sequences can be retrieved by accessing m.captures. The optional idx argument specifies an index at which to start the search.
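Since the docstring says that match returns nothing on failure, here is a minimal sketch of handling both outcomes. The helper name first_number_before_S is made up for illustration and is not part of Base.

```julia
# Hypothetical helper (not in Base): return the first number that is
# directly followed by 'S', or `nothing` if there is no such number.
function first_number_before_S(s::AbstractString)
    m = match(r"\d+(?=S)", s)
    m === nothing && return nothing   # match failed
    return parse(Int, m.match)        # m.match is a SubString of s
end

first_number_before_S("9S8M13S")   # 9
first_number_before_S("8M")        # nothing
```

parse accepts the SubString directly, so no conversion to String is needed.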

Yeah, I know. I just don’t like that it is different from what I think of as “idiomatic”.

And it makes broadcasting difficult.
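One possible workaround for the broadcasting complaint, as a rough sketch: collect the matched substrings first and broadcast over those. The name matchstrings is hypothetical; nothing like it exists in Base as far as I know.

```julia
# Hypothetical helper (not in Base): collect only the matched substrings.
matchstrings(r::Regex, s::AbstractString) = [m.match for m in eachmatch(r, s)]

c = "9S45M11S"

# Broadcasting now works, because the elements are plain substrings.
parse.(Int, matchstrings(r"\d+(?=S)", c))                # [9, 11]

# map with an anonymous function also avoids the intermediate vector of strings.
map(m -> parse(Int, m.match), eachmatch(r"\d+(?=S)", c)) # [9, 11]
```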

Here is a possible alternative to regex.
It seems to be efficient from the example tested, but there might be a catch.

numberstrip1(c) = [parse(Int, m.match) for m in eachmatch(r"\d+(?=S)", c)]

function numberstrip2(c)
    v = split(c, 'S')[1:end-1]
    i = findlast.(!isdigit, v)
    return parse.(Int, [isnothing(j) ? u : u[j+1:end] for (j,u) in zip(i,v)])
end

c = repeat("9S8M13S1",1000)

@btime numberstrip1($c)                     # 503 μs (6008 allocs: 398 KiB)
@btime numberstrip2($c)                     # 232 μs (  17 allocs: 288 KiB)

In recent Julia versions, a RegexMatch can be indexed directly by capture:

julia> match(r"(\d)", "a5b6")[1]
"5"

This isn’t quite what you asked for: m.match is still needed to get the whole match, but .captures no longer is.
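To make the difference between the whole match and a capture concrete, here is a small sketch. The group name n is just an illustration; indexing by group name is optional.

```julia
m = match(r"(?<n>\d+)S", "9S8M13S")

m.match            # "9S"  (the whole match, including the S)
m[1]               # "9"   (first capture, by position)
m[:n]              # "9"   (same capture, by name)
parse(Int, m[:n])  # 9
```

With a capture group the S is consumed by the match, whereas the lookahead version keeps it out of m.match.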

In fact, there are some potentially unhandled cases.

julia> ss ="S91S87M4SgS43SS"
"S91S87M4SgS43SS"

julia> numberstrip2(ss)
ERROR: ArgumentError: input string is empty or only contains whitespace
Stacktrace:

julia> digit_S(ss)
3-element Vector{Any}:
 91
  4
 43

ss ="S91S87M4mSSgS4S3SMS"

digit_S(ss) == ns(ss)
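The ArgumentError from numberstrip2 above appears to come from the empty pieces that split produces (a leading S, or two S in a row), since parse(Int, "") throws. A minimal sketch of a more forgiving variant follows, using tryparse. The helper trailing_int is a hypothetical name, not something from the thread.

```julia
ss = "S91S87M4SgS43SS"

split(ss, 'S')   # ["", "91", "87M4", "g", "43", "", ""]

# Hypothetical helper: parse the run of digits at the end of `u`,
# returning `nothing` when there is none (tryparse(Int, "") === nothing).
function trailing_int(u::AbstractString)
    j = findlast(!isdigit, u)
    tail = j === nothing ? u : u[nextind(u, j):end]
    return tryparse(Int, tail)
end

numbers = Int[]
for u in split(ss, 'S')[1:end-1]   # the last piece is not followed by an S
    x = trailing_int(u)
    x === nothing || push!(numbers, x)
end
numbers   # [91, 4, 43]
```

nextind is used instead of j + 1 so that the index stays valid when the last non-digit character is multibyte.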

Here is another scheme, still to be tested to see whether it works in “all” cases, which seems faster.
With more careful reasoning it can probably be simplified further and made faster still.


function digit_S(s)
    prev_digit = false
    res = Int[]
    v = 0
    for c in s
        if isdigit(c)
            v = v*10 + codepoint(c) - 0x30
            prev_digit = true
        elseif c == 'S' && prev_digit
            push!(res, v)
            v = 0
            prev_digit = false
        else
            v = 0
            prev_digit = false
        end
    end
    res
end

julia> s ="91S87M4Sg43"
"91S87M4Sg43"


ns(s)=[parse(Int, m.match) for m in eachmatch(r"\d+(?=S)", s)]


julia> s10t = repeat(s,10^4)
...
julia> @btime digit_S($s10t);
  261.600 μs (9 allocations: 326.55 KiB)

julia> @btime numberstrip2($s10t);
  3.139 ms (20 allocations: 2.20 MiB)

julia> @btime ns($s10t);
  5.430 ms (60010 allocations: 3.68 MiB)

julia> ns(s10t)==numberstrip2(s10t)==digit_S(s10t)
true



function digit_cuS(s)
    prev_digit = false
    res = Int[]
    v = 0
    cus = codeunits(s)
    for c in cus
        if 0x30 <= c <= 0x39
            v = v*10 + c - 0x30
            prev_digit = true
        elseif c == codepoint('S') && prev_digit
            push!(res, v)
            v = 0
            prev_digit = false
        else
            v = 0
            prev_digit = false
        end
    end
    res
end
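A short illustration of why scanning code units instead of characters should be safe here, assuming the input is a UTF-8 String: the ASCII digits and 'S' are single bytes, and every byte of a multibyte character is 0x80 or above, so it can never be mistaken for one of them.

```julia
s = "9Sα"

collect(s)              # ['9', 'S', 'α']  (3 characters)
collect(codeunits(s))   # UInt8[0x39, 0x53, 0xce, 0xb1]  (4 bytes)

UInt8('9') == 0x39      # true, inside the 0x30:0x39 digit range
UInt8('S') == 0x53      # true
all(>=(0x80), codeunits("α"))   # true: no byte of α looks like ASCII
```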

julia> @btime digit_cuS($s10t);
  195.300 μs (9 allocations: 326.55 KiB)
julia> @btime ns($s10t);
  5.351 ms (60010 allocations: 3.68 MiB)
julia> function digit_cuS1(s)
           prev_digit=false
           res=Vector{Int}(undef,20000)
           v=0
           cus=codeunits(s)
           i=1
       for c in cus
           if 0x30 <= c <= 0x39
               v=v*10+c-0x30
               prev_digit=true
           elseif c==codepoint('S')&& prev_digit       
               res[i]=v #push!(res,v)
               v=0
               prev_digit=false
               i+=1
           else
               v=0
               prev_digit=false
           end
       end
       res[1:i-1]
       end
digit_cuS1 (generic function with 1 method)

julia> @btime digit_cuS1($s10t);
  98.000 μs (4 allocations: 312.59 KiB)
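Note that digit_cuS1 preallocates a fixed 20000 slots, which only fits this particular benchmark input. Below is a rough sketch of one way to size the buffer from the input instead, by counting the 'S' bytes as an upper bound. The name digit_cuS2 is hypothetical, and I have not benchmarked it.

```julia
# Hypothetical variant of digit_cuS1: the number of 'S' bytes is an upper
# bound on the number of results, so it can be used to size the buffer.
function digit_cuS2(s)
    cus = codeunits(s)
    target = UInt8('S')
    res = Vector{Int}(undef, count(==(target), cus))
    v = 0
    prev_digit = false
    n = 0                              # number of results stored so far
    for c in cus
        if 0x30 <= c <= 0x39           # ASCII digit
            v = v * 10 + (c - 0x30)
            prev_digit = true
        else
            if c == target && prev_digit
                n += 1
                res[n] = v
            end
            v = 0
            prev_digit = false
        end
    end
    return resize!(res, n)             # shrink in place to the used part
end

digit_cuS2("S91S87M4SgS43SS")   # [91, 4, 43]
```

The extra pass for count costs some time, so whether this beats push! would need to be measured.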
parse.(Int,filter(!isempty,[ isnothing(findlast(!isdigit, e)) ? e : e[findlast(!isdigit, e)+1:end] for e in split(s10t,'S') ]))

I tried your function and it works well!

But how can I make it work for a multi-character pattern, for example to get the numbers before the word Freya?

(It currently works for only one letter…)

s ="91S87M4Sg43S18FreyaS18S"

function codebreaker(s)
           prev_digit=false
           res=Vector{Int}(undef,20000)
           v=0
           cus=codeunits(s)
           i=1
       for c in cus
           if 0x30 <= c <= 0x39
               v=v*10+c-0x30
               prev_digit=true
           elseif c==codepoint('S')&& prev_digit       
               res[i]=v #push!(res,v)
               v=0
               prev_digit=false
               i+=1
           else
               v=0
               prev_digit=false
           end
       end
       res[1:i-1]
end

# Type codebreaker(s)

Great question! Keep asking questions like this; we all learn a lot from each other.

function codebreaker(s, pattern)
    p = codeunits(pattern)
    cus = codeunits(s)
    prev_digit = false
    res = Vector{Int}(undef, count(==(p[1]), cus))
    v = 0
    i = 1
    for (j, c) in enumerate(cus[1:end-size(p,1)+1])
        if 0x30 <= c <= 0x39
            v = v*10 + c - 0x30
            prev_digit = true
        elseif cus[j:j+size(p,1)-1] == p && prev_digit
            res[i] = v #push!(res,v)
            v = 0
            prev_digit = false
            i += 1
        else
            v = 0
            prev_digit = false
        end
    end
    res[1:i-1]
end


julia> s ="91S87FreyaM97878Freyauy78Freya5Freya999Frey"   
"91S87FreyaM97878Freyauy78Freya5Freya999Frey"

julia> codebreaker(s,"Freya")
4-element Vector{Int64}:
    87
 97878
    78
     5
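For cross-checking the hand-written scanner, a regex version for an arbitrary literal pattern could look roughly like this. It is only a sketch: numbers_before is a hypothetical name, and it assumes the pattern does not itself contain the sequence \E, since the pattern is quoted with PCRE's \Q...\E.

```julia
# Hypothetical helper: numbers directly followed by the literal `pattern`.
# \Q...\E makes the regex engine treat the pattern text literally
# (assumption: `pattern` does not contain "\E").
# [0-9] is used rather than \d to match only ASCII digits.
function numbers_before(s::AbstractString, pattern::AbstractString)
    re = Regex("[0-9]+(?=\\Q" * pattern * "\\E)")
    return [parse(Int, m.match) for m in eachmatch(re, s)]
end

s = "91S87FreyaM97878Freyauy78Freya5Freya999Frey"
numbers_before(s, "Freya")   # [87, 97878, 78, 5]
```

It will likely be slower than the byte scanner, but it is short enough to serve as a reference when testing edge cases.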

@rocco_sprmnt21, you seem to have turned on the turbo on your Ferrari!

Thanks for spotting the limitation. I’ve done a little maintenance here on my Fiat Cinquecento, just trying to get to the destination in one piece:

function numberstrip3(c, pattern)
    d = split(c, pattern)
    pop!(d)
    v = filter(!isempty, d)
    ix = findlast.(!isdigit, v)
    s = Int[]
    for (i,u) in zip(ix,v)
        if isnothing(i)
            push!(s, parse(Int,u))
        else
            # nextind keeps the index valid if the last non-digit is multibyte
            x = tryparse(Int, u[nextind(u, i):end])
            !isnothing(x) && push!(s, x)
        end
    end
    return s
end

Paradoxically, generalizing the algorithm from a single character to a string makes it more efficient:

function codebreaker1(s, pattern)
    p=codeunits(pattern)
    cus=codeunits(s)
    is_prev_digit=false
    res=Vector{Int}(undef,count(==(p[1]),cus))
    v=0
    i=j=1
    while j <=size(cus,1)-size(p,1)+1
        if 0x30 <= cus[j] <= 0x39
            v=v*10+cus[j]-0x30
            is_prev_digit=true
        elseif cus[j:j+size(p,1)-1]==p && is_prev_digit       
            res[i]=v 
            v=0
            is_prev_digit=false
            i+=1
            j+=size(p,1)-1
        else
            v=0
            is_prev_digit=false
        end
        j+=1
    end
    res[1:i-1]
end

Try this:
bads="565Freyaiu87Frϵyα98wey78FreyauuFreya"

The Cinquecento spits out:

numberstrip3(bads,"Freya")
2-element Vector{Int64}:
 565
  78

the same result as the Ferrari, but at its own pace.
