Questions about string

zhangchunyong · October 12, 2022, 3:51am

I want to extract the numbers that come immediately before each S in the string c = "9S8M13S", so the result should be 9 and 13. I tried the following:

julia> c="9S8M13S"
"9S8M13S"
julia> fs=findall(r"\d*S",c)
2-element Vector{UnitRange{Int64}}:
 1:2
 5:7
julia> first.(fs)
2-element Vector{Int64}:
 1
 5
julia> last.(fs).-1
2-element Vector{Int64}:
 1
 6
julia> z=zip(first.(fs),last.(fs).-1)
zip([1, 5], [1, 6])

a=Int[]

for (i,j) in z
    push!(a,parse(Int,c[i:j]))
end

julia> a
2-element Vector{Int64}:
  9
 13

Is there a more direct way to do this? My approach feels a bit roundabout.

DNF · October 12, 2022, 4:55am

You can use lookaround operators:

julia> c = "9S45M11S"
"9S45M11S"

julia> eachmatch(r"\d+(?=S)", c)
Base.RegexMatchIterator(r"\d+(?=S)", "9S45M11S", false)

They actually match only the numbers:

julia> collect(eachmatch(r"\d+(?=S)", c))
2-element Vector{RegexMatch}:
 RegexMatch("9")
 RegexMatch("11")

Now, I would have expected this to work, but sadly, and confusingly, it doesn’t:

julia> parse.(Int, eachmatch(r"\d+(?=S)", c))
ERROR: MethodError: no method matching parse(::Type{Int64}, ::RegexMatch)

Instead, it seems I must dig around in the internals of the match object, for example like this:

julia> [parse(Int, m.match) for m in eachmatch(r"\d+(?=S)", c)]
2-element Vector{Int64}:
  9
 11

This should be very efficient, but it’s not nice to have to dig into the internal match field of the object, and it always seemed to me like an outlier in the language.
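
To make the effect of the lookahead concrete, here is a small sketch contrasting it with a plain pattern on the same string. The variable names `with_s` and `digits_only` are only illustrative.

```julia
c = "9S45M11S"

# Plain pattern: the trailing S is part of each match.
with_s = [m.match for m in eachmatch(r"\d+S", c)]

# Lookahead: S must follow, but it is not consumed,
# so only the digits end up in the match.
digits_only = [m.match for m in eachmatch(r"\d+(?=S)", c)]

parse.(Int, digits_only)   # [9, 11]
```

With the plain pattern, the S would have to be stripped again before calling parse.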

Jollywatt · October 12, 2022, 6:16am

You may use capture groups instead of lookarounds:

julia> [parse(Int, m[1]) for m in eachmatch(r"(\d+)S", c)]
2-element Vector{Int64}:
  9
 13

DNF · October 12, 2022, 6:57am

This then permits broadcasting:

julia> parse.(Int, first.(eachmatch(r"(\d+)S", c)))
2-element Vector{Int64}:
  9
 11

I do wonder, though, why match and eachmatch don’t simply return the actual matches. There could be a separate captures/eachcapture, no? When I ask for the match, that’s what I want.

zhangchunyong · October 12, 2022, 6:58am

OK. I will try that. Thanks for the response.

DNF · October 12, 2022, 7:03am

Using lookahead seems to be slightly faster, with fewer allocations, though:

julia> @btime [parse(Int, m[1]) for m in eachmatch(r"(\d+)S", $c)]
  788.506 ns (11 allocations: 656 bytes)
2-element Vector{Int64}:
  9
 11

julia> @btime [parse(Int, m.match) for m in eachmatch(r"\d+(?=S)", $c)]
  699.291 ns (9 allocations: 560 bytes)
2-element Vector{Int64}:
  9
 11

jar1 · October 12, 2022, 7:04am

What do you want it to return? Isn’t this a match?

julia> eachmatch(r"(\d)", "a1b2")|>first
RegexMatch("1", 1="1")

DNF · October 12, 2022, 7:06am

I’d like it to return an AbstractString, like SubString, which is what m.match is. I don’t like that I need to reach into a field to get access to the actual string match. It seems un-idiomatic.

Jollywatt · October 12, 2022, 7:38am

You’re right, usually struct fields are “implementation details”. But in this case, note that the docs tell us that this is the intended way for users to access the matching substring:

match(r::Regex, s::AbstractString[, idx::Integer[, addopts]])

Search for the first match of the regular expression r in s and return a RegexMatch object containing the match, or nothing if the match failed. The matching substring can be retrieved by accessing m.match and the captured sequences can be retrieved by accessing m.captures. The optional idx argument specifies an index at which to start the search.
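
As a rough illustration of that docstring (a sketch on the string from the original question, not an exhaustive tour of the API):

```julia
c = "9S8M13S"

m = match(r"(\d+)S", c)
m.match      # "9S"   (the whole matching substring)
m.captures   # ["9"]  (one entry per capture group)
m.offset     # 1      (index in c where the match starts)

# Optional idx argument: start searching at index 3, skipping the first "9S".
m2 = match(r"(\d+)S", c, 3)
m2.match     # "13S"

# A failed match returns `nothing`, so check before accessing fields.
m3 = match(r"(\d+)S", "8M")
isnothing(m3)   # true
```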

DNF · October 12, 2022, 7:39am

Yeah, I know. I just don’t like that it is different from what I think of as “idiomatic”.

And it makes broadcasting difficult.
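
For what it's worth, two possible workarounds, sketched under the assumption that collecting the matches first is acceptable: broadcast the field access with `getproperty`, or use `map` on the iterator instead of broadcasting.

```julia
c = "9S45M11S"
ms = collect(eachmatch(r"\d+(?=S)", c))

# Broadcast the field access, then broadcast parse over the substrings.
a = parse.(Int, getproperty.(ms, :match))

# Or skip broadcasting and map over the match iterator directly.
b = map(m -> parse(Int, m.match), eachmatch(r"\d+(?=S)", c))
```

Both give [9, 11] here. Neither removes the need to name the match field.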

rafael.guerra · October 12, 2022, 3:03pm

Here is a possible alternative to regex.
It seems to be efficient from the example tested, but there might be a catch.

numberstrip1(c) = [parse(Int, m.match) for m in eachmatch(r"\d+(?=S)", c)]

function numberstrip2(c)
    v = split(c, 'S')[1:end-1]
    i = findlast.(!isdigit, v)
    return parse.(Int, [isnothing(j) ? u : u[j+1:end] for (j,u) in zip(i,v)])
end

c = repeat("9S8M13S1",1000)

@btime numberstrip1($c)                     # 503 μs (6008 allocs: 398 KiB)
@btime numberstrip2($c)                     # 232 μs (  17 allocs: 288 KiB)

jar1 · January 13, 2023, 5:38am

In recent Julia versions, a RegexMatch can be indexed by capture:

julia> match(r"(\d)", "a5b6")[1]
"5"

This isn’t what you asked for: it seems m.match is still needed, but .captures isn’t.
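
A slightly larger sketch of the same idea, using named groups. The group names `count` and `op` are made up for this example.

```julia
m = match(r"(?<count>\d+)(?<op>[A-Z])", "9S8M13S")

m[1]         # "9"   (first capture group, by position)
m["count"]   # "9"   (same group, by name)
m["op"]      # "S"
m.match      # "9S"  (the whole match still comes from the field)
```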

rocco_sprmnt21 · January 13, 2023, 8:52am

In fact, there are some inputs that numberstrip2 does not handle.

julia> ss ="S91S87M4SgS43SS"
"S91S87M4SgS43SS"

julia> numberstrip2(ss)
ERROR: ArgumentError: input string is empty or only contains whitespace
Stacktrace:

julia> digit_S(ss)
3-element Vector{Any}:
 91
  4
 43

ss ="S91S87M4mSSgS4S3SMS"

digit_S(ss) == ns(ss)

Here is another scheme to test (to see whether it works in “all” cases); it seems faster.
With more careful reasoning it can probably be simplified further and made faster.
Edited


function digit_S(s)
    prev_digit=false
    res=Int[]
    v=0
    for c in s
        if isdigit(c)
            v=v*10+codepoint(c)-0x30
            prev_digit=true
        elseif c=='S' && prev_digit
            push!(res,v)
            v=0
            prev_digit=false
        else
            v=0
            prev_digit=false
        end
    end
    res
end

julia> s ="91S87M4Sg43"
"91S87M4Sg43"


ns(s)=[parse(Int, m.match) for m in eachmatch(r"\d+(?=S)", s)]


julia> s10t = repeat(s,10^4)
...
julia> @btime digit_S($s10t);
  261.600 μs (9 allocations: 326.55 KiB)

julia> @btime numberstrip2($s10t);
  3.139 ms (20 allocations: 2.20 MiB)

julia> @btime ns($s10t);
  5.430 ms (60010 allocations: 3.68 MiB)

julia> ns(s10t)==numberstrip2(s10t)==digit_S(s10t)
true
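
The equality check above covers a single repeated string. A randomized comparison against the regex version might look like the sketch below. The helper `agree_on_random`, its alphabet, and the string length are assumptions for illustration; the strings are kept short so that no digit run can overflow Int, and `digit_S` is restated in condensed form so the block runs on its own.

```julia
using Random

# Regex reference from the thread.
ns(s) = [parse(Int, m.match) for m in eachmatch(r"\d+(?=S)", s)]

# Character scanner from the thread, condensed.
function digit_S(s)
    res = Int[]
    v = 0
    prev_digit = false
    for c in s
        if isdigit(c)
            v = 10v + (c - '0')
            prev_digit = true
        else
            # Emit the accumulated number only when S directly follows digits.
            c == 'S' && prev_digit && push!(res, v)
            v = 0
            prev_digit = false
        end
    end
    return res
end

# Hypothetical helper: compare both versions on short random ASCII strings.
# Returns the first string on which they disagree, or nothing.
function agree_on_random(n; rng = MersenneTwister(1),
                         alphabet = collect("0123456789SSMg"), len = 12)
    for _ in 1:n
        s = String(rand(rng, alphabet, len))
        ns(s) == digit_S(s) || return s
    end
    return nothing
end

agree_on_random(1000)
```

This only exercises short ASCII inputs; long digit runs and non-ASCII text would need separate checks.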


function digit_cuS(s)
    prev_digit=false
    res=Int[]
    v=0
    cus=codeunits(s)
    for c in cus
        if 0x30 <= c <= 0x39
            v=v*10+c-0x30
            prev_digit=true
        elseif c==codepoint('S') && prev_digit
            push!(res,v)
            v=0
            prev_digit=false
        else
            v=0
            prev_digit=false
        end
    end
    res
end

julia> @btime digit_cuS($s10t);
  195.300 μs (9 allocations: 326.55 KiB)
julia> @btime ns($s10t);
  5.351 ms (60010 allocations: 3.68 MiB)

julia> function digit_cuS1(s)
           prev_digit=false
           res=Vector{Int}(undef,20000)
           v=0
           cus=codeunits(s)
           i=1
           for c in cus
               if 0x30 <= c <= 0x39
                   v=v*10+c-0x30
                   prev_digit=true
               elseif c==codepoint('S') && prev_digit
                   res[i]=v #push!(res,v)
                   v=0
                   prev_digit=false
                   i+=1
               else
                   v=0
                   prev_digit=false
               end
           end
           res[1:i-1]
       end
digit_cuS1 (generic function with 1 method)

julia> @btime digit_cuS1($s10t);
  98.000 μs (4 allocations: 312.59 KiB)

parse.(Int,filter(!isempty,[ isnothing(findlast(!isdigit, e)) ? e : e[findlast(!isdigit, e)+1:end] for e in split(s10t,'S') ]))

Freya_the_Goddess · January 13, 2023, 1:42pm

I tried your function and it works well!

But how can I make it work with multiple characters, for example to get the numbers before the word Freya?

(It only works for a single letter…)

s ="91S87M4Sg43S18FreyaS18S"

function codebreaker(s)
    prev_digit=false
    res=Vector{Int}(undef,20000)
    v=0
    cus=codeunits(s)
    i=1
    for c in cus
        if 0x30 <= c <= 0x39
            v=v*10+c-0x30
            prev_digit=true
        elseif c==codepoint('S') && prev_digit
            res[i]=v #push!(res,v)
            v=0
            prev_digit=false
            i+=1
        else
            v=0
            prev_digit=false
        end
    end
    res[1:i-1]
end

# Type codebreaker(s)

Freya_the_Goddess · January 13, 2023, 1:46pm

Great question! Keep asking questions like this; we all learn a lot from each other.

rocco_sprmnt21 · January 13, 2023, 3:21pm

function codebreaker(s, pattern)
    p=codeunits(pattern)
    cus=codeunits(s)
    prev_digit=false
    res=Vector{Int}(undef,count(==(p[1]),cus))
    v=0
    i=1
    for (j,c) in enumerate(cus[1:end-size(p,1)+1])
        if 0x30 <= c <= 0x39
            v=v*10+c-0x30
            prev_digit=true
        elseif cus[j:j+size(p,1)-1]==p && prev_digit
            res[i]=v #push!(res,v)
            v=0
            prev_digit=false
            i+=1
        else
            v=0
            prev_digit=false
        end
    end
    res[1:i-1]
end


julia> s ="91S87FreyaM97878Freyauy78Freya5Freya999Frey"   
"91S87FreyaM97878Freyauy78Freya5Freya999Frey"

julia> codebreaker(s,"Freya")
4-element Vector{Int64}:
    87
 97878
    78
     5
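
For comparison, the regex approach from earlier in the thread can also be pointed at a whole word. This is only a sketch: `digits_before` is a made-up name, and it assumes the pattern does not itself contain `\E`, since `\Q...\E` is what quotes it literally for PCRE.

```julia
# Hypothetical helper: numbers immediately followed by `pattern`.
# \Q...\E makes the regex engine treat the interpolated text literally.
digits_before(s, pattern) =
    [parse(Int, m.match) for m in eachmatch(Regex("\\d+(?=\\Q$(pattern)\\E)"), s)]

s = "91S87FreyaM97878Freyauy78Freya5Freya999Frey"
digits_before(s, "Freya")        # [87, 97878, 78, 5]
digits_before("9S8M13S", "S")    # [9, 13]
```

It is likely slower than the code-unit loop, but it is short.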

rafael.guerra · January 13, 2023, 8:36pm

@rocco_sprmnt21, you seem to have turned on the turbo on your Ferrari!

Thanks for spotting the limitation. I’ve done a little maintenance here on my Fiat Cinquecento, just trying to get to the destination in one piece:

numberstrip3()

function numberstrip3(c, pattern)
    d = split(c, pattern)
    pop!(d)
    v = filter(!isempty, d)
    ix = findlast.(!isdigit, v)
    s = Int[]
    for (i,u) in zip(ix,v)
        if isnothing(i)
            push!(s, parse(Int,u))
        else
            x = tryparse(Int, u[nextind(u, i):end])  # nextind stays valid if u[i] is a multibyte character
            !isnothing(x) && push!(s, x)
        end
    end
    return s
end

rocco_sprmnt21 · January 13, 2023, 9:25pm

Paradoxically, generalizing the algorithm from a single character to a string makes it more efficient:

function codebreaker1(s, pattern)
    p=codeunits(pattern)
    cus=codeunits(s)
    is_prev_digit=false
    res=Vector{Int}(undef,count(==(p[1]),cus))
    v=0
    i=j=1
    while j <=size(cus,1)-size(p,1)+1
        if 0x30 <= cus[j] <= 0x39
            v=v*10+cus[j]-0x30
            is_prev_digit=true
        elseif cus[j:j+size(p,1)-1]==p && is_prev_digit       
            res[i]=v 
            v=0
            is_prev_digit=false
            i+=1
            j+=size(p,1)-1
        else
            v=0
            is_prev_digit=false
        end
        j+=1
    end
    res[1:i-1]
end

rocco_sprmnt21 · January 13, 2023, 10:14pm

Try this:
bads="565Freyaiu87Frϵyα98wey78FreyauuFreya"

rafael.guerra · January 13, 2023, 10:22pm

The Cinquecento spits out:

numberstrip3(bads,"Freya")
2-element Vector{Int64}:
 565
  78

the same result as the Ferrari, but at its own pace.
