Instead, it seems I must dig around in the internals of the match object, for example like this:
julia> [parse(Int, m.match) for m in eachmatch(r"\d+(?=S)", c)]
2-element Vector{Int64}:
9
11
This should be very efficient, but it’s not nice to have to dig into the internal match field of the object, and it always seemed to me like an outlier in the language.
I do wonder, though, why match and eachmatch don’t simply return the actual matches. There could be a separate captures/eachcapture, no? When I ask for the match, that’s what I want.
I’d like it to return an AbstractString, like SubString, which is what m.match is. I don’t like that I need to reach into a field to get access to the actual string match. It seems un-idiomatic.
You’re right, struct fields are usually “implementation details”. But in this case, note that the docs tell us that this is the intended way for users to access the matching substring:
Search for the first match of the regular expression r in s and return a RegexMatch object containing the match, or nothing if the match failed. The matching substring can be retrieved by accessing m.match and the captured sequences can be retrieved by accessing m.captures. The optional idx argument specifies an index at which to start the search.
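To make the docstring concrete, here is a minimal sketch of the two fields it mentions. The input string `c` is an assumed example (the thread never shows its value), and the capturing pattern `r"(\d+)S"` is introduced only for illustration.

```julia
# Assumed example input; the original value of `c` is not shown in the thread.
c = "9S8M11S"

# Pattern with one capture group (illustrative, not taken from the thread).
m = match(r"(\d+)S", c)

whole = m.match            # the full matching substring (a SubString)
first_cap = m.captures[1]  # the first captured group
same_cap = m[1]            # indexing the RegexMatch also returns the capture

# With a lookahead, as in the thread, `m.match` is already just the digits.
nums = [parse(Int, x.match) for x in eachmatch(r"\d+(?=S)", c)]
```

Both `m.match` and `m.captures` are documented fields, so reading them is supported usage rather than reaching into internals.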
Here is a possible alternative to regex.
It seems to be efficient from the example tested, but there might be a catch.
numberstrip1(c) = [parse(Int, m.match) for m in eachmatch(r"\d+(?=S)", c)]
function numberstrip2(c)
    v = split(c, 'S')[1:end-1]
    i = findlast.(!isdigit, v)
    return parse.(Int, [isnothing(j) ? u : u[j+1:end] for (j, u) in zip(i, v)])
end
c = repeat("9S8M13S1",1000)
@btime numberstrip1($c) # 503 μs (6008 allocs: 398 KiB)
@btime numberstrip2($c) # 232 μs ( 17 allocs: 288 KiB)
In fact, there are some inputs that numberstrip2 does not handle:
julia> ss ="S91S87M4SgS43SS"
"S91S87M4SgS43SS"
julia> numberstrip2(ss)
ERROR: ArgumentError: input string is empty or only contains whitespace
Stacktrace:
julia> digit_S(ss)
3-element Vector{Any}:
91
4
43
ss ="S91S87M4mSSgS4S3SMS"
digit_S(ss) == ns(ss)
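A likely reason for the ArgumentError shown by numberstrip2 is that `split` keeps empty pieces, and `parse(Int, "")` throws on them. The sketch below only inspects the intermediate values; the variable names are illustrative.

```julia
ss = "S91S87M4SgS43SS"

# `split` keeps empty pieces by default, so a leading 'S' and doubled 'S'
# both produce "" entries.
pieces = split(ss, 'S')

# `parse(Int, "")` throws an ArgumentError; `tryparse` returns `nothing` instead.
empty_result = tryparse(Int, "")
letter_result = tryparse(Int, "g")
```

Filtering out empty pieces and switching to `tryparse` are the two changes that avoid the error.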
This other scheme seems faster, though it still needs testing to see whether it works in “all” cases.
With more careful reasoning it can probably be simplified further and made faster still.
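function digit_S(s)
    prev_digit = false   # true while we are inside a run of digits
    res = Int[]
    v = 0                # value of the current digit run
    for c in s
        if isdigit(c)
            v = v*10 + codepoint(c) - 0x30
            prev_digit = true
        elseif c == 'S' && prev_digit
            # a digit run immediately followed by 'S': keep it
            push!(res, v)
            v = 0
            prev_digit = false
        else
            # any other character discards the current run
            v = 0
            prev_digit = false
        end
    end
    res
end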
julia> s ="91S87M4Sg43"
"91S87M4Sg43"
ns(s)=[parse(Int, m.match) for m in eachmatch(r"\d+(?=S)", s)]
julia> s10t = repeat(s,10^4)
...
julia> @btime digit_S($s10t);
261.600 μs (9 allocations: 326.55 KiB)
julia> @btime numberstrip2($s10t);
3.139 ms (20 allocations: 2.20 MiB)
julia> @btime ns($s10t);
5.430 ms (60010 allocations: 3.68 MiB)
julia> ns(s10t)==numberstrip2(s10t)==digit_S(s10t)
true
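The equality check above covers a single input. As a rough way to probe “all” cases, the sketch below compares the regex version with the scanner on random short ASCII strings. `agree_on_random` and its parameters are hypothetical names, and the length is kept small so the digit runs cannot overflow an `Int`.

```julia
# Regex reference from the thread.
ns(s) = [parse(Int, m.match) for m in eachmatch(r"\d+(?=S)", s)]

# Character scanner from the thread, repeated so this block runs on its own.
function digit_S(s)
    prev_digit = false
    res = Int[]
    v = 0
    for c in s
        if isdigit(c)
            v = v*10 + codepoint(c) - 0x30
            prev_digit = true
        elseif c == 'S' && prev_digit
            push!(res, v)
            v = 0
            prev_digit = false
        else
            v = 0
            prev_digit = false
        end
    end
    res
end

# Hypothetical helper: compare both functions on `n` random ASCII strings.
function agree_on_random(n; len = 12, alphabet = collect("0123456789SMg"))
    for _ in 1:n
        s = String(rand(alphabet, len))
        ns(s) == digit_S(s) || return false
    end
    return true
end
```

A passing run is evidence, not proof. Inputs with very long digit runs or non-ASCII digits are deliberately left out of this check.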
function digit_cuS(s)
    prev_digit = false
    res = Int[]
    v = 0
    cus = codeunits(s)   # iterate raw UTF-8 bytes instead of characters
    for c in cus
        if 0x30 <= c <= 0x39   # byte range of '0' to '9'
            v = v*10 + c - 0x30
            prev_digit = true
        elseif c == codepoint('S') && prev_digit
            push!(res, v)
            v = 0
            prev_digit = false
        else
            v = 0
            prev_digit = false
        end
    end
    res
end
julia> @btime digit_cuS($s10t);
195.300 μs (9 allocations: 326.55 KiB)
julia> @btime ns($s10t);
5.351 ms (60010 allocations: 3.68 MiB)
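Scanning code units is safe here because ASCII digits and 'S' are single bytes in UTF-8, and no byte of a multi-byte character falls in the ASCII range. A small sketch with an assumed example string:

```julia
s = "4Sé7S"   # assumed example containing one non-ASCII character

bytes = collect(codeunits(s))            # UInt8 values of the UTF-8 encoding
s_byte = UInt8('S')                      # byte value of 'S'
digit_bytes = (UInt8('0'), UInt8('9'))   # byte range used for the digit test

# Every byte of a multi-byte character is >= 0x80, so it cannot be
# mistaken for an ASCII digit or for 'S'.
non_ascii_ok = all(>=(0x80), codeunits("é"))
```

This skips the per-character UTF-8 decoding of the `for c in s` loop, which is one plausible reason for the timing difference shown above.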
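julia> function digit_cuS1(s)
           prev_digit = false
           # Preallocated for the benchmark input s10t (20000 matches);
           # an input with more matches would raise a BoundsError.
           res = Vector{Int}(undef, 20000)
           v = 0
           cus = codeunits(s)
           i = 1   # next free slot in res
           for c in cus
               if 0x30 <= c <= 0x39
                   v = v*10 + c - 0x30
                   prev_digit = true
               elseif c == codepoint('S') && prev_digit
                   res[i] = v   # instead of push!(res, v)
                   v = 0
                   prev_digit = false
                   i += 1
               else
                   v = 0
                   prev_digit = false
               end
           end
           res[1:i-1]
       end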
digit_cuS1 (generic function with 1 method)
julia> @btime digit_cuS1($s10t);
98.000 μs (4 allocations: 312.59 KiB)
parse.(Int,filter(!isempty,[ isnothing(findlast(!isdigit, e)) ? e : e[findlast(!isdigit, e)+1:end] for e in split(s10t,'S') ]))
But how can this be made to work for a delimiter of several characters, for example to get the number just before the word Freya?
(As written, it only works for a single letter…)
s ="91S87M4Sg43S18FreyaS18S"
function codebreaker(s)
    prev_digit = false
    res = Vector{Int}(undef, 20000)   # fixed capacity carried over from digit_cuS1
    v = 0
    cus = codeunits(s)
    i = 1
    for c in cus
        if 0x30 <= c <= 0x39
            v = v*10 + c - 0x30
            prev_digit = true
        elseif c == codepoint('S') && prev_digit   # only a single byte can be tested here
            res[i] = v   # instead of push!(res, v)
            v = 0
            prev_digit = false
            i += 1
        else
            v = 0
            prev_digit = false
        end
    end
    res[1:i-1]
end
# Type codebreaker(s)
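One simple (though, judging by the timings in this thread, probably slower) way to handle a whole word is to go back to the regex lookahead and build the pattern from the word. This is only a sketch: `digits_before` is a hypothetical helper, not something proposed in the thread.

```julia
# Hypothetical helper: integers immediately followed by the literal text `word`.
function digits_before(s, word)
    # \Q...\E makes the regex treat `word` literally
    # (assumes `word` itself does not contain "\E").
    re = Regex("\\d+(?=\\Q$(word)\\E)")
    return [parse(Int, m.match) for m in eachmatch(re, s)]
end

s = "91S87M4Sg43S18FreyaS18S"
before_word = digits_before(s, "Freya")
before_letter = digits_before(s, "S")
```

The same function covers the single-letter case, so it can serve as a reference when checking a faster hand-written scanner.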
@rocco_sprmnt21, you seem to have turned on the turbo on your Ferrari!
Thanks for spotting the limitation. I’ve done a little maintenance here on my Fiat Cinquecento, just trying to get to the destination in one piece:
numberstrip3()
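function numberstrip3(c, pattern)
    d = split(c, pattern)
    pop!(d)                   # drop the piece after the last delimiter
    v = filter(!isempty, d)   # empty pieces cannot hold a number
    ix = findlast.(!isdigit, v)
    s = Int[]
    for (i, u) in zip(ix, v)
        if isnothing(i)
            # the piece consists of digits only
            push!(s, parse(Int, u))
        else
            # keep only the trailing digits, if there are any
            x = tryparse(Int, u[i+1:end])
            !isnothing(x) && push!(s, x)
        end
    end
    return s
end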