Fast check if a character "is" an integer

lmiq · October 14, 2022, 7:20pm

I need to check if the first character of a string is an integer, quickly. I’m doing a try-catch:

julia> function c(name)
           i0 = try 
                parse(Int, name[1])
                true
           catch
                false
           end
           return i0
       end
c (generic function with 1 method)

julia> @btime c("1ABC")
  20.250 ns (0 allocations: 0 bytes)
true

julia> @btime c("ABC")
  230.796 μs (9 allocations: 440 bytes)
false

but this is painfully slow when the try fails. Is there a better way?

DNF · October 14, 2022, 7:25pm

Can’t you do something like

Int('0') <= first(name) <= Int('9')

?

Not sure if Int conversion is necessary.

Edit: Okay, I’m messing up something. But compare numerical value of characters.

rafael.guerra · October 14, 2022, 7:26pm

isdigit(first("1ABC"))

Dan · October 14, 2022, 7:27pm

Maybe you are looking for tryparse. Exactly like parse but returns nothing if not Int.

c(name) = isnothing(tryparse(Int, name[1:1]))

lmiq · October 14, 2022, 7:28pm

Thanks @rafael.guerra . That is quick.

DNF · October 14, 2022, 7:29pm

isdigit is The Way. I just want to say that it does the same thing as my suggestion

Dan · October 14, 2022, 7:36pm

In any case, this question raised an issue:

julia> parse(Int, '1')
1

while

julia> tryparse(Int, '1')
ERROR: MethodError: no method matching tryparse(::Type{Int64}, ::Char)
Closest candidates are:
  tryparse(::Type{T}, ::AbstractString; base) where T<:Integer at parse.jl:235
Stacktrace:
 [1] top-level scope
   @ REPL[221]:1

This should get a minor PR. Anyone else has thoughts?

Dan · October 14, 2022, 7:41pm

Of course, problem is difference between parse and tryparse

rafael.guerra · October 14, 2022, 7:43pm

Check this issue.

Palli · October 14, 2022, 7:49pm

Except it isn’t. Yours is slower, though both should be pretty fast…

rafael.guerra · October 14, 2022, 7:59pm

DNF’s proposal seems to me as quick, dropping the Int part:
'0' <= first(str) <= '9'

DNF · October 14, 2022, 8:01pm

I don’t have my computer, so I was just ‘waving in the right direction’.

stevengj · October 14, 2022, 11:32pm

This is definitely faster than the original method using try/catch with parse, and is clear.

However, if we’re playing the benchmarks game here and you want the fastest possible thing, I would avoid even the call to first(name), which requires extracting a whole Unicode character. Instead, at least for String, you could check only the first byte of the string, e.g.

# stupid optimization tricks:
c(name::Union{String,SubString{String}}) =
    UInt8('0') ≤ codeunit(name, 1) ≤ UInt8('9')

(And if this is in a context where you know for sure that the strings are non-empty, you could add @inbounds.)

Palli · October 17, 2022, 10:06pm

I tried with @inbounds on an empty string and I still got BoundsError, so I think it might be safe either way.

Anyway I think we might be spending way too much time optimizing an O(1) function…

But I’m interested in this since I plan to make my own string type. For the current one, is has a pointer to the heap for the string. Can that object be relied to exist, and if empty, at least the first byte be zero (are strings zero terminated?).

The code is much longer that I would have liked, can throw, and I would like false instead for empty strings:

@code_native @inbounds codeunit(@view(""[1:1]), 1)

Jake · October 17, 2022, 11:02pm

Some good ideas, if I want to expand the question to check if a string is a positive number , what would I do? I can build on the above to get

julia> all(isdigit.(collect("ABC123")))
false

julia> a = "123"
"123"

julia> all(isdigit.(collect(a))) && parse(Int, a) >= 0
true


julia> a = "-123"
"-123"

julia> all(isdigit.(collect(a))) && parse(Int, a) >= 0
false

Is this the best way to go about it?
Or Steven’s way?

julia> a = "123"
"123"

julia> all(UInt8('0') .<= codeunit.(a, 1:length(a)) .<= UInt8('9'))
true

DNF · October 18, 2022, 7:52am

Firstly, the above is quite inefficient. It collects the entire string into a vector of Chars, then runs isdigit on every single character, creating another temporary array, and then, finally, runs all on the output of this.

Instead, avoid collecting anything (in fact, if your code contains collect you are probably wasting time):

all(isdigit, "ABC123")

This will check the first character, and immediately bail out with zero allocations. Compare performance here:

julia> @btime all(isdigit.(collect($str)))
  82.505 ns (3 allocations: 176 bytes)
false

julia> @btime all(isdigit, $str)
  8.500 ns (0 allocations: 0 bytes)
false

Key to good performance is to avoid unnecessary intermediate allocations, and in the first example, you have got three of them.

As for the parsing part, I don’t understand why you need that at all. If all characters in a string are digits, then it’s a positive integer, so you only need

ispositiveinteger(str) = all(isdigit, str)

julia> ispositiveinteger("12345")
true

julia> ispositiveinteger("-12345")
false

Or do you need to check for whitespace, underscores, etc?

In fact, parse(Int, str) might ruin things, because of potential overflow:

julia> parse(Int, "9223372036854775808")
ERROR: OverflowError: overflow parsing "9223372036854775808"

julia> ispositiveinteger("9223372036854775808")  # this still works
true

DNF · October 18, 2022, 8:09am

The above will be slow, because of the needless allocations. The fastest I can come up with is, using the same codeunit trick is

all(x -> (0x30 <= x <= 0x39), codeunits(str))

This has no allocations (0x39 is the same as UInt8('0'), just shorter).

But I’m a bit wary of the whole codeunits thing, I never feel sure if there might be some way for some unicode value to trick the code. Does anyone know if this is completely safe? If it is safe, you can just use

ispositiveinteger(str) = all(x->(0x30 <= x <= 0x39), codeunits(str))

On a more general note:
Broadcasting (and collect) produces arrays (with some exceptions, like when using it on a tuple). If what you are trying to calculate is ultimately an array with one or more output elements per input element, then broadcasting is fine, just make sure to fuse the broadcasts (but don’t use collect, it is almost always bad.)

If, on the other hand, the value you want to calculate is a scalar, if it is a reduction over your input collection, then you should probably avoid broadcasting. Use sum(foo, vect), not sum(foo.(vect)); use all(foo, vect), not all(foo.(vect)), and so on, for minimum, maximum, any, count, etc. etc. And use loops where appropriate.

cjdoris · October 18, 2022, 10:29am

A nice property of UTF-8 is that any codeunit with the top bit unset (i.e. a codeunit < 128) is always an ASCII character, regardless of what came before. Hence the above code to check if a string is entirely ASCII digits is perfectly correct.

Palli · October 18, 2022, 11:01am

That gives true for the empty string, which I do not like, and ~~actually gives me “7.92 k allocations” a lot more than~~:

codeunit.(a, 1:length(a)) which is wrong, since the “length” in codeunits can be longer, I think it needs to be codeunit.(a, eachindex(a[1:end])). Is there a better way to do this, and faster, both have allocations, this one more.

DNF · October 18, 2022, 11:58am

That’s not nice, but:

this I cannot reproduce:

julia> @btime all(x -> (0x30 <= x <= 0x39), codeunits(""))
  6.600 ns (0 allocations: 0 bytes)
true

Topic		Replies	Views
Test if string begins with number New to Julia question	6	2240	May 4, 2019
Working with ASCII strings General Usage question	6	932	September 25, 2017
Converting Char type to Int type New to Julia	6	196	December 1, 2024
Performance of splitting string and parsing numbers Performance	29	877	December 29, 2022
How can i convert number string array to int array General Usage	4	5646	December 21, 2019

Fast check if a character "is" an integer

Related topics