Fast check if a character "is" an integer

I need to check if the first character of a string is an integer, quickly. I’m doing a try-catch:

julia> function c(name)
           i0 = try 
                parse(Int, name[1])
                true
           catch
                false
           end
           return i0
       end
c (generic function with 1 method)

julia> @btime c("1ABC")
  20.250 ns (0 allocations: 0 bytes)
true

julia> @btime c("ABC")
  230.796 μs (9 allocations: 440 bytes)
false

but this is painfully slow when the try fails. Is there a better way?

1 Like

Can’t you do something like

Int('0') <= first(name) <= Int('9')

?

Not sure if Int conversion is necessary.

Edit: Okay, I’m messing up something. But compare numerical value of characters.

1 Like

isdigit(first("1ABC"))

10 Likes

Maybe you are looking for tryparse. Exactly like parse but returns nothing if not Int.

c(name) = isnothing(tryparse(Int, name[1:1]))

Thanks @rafael.guerra . That is quick.

isdigit is The Way. I just want to say that it does the same thing as my suggestion :grin:

3 Likes

In any case, this question raised an issue:

julia> parse(Int, '1')
1

while

julia> tryparse(Int, '1')
ERROR: MethodError: no method matching tryparse(::Type{Int64}, ::Char)
Closest candidates are:
  tryparse(::Type{T}, ::AbstractString; base) where T<:Integer at parse.jl:235
Stacktrace:
 [1] top-level scope
   @ REPL[221]:1

This should get a minor PR. Anyone else has thoughts?

2 Likes

Of course, problem is difference between parse and tryparse

Check this issue.

1 Like

Except it isn’t. Yours is slower, though both should be pretty fast…

DNF’s proposal seems to me as quick, dropping the Int part:
'0' <= first(str) <= '9'

I don’t have my computer, so I was just ‘waving in the right direction’.

This is definitely faster than the original method using try/catch with parse, and is clear.

However, if we’re playing the benchmarks game here and you want the fastest possible thing, I would avoid even the call to first(name), which requires extracting a whole Unicode character. Instead, at least for String, you could check only the first byte of the string, e.g.

# stupid optimization tricks:
c(name::Union{String,SubString{String}}) =
    UInt8('0') ≤ codeunit(name, 1) ≤ UInt8('9')

(And if this is in a context where you know for sure that the strings are non-empty, you could add @inbounds.)

7 Likes

I tried with @inbounds on an empty string and I still got BoundsError, so I think it might be safe either way.

Anyway I think we might be spending way too much time optimizing an O(1) function…

But I’m interested in this since I plan to make my own string type. For the current one, is has a pointer to the heap for the string. Can that object be relied to exist, and if empty, at least the first byte be zero (are strings zero terminated?).

The code is much longer that I would have liked, can throw, and I would like false instead for empty strings:

@code_native @inbounds codeunit(@view(""[1:1]), 1)
1 Like

Some good ideas, if I want to expand the question to check if a string is a positive number , what would I do? I can build on the above to get

julia> all(isdigit.(collect("ABC123")))
false

julia> a = "123"
"123"

julia> all(isdigit.(collect(a))) && parse(Int, a) >= 0
true


julia> a = "-123"
"-123"

julia> all(isdigit.(collect(a))) && parse(Int, a) >= 0
false

Is this the best way to go about it?
Or Steven’s way?

julia> a = "123"
"123"

julia> all(UInt8('0') .<= codeunit.(a, 1:length(a)) .<= UInt8('9'))
true

Firstly, the above is quite inefficient. It collects the entire string into a vector of Chars, then runs isdigit on every single character, creating another temporary array, and then, finally, runs all on the output of this.

Instead, avoid collecting anything (in fact, if your code contains collect you are probably wasting time):

all(isdigit, "ABC123")

This will check the first character, and immediately bail out with zero allocations. Compare performance here:

julia> @btime all(isdigit.(collect($str)))
  82.505 ns (3 allocations: 176 bytes)
false

julia> @btime all(isdigit, $str)
  8.500 ns (0 allocations: 0 bytes)
false

Key to good performance is to avoid unnecessary intermediate allocations, and in the first example, you have got three of them.

As for the parsing part, I don’t understand why you need that at all. If all characters in a string are digits, then it’s a positive integer, so you only need

ispositiveinteger(str) = all(isdigit, str)

julia> ispositiveinteger("12345")
true

julia> ispositiveinteger("-12345")
false

Or do you need to check for whitespace, underscores, etc?

In fact, parse(Int, str) might ruin things, because of potential overflow:

julia> parse(Int, "9223372036854775808")
ERROR: OverflowError: overflow parsing "9223372036854775808"

julia> ispositiveinteger("9223372036854775808")  # this still works
true
1 Like

The above will be slow, because of the needless allocations. The fastest I can come up with is, using the same codeunit trick is

all(x -> (0x30 <= x <= 0x39), codeunits(str))

This has no allocations (0x39 is the same as UInt8('0'), just shorter).

But I’m a bit wary of the whole codeunits thing, I never feel sure if there might be some way for some unicode value to trick the code. Does anyone know if this is completely safe? If it is safe, you can just use

ispositiveinteger(str) = all(x->(0x30 <= x <= 0x39), codeunits(str))

On a more general note:
Broadcasting (and collect) produces arrays (with some exceptions, like when using it on a tuple). If what you are trying to calculate is ultimately an array with one or more output elements per input element, then broadcasting is fine, just make sure to fuse the broadcasts (but don’t use collect, it is almost always bad.)

If, on the other hand, the value you want to calculate is a scalar, if it is a reduction over your input collection, then you should probably avoid broadcasting. Use sum(foo, vect), not sum(foo.(vect)); use all(foo, vect), not all(foo.(vect)), and so on, for minimum, maximum, any, count, etc. etc. And use loops where appropriate.

1 Like

A nice property of UTF-8 is that any codeunit with the top bit unset (i.e. a codeunit < 128) is always an ASCII character, regardless of what came before. Hence the above code to check if a string is entirely ASCII digits is perfectly correct.

1 Like

That gives true for the empty string, which I do not like, and actually gives me “7.92 k allocations” a lot more than:

codeunit.(a, 1:length(a)) which is wrong, since the “length” in codeunits can be longer, I think it needs to be codeunit.(a, eachindex(a[1:end])). Is there a better way to do this, and faster, both have allocations, this one more.

That’s not nice, but:

this I cannot reproduce:

julia> @btime all(x -> (0x30 <= x <= 0x39), codeunits(""))
  6.600 ns (0 allocations: 0 bytes)
true