Gotta love how this never ever fails
Yes, I wanted to try out StaticCompiler anyway, so here is how to make a statically compiled version of Elrod’s fastest suggestion:
using StaticCompiler
using StaticTools

function ispangram(argc::Int, argv::Ptr{Ptr{UInt8}})
    if argc != 2
        println(c"You should provide exactly one argument")
        return Cint(-1)
    end
    a, A, z, Z = UInt8('a'), UInt8('A'), UInt8('z'), UInt8('Z')
    appears::UInt32 = ~((one(UInt32)<<26)-one(UInt32))
    arg1 = unsafe_load(argv, 2)
    ichar = 1
    while (b = unsafe_load(arg1, ichar)) != 0x00
        if A ≤ b ≤ Z
            appears |= one(UInt32) << (b-A)
        elseif a ≤ b ≤ z
            appears |= one(UInt32) << (b-a)
        end
        if appears == (-1%UInt32)
            println(c"Is a Pangram")
            return Cint(0)
        end
        ichar = ichar + 1
    end
    println(c"Not a Pangram")
    return Cint(0)
end

compile_executable(ispangram, (Int, Ptr{Ptr{UInt8}}), "./")
Here’s a correct version that avoids branching on the byte value; it gets a 2x speedup on my machine:
julia> function ispangram12a(input::String)
           a = UInt8('a')
           appears::UInt32 = ~((one(UInt32)<<26)-one(UInt32))
           for b in codeunits(input)
               appears |= one(UInt32) << ((b | 0x20) - a)
               appears == (-1%UInt32) && return true
           end
           false
       end
ispangram12a (generic function with 1 method)

julia> @btime ispangram5($input)
  121.657 ns (0 allocations: 0 bytes)
true

julia> @btime ispangram10($input)
  116.376 ns (0 allocations: 0 bytes)
true

julia> @btime ispangram12a($input)
  62.245 ns (0 allocations: 0 bytes)
true
It uses the | 0x20 trick to ensure that if the byte is an uppercase ASCII letter it becomes a lowercase one. Then we unconditionally subtract 'a' and shift. This works because anything that is not a letter turns into a shift of more than 25 (since we’re using unsigned ints), and those shifts have no effect: they either hit bits we already set in the original value of appears, or shift out of the 32-bit word entirely. I also removed the unnecessary @inbounds, which had no effect.
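As a rough illustration, here is a hypothetical Python port of the same branch-free bitmask idea (the function name is mine, not from the thread; Python integers don’t wrap like UInt8/UInt32, so the modulo and 32-bit mask below emulate the Julia semantics explicitly):

```python
def is_pangram_bitmask(s: str) -> bool:
    # Start with bits 26-31 already set, mirroring ~((one(UInt32)<<26)-one(UInt32)).
    appears = 0xFFFFFFFF & ~((1 << 26) - 1)
    for b in s.encode("ascii", "ignore"):
        # OR-ing 0x20 lowercases ASCII letters; % 256 emulates UInt8 wraparound.
        shift = ((b | 0x20) - ord("a")) % 256
        # Letters give shifts 0-25; non-letters give shifts of 26 or more,
        # which either hit the pre-set bits 26-31 or vanish under the 32-bit mask.
        appears |= (1 << shift) & 0xFFFFFFFF
    return appears == 0xFFFFFFFF
```

The branch on the byte value is gone: every byte goes through the same OR/subtract/shift path, and the pre-set high bits absorb the non-letter cases.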
Below is a version that removes the early-exit branch and instead checks just once after the loop. It happens to be about the same speed for the given input (410 characters), but for shorter inputs it can be significantly faster; on a 200-character string, for example, it is 2x faster.
julia> function ispangram12b(input::String)
           a = UInt8('a')
           appears::UInt32 = ~((one(UInt32)<<26)-one(UInt32))
           for b in codeunits(input)
               appears |= one(UInt32) << ((b | 0x20) - a)
           end
           appears == (-1%UInt32)
       end
ispangram12b (generic function with 1 method)

julia> @btime ispangram12a($input)
  63.673 ns (0 allocations: 0 bytes)
true

julia> @btime ispangram12b($input)
  66.496 ns (0 allocations: 0 bytes)
true

julia> @btime ispangram12a($(input[1:200]))
  61.429 ns (0 allocations: 0 bytes)
true

julia> @btime ispangram12b($(input[1:200]))
  31.791 ns (0 allocations: 0 bytes)
true
Edit: Interestingly, it is managing to process about 6.2 bytes per nanosecond. I assume the compiler has automatically SIMD’d the loop?
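That throughput figure presumably comes from dividing the 410-byte input by the ispangram12b timing above; a quick sanity check:

```python
# Numbers taken from the benchmark output above.
input_len = 410     # bytes in the benchmark string
time_ns = 66.496    # @btime result for ispangram12b on the full input
throughput = input_len / time_ns
print(round(throughput, 1))  # ≈ 6.2 bytes per nanosecond
```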
To be fair to Python, whose timing in the OP also included startup time:
import sys
import timeit

def verifica_letras(string):
    letras_usadas = set(string.lower())
    letras_alfabeto = set('abcdefghijklmnopqrstuvwxyz')
    return letras_usadas == letras_alfabeto
input = "wjjqevffkkgbcehhiqpvqutmwxawzvjnbvukmlzxyhkgfddzfjhcujnlkjbdfgghjhujkiuytghjioplkjhgfdsaqwertyujioplkjhgfdsaqwertzuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkj"
v = verifica_letras(input)
print(timeit.timeit(lambda: verifica_letras(input), setup="pass", number=1_000_000))
# 3.89342975
which amounts to 3.9µs per execution.
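The per-execution figure is just the timeit total divided by the iteration count:

```python
# Converting the timeit total above to a per-call time.
total_s = 3.89342975            # seconds for number=1_000_000 calls
per_call_us = total_s / 1_000_000 * 1e6
print(round(per_call_us, 1))    # ≈ 3.9 µs per call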
using BenchmarkTools

verifica_letras(str) = Set(str |> lowercase) == Set("abcdefghijklmnopqrstuvwxyz")
ispangram(input) = ('a':'z' ⊆ lowercase(input))

function ispangram12b(input::String)
    a = UInt8('a')
    appears::UInt32 = ~((one(UInt32)<<26)-one(UInt32))
    for b in codeunits(input)
        appears |= one(UInt32) << ((b | 0x20) - a)
    end
    appears == (-1%UInt32)
end
input = "wjjqevffkkgbcehhiqpvqutmwxawzvjnbvukmlzxyhkgfddzfjhcujnlkjbdfgghjhujkiuytghjioplkjhgfdsaqwertyujioplkjhgfdsaqwertzuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkj"
@btime verifica_letras($input)
# 3.982 μs (17 allocations: 7.77 KiB)
@btime ispangram($input);
# 3.620 μs (10 allocations: 7.01 KiB)
@btime ispangram12b($input);
# 51.840 ns (0 allocations: 0 bytes)
# Mac Mini M1
Here Julia’s verifica_letras corresponds one-to-one to the Python code, just converted to a one-liner (ispangram was made a one-liner, too). We see that in this case Julia and Python are equally fast.
What I like about Julia: Quite often I start writing a multiline function, then step-by-step remove unnecessary intermediate steps until I have just one (well-readable) LOC.
Isn’t the conclusion that the beautiful and simple 'a':'z' ⊆ lowercase(input) is just fine, and that @Flavio_Barros simply timed compilation and runtime jointly?
Yeah, when doing SIMD you normally want to do batches (e.g., 64 at a time) in between checks.
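A sketch of that batched early-exit pattern (my own Python illustration, not from the thread): run the branch-free bitmask update over a fixed-size chunk, and only test for completion between chunks, so the inner loop stays free of data-dependent branches and remains vectorizable in a compiled language.

```python
def is_pangram_batched(s: str, batch: int = 64) -> bool:
    # Bits 26-31 pre-set, as in the Julia versions above.
    appears = 0xFFFFFFFF & ~((1 << 26) - 1)
    data = s.encode("ascii", "ignore")
    for start in range(0, len(data), batch):
        # Branch-free update over the whole chunk (a compiler could SIMD this).
        for b in data[start:start + batch]:
            appears |= (1 << (((b | 0x20) - 97) % 256)) & 0xFFFFFFFF
        # Completion check only once per batch, not once per byte.
        if appears == 0xFFFFFFFF:
            return True
    return False
```

The trade-off is that an early exit can come at most one batch late, in exchange for a tighter inner loop.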
Yeah, it looks to me like we’ve nailed down the source of the time difference. Then, as is tradition for this community, a gang of wild Julians showed up and beat the problem into algorithmic submission to prove that it could be computed even faster if code complexity, readability, and robustness aren’t a concern.
And as usual I enjoy and learn a lot from the wild Julians.