Julia slower than Python for finding pangrams?

I came up with a neat Julia solution using sets to solve that simple pangram. However, I’m a bit perplexed… C benchmarks show it running over 200 times faster, and even Python outperforms my Julia code substantially. Curious if anyone has insights into why this might be happening. Here’s my Julia implementation, along with the others:

In C

#include <stdio.h>
#include <stdlib.h>

#define ALPHABET_SIZE 26

int isPangram(char* str) {
    int alphabet[ALPHABET_SIZE] = {0};

    for (int i = 0; str[i] != '\0'; i++) {
        if (str[i] >= 'A' && str[i] <= 'Z') {
            alphabet[str[i] - 'A'] = 1;
        } else if (str[i] >= 'a' && str[i] <= 'z') {
            alphabet[str[i] - 'a'] = 1;
        }
    }

    for (int i = 0; i < ALPHABET_SIZE; i++) {
        if (alphabet[i] == 0) {
            return 0;
        }
    }

    return 1;
}

void main(int argc, char** argv) {   

    if (argc == 0) {
	    
	    printf("Provide at least one letter!");
	    exit(0);
    }

    int i;

    for(i = 1; i < argc; i++) {

       if (argc < 1) break;

       if (isPangram(argv[i])) {
           printf("yes, it's a pangram \n");
       } else {
           printf("no, it's not a pangram \n");
       }
    }

    exit(0);
}

In Python

import sys

def verifica_letras(string):
   letras_usadas = set(string.lower())
   letras_alfabeto = set('abcdefghijklmnopqrstuvwxyz')
   return letras_usadas == letras_alfabeto

if len(sys.argv) <= 1: 
    print("Provide at least one letter!")
    sys.exit()

if verifica_letras(sys.argv[1]): 
    print("yes, it's a pangram \n")
else:
    print("no, it's not a pangram \n")

In Julia

function ispangram(input)
	'a':'z' ⊆ lowercase(input) ? println("Yes, is a pangram.") : println("Not, it's not a pangram.")
end

length(ARGS) > 0 ? ispangram(ARGS[1]) : println("Provide at least one letter!")

Benchmarks with hyperfine - C

> hyperfine --shell=none --warmup 30 "./a.out wjjqevffkkgbcehhiqpvqutmwxawzvjnbvukmlzxyhkgfddzfjhcujnlkjbdfgghjhujkiuytghjioplkjhgfdsaqwertyujioplkjhgfdsaqwertzuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkj"

Result

Time (mean ± σ): 701.0 µs ± 152.1 µs [User: 525.3 µs, System: 67.9 µs]
Range (min … max): 523.8 µs … 3765.6 µs 3569 runs

Benchmarks with hyperfine - Python

> hyperfine --shell=none --warmup 30 "python3 pangram.py wjjqevffkkgbcehhiqpvqutmwxawzvjnbvukmlzxyhkgfddzfjhcujnlkjbdfgghjhujkiuytghjioplkjhgfdsaqwertyujioplkjhgfdsaqwertzuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkj"

Result

Time (mean ± σ): 15.9 ms ± 3.8 ms [User: 11.6 ms, System: 3.8 ms]
Range (min … max): 12.4 ms … 37.7 ms 210 runs

Benchmarks with hyperfine - Julia

hyperfine --shell=none --warmup 30 "julia --compile=all pangram.jl wjjqevffkkgbcehhiqpvqutmwxawzvjnbvukmlzxyhkgfddzfjhcujnlkjbdfgghjhujkiuytghjioplkjhgfdsaqwertyujioplkjhgfdsaqwertzuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkj"

Result

Time (mean ± σ): 266.0 ms ± 23.4 ms [User: 228.5 ms, System: 112.0 ms]
Range (min … max): 240.1 ms … 314.1 ms 12 runs

Puzzled

C is expectedly much faster, but I’m surprised at the performance gap between my Python and Julia implementations. Python finished in 15 ms, while Julia took more than 17 times that. Could anyone familiar with both languages help me understand why Julia might be lagging here?

Hi! I took the liberty to rename your post to make it more specific. Very generic titles like “Julia slower than Python!!!” have the tendency to attract pile-up, and we only need a few expert opinions here :slight_smile:

5 Likes

Fair…I am new around, sorry for that.

No worries! And now we get to sit back and watch while everyone you nerdsniped comes together to find a blazingly fast Julia implementation :wink:

Although the first thing I would say is that you’re including startup and compilation time in your benchmark, which is typically the wrong way to go about this. If you plan to run your function many times (as is typical for most functions), this one-time cost becomes negligible, and so you shouldn’t measure it. In Python it’s not too visible, but Julia has much higher latency

4 Likes

You are measuring startup time, and Julia (still, despite major improvements) is known to have a slow startup time: it compiles your code every time you run a new Julia process. Basically,

is not how you should run Julia programs, at least not if the runtime of pangram.jl is less than a few minutes. Instead, you start a REPL / IJulia kernel or some other long running process and then run / include your Julia code in that.

In this case, in a new Julia REPL:

julia> function ispangram(input)
               return ('a':'z' ⊆ lowercase(input))
       end
ispangram (generic function with 1 method)

julia> input = "wjjqevffkkgbcehhiqpvqutmwxawzvjnbvukmlzxyhkgfddzfjhcujnlkjbdfgghjhujkiuytghjioplkjhgfdsaqwertyujioplkjhgfdsaqwertzuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkj"
"wjjqevffkkgbcehhiqpvqutmwxawzvjnbvukmlzxyhkgfddzfjhcujnlkjbdfgghjhujkiuytghjioplkjhgfdsaqwertyujioplkjhgfdsaqwertzuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwer
tyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkj"

julia> ispangram(input)
true

julia> using BenchmarkTools

julia> @benchmark ispangram($input)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.641 μs … 245.906 μs  ┊ GC (min … max): 0.00% … 96.62%
 Time  (median):     4.031 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.366 μs ±   7.129 μs  ┊ GC (mean ± σ):  5.55% ±  3.35%

            ▃▆█▄▂
  ▂▅█▆▄▄▆▆▅▆█████▇▇▆▄▄▄▄▄▃▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  3.64 μs         Histogram: frequency by time        5.41 μs <

 Memory estimate: 7.01 KiB, allocs estimate: 10.
12 Likes

Using more explicit for loop:

function ispangram3(input)
    appears = falses(26)
    for c in input
        isletter(c) || continue
        appears[lowercase(c)-'a'+1] |= true
    end
    return all(appears)
end

is a bit faster.

1 Like

Yeah, but this is ~100% startup/compilation time. It’s kinda like timing:

hyperfine --shell=none --warmup 30 "cc pangram.c && ./a.out wjjqevffkkgbcehhiqpvqutmwxawzvjnbvukmlzxyhkgfddzfjhcujnlkjbdfgghjhujkiuytghjioplkjhgfdsaqwertyujioplkjhgfdsaqwertzuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsazxcvbnmlkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkjhgfdsaqwertyuioplkj"`

You might be saying “well don’t do that” — and that’s exactly what Julians will say about timing julia pangram.jl like that, too.

The good news is that we’re getting closer to having a real juliac that can output compiled stand-alone executables with significantly lower startup times like cc does.

31 Likes

Since you are only looking for ASCII characters, it’s even faster still to search the raw bytes of the UTF-8 encoding, via codeunits(input), analogous to the C implementation.

function ispangram4(input::String)
    a,A,z,Z = UInt8('a'),UInt8('A'),UInt8('z'),UInt8('Z')
    appears = fill(false, 26) # Vector{Bool} is faster than BitVector falses(26)
    @inbounds for b in codeunits(input)
        if A ≤ b ≤ Z
            appears[b-(A-0x01)] = true
        elseif a ≤ b ≤ z
            appears[b-(a-0x01)] = true
        end
    end
    return all(appears)
end

Note that a Vector{Bool} is faster to work with than a BitVector, which is why I didn’t use falses(26) here. You could save a bit more time by pre-allocating the appears array, though hopefully someday Julia will stack-allocate it automatically (Move small arrays to the stack by pchintalapudi · Pull Request #43573 · JuliaLang/julia · GitHub).

But the first-order issue here is how you do the benchmarking, as commented above.

8 Likes

I appreciate all the optimization suggestions! My surprise was that simple Python and C solutions outperformed my initial Julia code. I realize now that my Julia speed test wasn’t accurate.

6 Likes

Bitsets allow for efficient early stopping. Here, I manually use a UInt32 as a bitset.

julia> function ispangram5(input::String)
           a,A,z,Z = UInt8('a'),UInt8('A'),UInt8('z'),UInt8('Z')
           appears::UInt32 = ~((one(UInt32)<<26)-one(UInt32))
           @inbounds for b in codeunits(input)
               if A ≤ b ≤ Z
                   appears |= one(UInt32) << (b-A)
               elseif a ≤ b ≤ z
                   appears |= one(UInt32) << (b-a)
               end
               appears == (-1%UInt32) && return true
           end
           false
       end
ispangram5 (generic function with 1 method)

julia> @btime ispangram4($input)
  470.774 ns (1 allocation: 80 bytes)
true

julia> @btime ispangram5($input)
  123.972 ns (0 allocations: 0 bytes)
true

Lack of allocations can also make a difference for shorter inputs

julia> @btime ispangram4("abcdefghijklmnopqrstuvwxyz")
  57.158 ns (1 allocation: 80 bytes)
true

julia> @btime ispangram5("abcdefghijklmnopqrstuvwxyz")
  34.608 ns (0 allocations: 0 bytes)
true

julia> @btime ispangram4("abcdefghijklmnopqrstuvwxy")
  55.704 ns (1 allocation: 80 bytes)
false

julia> @btime ispangram5("abcdefghijklmnopqrstuvwxy")
  33.771 ns (0 allocations: 0 bytes)
false

With some more effort, we could SIMD this…

10 Likes

We can shave this one a bit with a mask:

julia> mask = UInt32(2^26 - 1)
0x03ffffff

julia> function ispangram6(input::String)
                  a,A,z,Z = UInt8('a'),UInt8('A'),UInt8('z'),UInt8('Z')
                  appears::UInt32 = 0
                  @inbounds for b in codeunits(input)
                      if A ≤ b ≤ Z
                          appears |= one(UInt32) << (b-A)
                      elseif a ≤ b ≤ z
                          appears |= one(UInt32) << (b-a)
                      end
                      iszero(appears ⊻ mask) && return true
                  end
                  false
              end
ispangram6 (generic function with 1 method)

julia> @btime ispangram5($input)
  82.210 ns (0 allocations: 0 bytes)
true

julia> @btime ispangram6($input)
  3.135 μs (91 allocations: 1.42 KiB)
true

julia> ispangram6("abcdeghijklmnopqrstuvwxyz")
false
1 Like

Belatedly realized that iszero(a ⊻ b) is just a fancy way of saying a == b, that bums a few more cycles.

julia> function ispangram7(input::String)
                  a,A,z,Z = UInt8('a'),UInt8('A'),UInt8('z'),UInt8('Z')
                  appears::UInt32 = 0
                  @inbounds for b in codeunits(input)
                      if A ≤ b ≤ Z
                          appears |= one(UInt32) << (b-A)
                      elseif a ≤ b ≤ z
                          appears |= one(UInt32) << (b-a)
                      end
                      appears == mask && return true
                  end
                  false
              end
ispangram7 (generic function with 1 method)

julia> @btime ispangram7($input)
  1.300 μs (92 allocations: 1.44 KiB)
true

julia> ispangram7("abcdeghijklmnopqrstuvwxyz")
false

The allocations are presumably because the mask is global?

1 Like

Yes:

julia> function ispangram8(input::String, mask::UInt32)
                  a,A,z,Z = UInt8('a'),UInt8('A'),UInt8('z'),UInt8('Z')
                  appears::UInt32 = 0
                  @inbounds for b in codeunits(input)
                      if A ≤ b ≤ Z
                          appears |= one(UInt32) << (b-A)
                      elseif a ≤ b ≤ z
                          appears |= one(UInt32) << (b-a)
                      end
                      appears == mask && return true
                  end
                  false
              end
ispangram8 (generic function with 2 methods)

julia> @btime ispangram8($input, $mask)
  97.457 ns (0 allocations: 0 bytes)

Edit: so I did, just now, notice that I was comparing nanoseconds with microseconds earlier. Oops.

2 Likes

That’s still slower than ispangram5?
But we should just be testing vs 0, except that’s slower for me, too

julia> @btime ispangram5($input)
  85.659 ns (0 allocations: 0 bytes)
true

julia> @btime ispangram9($input)
  123.144 ns (0 allocations: 0 bytes)
true

Which isn’t surprising

	cmp	ecx, -1
	jne	.LBB0_10

macro-op fuses into one uop, so we shouldn’t really expect some other comparison to be noticeably faster.

5 Likes

Yeah, wasn’t paying attention there.

This, on the other hand, just barely squeezes past it:

julia> @inline shift_safe(::Type{T}, s::Integer) where {T} = s & (8 * sizeof(T) - 1)
shift_safe (generic function with 1 method)

julia> function ispangram9(input::String, mask::UInt32)
                  a,A,z,Z = UInt8('a'),UInt8('A'),UInt8('z'),UInt8('Z')
                  appears::UInt32 = 0
                  @inbounds for b in codeunits(input)
                      if A ≤ b ≤ Z
                          appears |= one(UInt32) << shift_safe(UInt32, b-A)
                      elseif a ≤ b ≤ z
                          appears |= one(UInt32) << shift_safe(UInt32, b-a)
                      end
                      appears == mask && return true
                  end
                  false
              end
ispangram9 (generic function with 1 method)

julia> @btime ispangram8($input, $mask)
  97.501 ns (0 allocations: 0 bytes)
true

julia> @btime ispangram9($input, $mask)
  80.317 ns (0 allocations: 0 bytes)
true

julia> @btime ispangram5($input)
  82.212 ns (0 allocations: 0 bytes)

Borrowed that from BitPermutations.jl, credit where credit is due.

But of course, combining this with pangram5 is the real win:

julia> function ispangram10(input::String)
                  a,A,z,Z = UInt8('a'),UInt8('A'),UInt8('z'),UInt8('Z')
                  appears::UInt32 = ~((one(UInt32)<<26)-one(UInt32))
                  @inbounds for b in codeunits(input)
                      if A ≤ b ≤ Z
                          appears |= one(UInt32) << shift_safe(UInt32, b-A)
                      elseif a ≤ b ≤ z
                          appears |= one(UInt32) << shift_safe(UInt32, b-a)
                      end
                      appears == (-1%UInt32) && return true
                  end
                  false
              end
ispangram10 (generic function with 1 method)

julia> @btime ispangram10($input)
  57.249 ns (0 allocations: 0 bytes)
3 Likes

I wonder if StaticCompiler.jl could be applied here.

4 Likes

This if is very cumbersome in source and perhaps generates branching code. Since we are dealing with UInt8s we might as well do bitwise lowercasing:

b2 = b | 0x20

and use b2 for appears.

4 Likes

Good call! There’s no need for a branch at all, because ASCII is very cleverly designed to help us out.

julia> function ispangram11(input::String)
                         appears::UInt32 = ~((one(UInt32)<<26)-one(UInt32))
                         @inbounds for b in codeunits(input)
                             appears |= one(UInt32) << shift_safe(UInt32, (b & 0x9f) - 1)
                             appears == (-1%UInt32) && return true
                         end
                         println(bitstring(appears))
                         false
                     end
ispangram11 (generic function with 1 method)

julia> @btime ispangram11($input)
  45.202 ns (0 allocations: 0 bytes)
true

julia> @btime ispangram10($input)
  57.266 ns (0 allocations: 0 bytes)
true
3 Likes

That’s not correct if there are characters other than ASCII letters. e.g.

julia> b = UInt8('\r'); (b & 0x9f) - 1
12
3 Likes