Perf Julia@threads behind Numba@njit(parallel=True)

(This is a follow up to Julia vs Numba, having created a new discourse account)

I came across this article on the performance of Julia vs Python+Numba, and this section interested me. It claims parallel Julia code is quite slower than Numba code. I didn’t want to trust it just yet, so I did my own benchmark (8 threads i5-11320H @ 3.2GHz):

# julia -t8 loop_julia.jl

using BenchmarkTools

function loop_julia(a::Vector{Float32}, b::Vector{Float32})
    x::Vector{Float32} = zeros(Float32, length(a))

    Threads.@threads for i in 1:length(a)
        if a[i] < b[i]
            x[i] = 0.
        else
            x[i] = 1.
        end
    end

    return x
end

function main()
    a = randn(Float32, 1_000_000)
    b = randn(Float32, 1_000_000)
    @btime loop_julia($a, $b)
end

main()

The result is 1.44ms average.

# python loop_numba.py

import timeit
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def loop_numba(a, b):
    x = np.zeros(a.shape, dtype=np.float32)

    for i in prange(a.shape[0]):
        if a[i] < b[i]:
            x[i] = 0.
        else:
            x[i] = 1.

    return x

def main():
    setup = "a = np.random.randn(1_000_000).astype(np.float32); b = np.random.randn(1_000_000).astype(np.float32)"
    t = timeit.timeit("loop_numba(a, b)", setup, globals=globals(), number=10000)
    print(t / 10000 * 1000)

main()

The result is 0.83ms average.

The claims in the article are not wrong, Julia is indeed behind in performance when it comes to parallel computing. But where does the performance difference come from? This is a staggering x2 performance difference, which I didn’t expect before the benchmark.

Although optimized Numba code is a bit like C code, as I recall someone on discourse. But both tools using LLVM+multi-threading having a x2 perf difference should be due to something else.

Adding @inbounds gives the Julia code a 2x speedup; that might explain the difference if Numba doesn’t do bounds checking.

that assignment in the loop hides a conversion since that 0. is a Float64 not Float32.
Also you should use eachindex, which should get rid of bounds checking.

What do you get with this version?

function loop_julia2(a::Vector{Float32}, b::Vector{Float32})
    x = similar(a)

    Threads.@threads for i in eachindex(a, b, x)
        if a[i] < b[i]
            x[i] = zero(Float32)
        else
            x[i] = one(Float32)
        end
     end

     return x
end

I didn’t compare with numba (CBA to set it up), but for for reference on my machine I get this

julia> main_single();
  4.450 ms (3 allocations: 3.81 MiB)

julia> main2_single();
  832.200 μs (3 allocations: 3.81 MiB)

julia> main();
  1.694 ms (45 allocations: 3.82 MiB)

julia> main2();
  1.669 ms (45 allocations: 3.82 MiB)

So the single threaded version 2 is ~4x faster. It’s even 2x faster than the threaded version.
The fact that the two threaded versions take exactly the same amount of time makes me think that that’s just the overhead of spinning up threads. Which in julia is rather heavy, and you need enough work to make it worth it. in this case it’s clearly not enough.

In your benchmark you’re probably measuring task creation time and
probably numba just has a lighter threading model, which in this case pulls ahead.

The conversion in the loop doesn’t make any difference, the compiler is smart enough to handle it.

you’re right! well, it’s just good practice then.
I should stop underestimating the julia compiler.

I see, the fact that Julia has bounds checking doesn’t bother me, but allowing me to disable it is quite annoying! I had messed around with @inbounds before in another project, and the performance gain is always ambiguous. I had no idea why it gave me a speedup somewhere but not the other, it’s just not immediately obvious why and how much would it improve my code. Thanks for the help.

Using Polyester.jl can leverage this overhead:

julia> function loop_threads(a::Vector{Float32}, b::Vector{Float32})
           x = similar(a)

            Threads.@threads for i in eachindex(a, b, x)
               if a[i] < b[i]
                   x[i] = zero(Float32)
               else
                   x[i] = one(Float32)
               end
            end

            return x
       end
loop_threads (generic function with 1 method)

julia> function loop_polyester(a::Vector{Float32}, b::Vector{Float32})
           x = similar(a)

            Polyester.@batch for i in eachindex(a, b, x)
               if a[i] < b[i]
                   x[i] = zero(Float32)
               else
                   x[i] = one(Float32)
               end
            end

            return x
       end
loop_polyester (generic function with 1 method)

julia> function main()
           a = randn(Float32, 1_000_000)
           b = randn(Float32, 1_000_000)
           @info loop_threads(a, b) == loop_polyester(a, b)
           @btime loop_polyester($a, $b)
           @btime loop_threads($a, $b)
           nothing
       end
main (generic function with 1 method)

julia> main()
[ Info: true
  146.727 μs (3 allocations: 3.81 MiB)
  1.385 ms (25 allocations: 3.82 MiB)

It would be quite interesting if Julia threading is heavier than Numba, considering Numba could call external TBB/OMP for threading, yet the overhead is still lower than Julia somehow.

Aha! Numba does disable bounds checking by default, so the difference is explained: Just-in-Time compilation — Numba 0+untagged.855.g9e3087a.dirty documentation

If you want you can enable them though.

That makes perfect sense. May I ask something related? I used to work on a computation centric function, I had better results using @inbounds on specific loops, but worse performance when applying it to the whole function. Did I mistake how the macro works? Thanks!

Numba does not check bounds by default for performance, and you have to opt back into it if you can’t design memory-safe indexing otherwise. It deviates from indexing semantics in base Python and NumPy, but it’s consistent with the lack of bounds-checking in NumPy functions, which are designed to never index out of bounds.

@inbounds isn’t a simple guarantee of less work at runtime, that’s why it’s put in our hands. All @inbounds does is remove compiler-inlined @boundscheck blocks, which are intended to check bounds prior to indexing. If those blocks take very little time compared to indexing, then we save very little time. Without the guarantee of indexing safety, the compiler could be less willing to optimize our code. There are other macros with more requirements for optimizing loops.

While everything else is very informative, this part I’m a bit confused. Isn’t @inbounds meant to be a high risk high reward (not very high, I mean) option? Is Julia bounds checking done on the Julia compiler level, or is it done by LLVM? If it is the former I see no reason why we can’t just remove the Julia injected code for bounds checking and assert to the LLVM compiler that all is fine.

Julia’s multithreading is task-based, like Go. If you have threaded code that calls threaded code, it all enters a task pool so that the number of parallel executed tasks matches the number of threads.

workqueue, the default threading model of Numba, does not have any hierarchical support. This makes it probably have the lowest overhead (would need to test, but in theory if it’s not awful it should), but this means you’d get thrashing in many scenarios. For example, would mean a 32-thread program that calls a 32-thread program would have 1024 concurrent tasks that all thrash the OS and ultimately slows it down. Julia’s code would work just fine in this scenario.

While OpenMP does have some sense of hierarchical support, i.e. it can handle nesting parallel loop constructs, it cannot handle tasks outside of the hierarchy. For example, if you write some recursive BLAS routine that would want to use all threads, and then separately run some HTTP server, those would have separate threading pools, and this would result in thrashing.

TBB does have a work-sharing system, though Numba does not expose the full task graph functionality like Julia so you can’t really build things that requires arbitrary task scheduling. Standard numerical solving routines tend to be okay without this, while doing something like writing an HTTP server tends to want to have arbitrary task scheduling.

Of course, supporting more generality will always give more overhead. So testing Julia against workqueue is probably going to have Numba looking faster. The Julia equivalent to that is Polyester.jl, which this thread (pun intended) shows is a lot faster. OpenMP is a bit of a middle ground, and then TBB is general but (a) Numba doesn’t expose most of the features advanced cases would need to use, and (b) it’s not what is being benchmarked here, and would definitely introduce overhead. As much as Julia’s? I don’t know, but it would require more benchmarking

Julia’s threading could use some optimizations, don’t get me wrong, I know what I’d like to tweak. But this right here is very misleading: yes Numba does support TBB. But no, benchmarking with TBB off like the 2023 article you linked to and what you benchmarked in the thread (presumably, since you didn’t mention anything special about the build changing to TBB) does not give a very good indication of performance with TBB. What’s effectively happening here is you’re benchmarking with the Julia-equivalent of Polyester.jl, seeing the low overhead, and then saying “there’s a version with more features, and thus it has the features and the performance”, when in reality changing the threading backend will heavily effect the threading overhead.

Great information! I defintely overlooked the effect of the threading model behind the two approaches. I would look into Polyester.jl. (Actually, why is it named polyester? I’m just curious)

It’s very cheap threads. Pun intended.

But of course the caveat is the Polyester’s threads do not nest like the normal Julia threads. It effectively uses the same backend but sets the parameters in such a way that it opts out of being able to nest (by keeping the threads alive).

Ohhh, I focused on the material but forgot the original definition of threads. Lol.

Would you mind recommending a read on thread nesting? I’ve never heard of it before, or maybe I just didn’t know the name for it.

(Nevermind, is thread nesting just threading while using loops to spawn more tasks? Correct me if I’m wrong)

Bounds-checking is implemented in Julia source, we can read it (start here for arrays). If we design our own indexable types, we can implement our own bounds-checking as well.

@boundscheck and @inbounds are different. The macros are implemented in Julia source, but they just tag the input code with special expressions. A compilation step might remove bounds-checking where the tags are. Putting this in the compiler’s hands allows the process’s command-line flag --check-bounds to change whether @inbounds even works.

There isn’t an equivalent system in Numba because we can’t design another indexable type for NumPy functions to work on, only use NumPy arrays as components for new types.

That’s hierarchical, and that’s what OpenMP supports. TBB and Julia’s PARTR schedule support more general threadpool sharing. Some reading:

https://dl.acm.org/doi/10.1145/1248377.1248396

We dissect TBB activities into four basic categories and
show that the runtime library can contribute up to 47%
of the total per-core execution time on a 32-core system.
While this overhead is much lower at low core counts, it
hinders performance scalability by placing a core count
dependency on performance.

And for good measure, Cilk is another system in this space:

https://dl.acm.org/doi/10.1145/209937.209958

To compare:

  • TBB: Child-Stealing work scheduling, tasks defined statically
  • Cilk: Continuation-Stealing work scheduling, task tasks defined statically
  • Julia PARTR: Parallel Depth-First scheduling, fully dynamic tasks
  • Go coroutines: Child-Stealing work scheduling, fully dynamic tasks

So if I’m not mistaken, TBB and Cilk do require tasks to be known statically at compile time, while Julia and Go tasks can be fully dynamically created (for example, due to I/O events). All of that has trade-offs of course.