Julia vs Numba

I used to use Numpy+Numba for some parallel computations with @nb.njit and nb.prange in Python. How does Julia compare to that? In the Python version, I used TBB/OMP threading layers and some optimization to speed up the computation from many minutes to sub-second. AFAIK, both are using LLVM under the hood, so would Julia be faster, and if yes/no, why? Thanks! :slightly_smiling_face:

I think this should be marked off-topic instead of Performance.

The question is quite broad, and the only real answer is “it depends”, but I encourage you to build a full benchmark.

Sometimes we win over Numba, sometimes not. The reason is that Julia gives more information to the compiler than Python does, but LLVM is vast, and Numba may do something clever somewhere or use an optimisation we declined to apply for some reason.

As for the threading part, I don’t think we have any composability layer with Intel threading. You could look at oneAPI.jl, though it’s not that widely used, or OpenCL.jl, which may help. Before that, you should check whether Julia’s own threading system isn’t already enough.
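For reference, a minimal sketch of what base Julia’s threading looks like (the function name and workload are made up for illustration; start Julia with `julia -t auto` or set `JULIA_NUM_THREADS`):

```julia
using Base.Threads

# Fill `out` in parallel; Threads.@threads splits the iteration range
# across the available threads (dynamic scheduling by default since 1.8).
function parallel_map!(f, out, xs)
    @threads for i in eachindex(out, xs)
        @inbounds out[i] = f(xs[i])
    end
    return out
end
```

With a single thread this degrades gracefully to a serial loop, so it’s a cheap first thing to try before reaching for external threading layers.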

Finally, I want to say: we don’t tell people “even if you love Python, move to Julia”; I’m pretty sure we never did. If you’re fully happy with your Python code, I don’t see a reason to use Julia besides curiosity. You’re still very welcome of course, and if you want to optimise something a lot more, using Numba as a proof of concept for the performance is fully OK too, and people will help.

Marked as off-topic. I’ll build a full implementation of the same tool later for a comprehensive benchmark. On threading: TBB allows dynamic scheduling for nested loops with different loads per inner loop; I’ll look into whether Julia has something similar.

When I tested Numba, which was some years ago, small Julia functions were 10 times faster than small Numba functions because Julia aggressively inlines, so there’s no call overhead.
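As an illustration of that inlining (all names here are invented for the example): Julia will normally inline a tiny scalar kernel like this into its caller, which you can inspect with `@code_llvm`:

```julia
# Tiny scalar kernel; the compiler inlines small functions like this into
# callers automatically, so the loop below pays no per-call overhead.
axpy_scalar(a, x, y) = a * x + y

function saxpy!(out, a, x, y)
    @inbounds for i in eachindex(out, x, y)
        out[i] = axpy_scalar(a, x[i], y[i])  # inlined into the loop body
    end
    return out
end
```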

But the main advantage of Julia is the readability and maintainability:

Numba code looks like Python, but behaves very differently. That causes a lot of pain for people who have to read and maintain Python projects that use Numba.

True. And Numba’s compatibility with other tools isn’t great either; building binaries for a tool that uses Numba is so painful that I just chose PyPI after all. Julia probably doesn’t have such problems.

LLVM is only a fraction of any compiler, so this doesn’t really say much about performance. There would technically be a performance ceiling that one language implementation could reach for a given program, but we could easily write an LLVM-based language that is very slow. Numba did a lot of work on the front end to enable LLVM optimizations down the line, even deviating from Python semantics in a few key ways.

But we can still compare the performance of programs, of course. I’d agree with yolhan_mannes that it’s a mixed bag in practice, and part of the mixed bag is that equivalent programs aren’t even easy to write. I once tried to port a Julia program with a few custom nested structs and scalar functions to Numba-accelerated Python, and I could only manage part of it, with significant distortion of the API and the Numba usage. The developers of Python, NumPy, or Numba are not at fault for not providing the same features as Julia, just as Julia is not at fault for not providing inherited classes; they’re just different ways of instructing the machine. In my opinion, Julia has a large advantage in features because it was designed for a JIT, while Numba is retrofitted onto Python and NumPy; but if I needed to accelerate a Python project, I’d still reach for NumPy+Numba before interop with Julia or a port to Julia.
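To make the struct point concrete, here is a hypothetical sketch (all names invented) of the kind of nested structs plus scalar function that are trivial in Julia but awkward to express through Numba’s `jitclass`/`structref` machinery:

```julia
# Nested immutable structs with a scalar function over them; Julia's JIT
# specializes `frequency` on the concrete types with no extra annotations.
struct Spring
    k::Float64       # stiffness
end

struct Oscillator
    spring::Spring   # structs nest naturally
    mass::Float64
end

frequency(o::Oscillator) = sqrt(o.spring.k / o.mass) / (2π)
```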

As for CPU multithreading, I like how base Julia works a lot more than base Python: normal functions are fully integrated into asynchronous tasks on multiple OS threads on multiple CPU cores, and I didn’t need to hear about 3+ different frameworks along the way. But that’s not what Numba does, which is more like bolting C/C++ frameworks onto JIT-compiled Python+NumPy. I think they did a great job there, and your speedup from >1 min to <1 sec is a clear win, unlike my port attempt. I also really like the simplicity of Numba’s high-level parallelization API over loops and array operations. Julia does have high-level parallelization like that, or rather my problem is that there are MANY such third-party libraries (here’s a chapter about just a few) with very little discussion of where the most active development is; it’s similar to my problem with base Python’s parallelism, but worse in some ways. There is a good argument that these options give us more control, but at this point I prefer fewer, longer, single-core tasks (aside: this needs base Python, probably concurrent.futures.ProcessPoolExecutor, for separate no-GIL Numba functions) or the specific finer-grained parallelism built into LinearAlgebra’s BLAS or CUDA.jl when I can, which is quite often because I’m not doing anything particularly sophisticated.
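A sketch of that task model (illustrative names): any ordinary function call can become a concurrent task with `Threads.@spawn`, and `fetch` collects the result.

```julia
using Base.Threads

# Compute extrema of two halves concurrently: one half in a spawned task,
# the other in the current task, then combine the partial results.
function parallel_minmax(xs)
    mid = length(xs) ÷ 2
    task = Threads.@spawn extrema(view(xs, 1:mid))
    hi = extrema(view(xs, mid+1:lastindex(xs)))  # current task does this half
    lo = fetch(task)                             # wait for the spawned task
    return (min(lo[1], hi[1]), max(lo[2], hi[2]))
end
```

The same composition works whether the spawned work is a one-liner or a deep call tree, which is what “normal functions integrated into tasks” means in practice.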

I think this is kind of understating what TBB does. The superficial answer is “yes, Julia’s task scheduler does that”, but TBB is a whole different framework. For just one example, work stealing is a major aspect of TBB, while it hasn’t landed in Julia yet.
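Not TBB, but the closest base-Julia analogue for nested loops with uneven inner loads is to spawn one task per outer iteration and let the scheduler hand tasks to whichever thread is free (a sketch with a made-up workload):

```julia
using Base.Threads

# Inner workload grows quadratically with i, so static chunking would
# balance badly; one task per outer iteration lets idle threads pick up
# the heavy tail as earlier tasks finish.
function uneven_work(n)
    tasks = [Threads.@spawn sum(sin, 1:(i^2)) for i in 1:n]
    return sum(fetch, tasks)
end
```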

Thanks for the comprehensive reply!

Welp, I feel like Julia is Lisp-cursed, in the sense that little centralised effort goes into building mainstream libraries (apart from SciML and a few others); instead it’s spread across (perhaps unmaintained) individual projects. Although some say this aligns with Julia’s philosophy, I don’t feel too optimistic about going down this path.

I checked the Wikipedia page on this one, which says the two main approaches to dynamic scheduling are work stealing and work sharing. Does TBB use the latter?

Not that I’m aware of, but frankly that doesn’t mean much because I don’t really know how multithreading works. It’s probably more work to learn than the Julia ecosystem’s parallelization options. I also can’t really give insight on base Julia’s multithreading. For example, work sharing and work stealing are both broad approaches to balancing tasks across threads and processors; I have no idea where Julia’s thread-migrating tasks fall. It seems to do a good job of keeping my cores busy when I eyeball the Windows Task Manager, so I’m not sure how work stealing could improve that. Maybe it’s just more of an issue for the overheads of managing numerous shorter tasks, which is not something I have to deal with.

I tried implementing Julia vs Numba - #7 by lilachint; unless I made a mistake, we’re 32× faster here on a big enough piece of code:

using Statistics

const T = Float32

struct Ackley
    x_range ::Tuple{T,T}
    y_range ::Tuple{T,T}
    minima ::T
    location ::Tuple{T,T}
end
function Ackley()
    x_range = (T(-5),T(5))
    y_range = (T(-5),T(5))
    minima = T(0)
    location = (T(0),T(0))   
    return Ackley(x_range,y_range,minima,location) 
end

@fastmath function Query(A::Ackley,x::T, y::T) ::T
    z = -20*exp(-T(0.2)*sqrt(T(0.5)*(x*x+y*y))) - exp(T(0.5)*(cospi(2*x)+cospi(2*y))) + T(ℯ) + 20
    return z
end


function Initialise_Population(Population_size,optimiser)
    Population = Matrix{T}(undef,2,Population_size)
    fitness = Vector{T}(undef,Population_size)
    for i in 1:Population_size
        Population[1,i] = (optimiser.x_range[2]-optimiser.x_range[1]) * rand(T) + optimiser.x_range[1]
        Population[2,i] = (optimiser.y_range[2]-optimiser.y_range[1]) * rand(T) + optimiser.y_range[1]
        fitness[i] = Query(optimiser,Population[1,i] , Population[2,i])
    end
    return Population, fitness
end

struct Bat_algorithm
    x_range ::Tuple{T,T}
    y_range ::Tuple{T,T}
    Population_size ::Int32
    Population ::Matrix{T}
    fitness ::Vector{T}
    Loudness ::Vector{T}
    Pulse_Rate ::Vector{T}
    Frequency ::Matrix{T}
    Velocity ::Matrix{T}
    Movements ::Int32 
    Frequency_Range ::Tuple{T,T}
    Loudness_Decay ::T
    Loudness_Limit ::T
    Pulse_Rate_Decay ::T
    Gamma ::T
    Best_Position ::Vector{T}
    Best_fitness ::Base.RefValue{T}
    Best_Bat ::Base.RefValue{Int32}
end
function Bat_algorithm(optimiser; Population_size = Int32(100),Num_Movements=Int32(100))
    x_range = optimiser.x_range
    y_range = optimiser.y_range
    Population, fitness =  Initialise_Population(Population_size, optimiser)
    Loudness = ones(T,Population_size)
    Pulse_Rate = zeros(T,Population_size)
    Frequency = zeros(T,(2,Population_size))
    Velocity = ones(T,(2,Population_size))
    Movements = Num_Movements
    Frequency_Range = (T(0),T(1))
    Loudness_Decay = T(0.5)
    Loudness_Limit = T(0.05)
    Pulse_Rate_Decay = T(0.5)
    Gamma = T(0.5)
    Best_Position = Vector{T}(undef,2)
    Best_fitness = Ref(T(0))
    Best_Bat = Ref(Int32(0))
    return Bat_algorithm(
        x_range,
        y_range,
        Population_size,
        Population,
        fitness,
        Loudness,
        Pulse_Rate,
        Frequency,
        Velocity,
        Movements,
        Frequency_Range,
        Loudness_Decay,
        Loudness_Limit,
        Pulse_Rate_Decay,
        Gamma,
        Best_Position,
        Best_fitness,
        Best_Bat
    )
end

function update_dynamics!(Position,B::Bat_algorithm,bat,best)
    nf = B.Frequency_Range[1] + (B.Frequency_Range[2] - B.Frequency_Range[1])*rand(T)  # draw a frequency inside the range
    for i in 1:2
        nv = (B.Population[i,best] - B.Population[i,bat] ) *nf
        Position[i] = B.Population[i,bat] + nv
        B.Frequency[i,bat] = nf
        B.Velocity[i,bat] = nv
    end
    return Position
end

function run(B::Bat_algorithm, optimiser)
    best_fit, best_index = findmin(B.fitness)
    Position = T[0,0]
    for step in 2:B.Movements+1
        for bat in 1:B.Population_size
             #update the position of the bat 
            update_dynamics!(Position,B,bat,best_index)
            r1 = rand(T)
            r2 = rand(T)*2 - 1
            m = mean(B.Loudness)
            if r1 < B.Loudness[bat]
                for i in eachindex(Position)
                    Position[i] += m + r2 
                end
            end
            if Position[1] < optimiser.x_range[1]
                Position[1] = optimiser.x_range[1]
            elseif Position[1] > optimiser.x_range[2]
                Position[1] = optimiser.x_range[2]
            end
            if Position[2] < optimiser.y_range[1]
                Position[2] = optimiser.y_range[1]
            elseif Position[2] > optimiser.y_range[2]
                Position[2] = optimiser.y_range[2]
            end
            Bat_fitness = Query(optimiser,Position[1],Position[2])
            B.fitness[bat] = Bat_fitness
            for i in 1:2
                B.Population[i,bat] = Position[i]
            end
            @fastmath B.Pulse_Rate[bat] = B.Pulse_Rate_Decay*(1-exp(-B.Gamma*step))
            if B.Loudness[bat] > B.Loudness_Limit
                B.Loudness[bat] *= B.Loudness_Decay
            else
                B.Loudness[bat] = B.Loudness_Limit
            end
            if Bat_fitness < best_fit
                best_fit = Bat_fitness
                best_index = bat 
                B.Best_Position .= Position
                B.Best_fitness[] = Bat_fitness
                B.Best_Bat[] = bat
            end
        end
    end
end

function (@main)(ARGS::Vector{String})::Cint
    fn = Ackley()
    optimiser = Bat_algorithm(fn;Population_size = Int32(100),Num_Movements = Int32(200))
    t1 = time_ns()
    run(optimiser,fn)    
    println(Core.stdout,"time : ",1e-9*(time_ns()-t1))
    return 0
end
julia --check-bounds=no main.jl
time : 0.000682429

but my hardware may be very different from the one in the video, so one should run the Numba version on their own computer too. A 32× difference can’t really be explained by CPU differences alone in single-threaded mode, though.