Julia vs Numba

I used to use Numpy+Numba for some parallel computations with @nb.njit and nb.prange in Python. How does Julia compare to that? In the Python version, I used TBB/OMP threading layers and some optimization to speed up the computation from many minutes to sub-second. AFAIK, both are using LLVM under the hood, so would Julia be faster, and if yes/no, why? Thanks! :slightly_smiling_face:

I think this should be marked off-topic instead of Performance.

The question is quite broad, and the only real answer is “it depends”, but I encourage you to build a full benchmark.

Sometimes we win over Numba, sometimes not. The reason is that Julia gives more information to the compiler than Python does, but LLVM is vast, and Numba may do something clever somewhere or use an optimisation we declined to apply for some reason.

As for the threading part, I don’t think we have any composability layer with Intel threading. You could look at oneAPI.jl, though it’s not that widely used, or OpenCL.jl, which may help. Before that, you should check whether Julia’s own threading system isn’t already enough.
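For reference, a minimal sketch of what base Julia’s threading looks like (the function name and workload are made up for illustration; start Julia with `julia -t auto` or set `JULIA_NUM_THREADS`):

```julia
using Base.Threads

# Fill `out` in parallel; Threads.@threads splits the iteration range
# across the available threads (dynamic scheduling by default since 1.8).
function parallel_map!(f, out, xs)
    @threads for i in eachindex(out, xs)
        @inbounds out[i] = f(xs[i])
    end
    return out
end
```

With a single thread this degrades gracefully to a serial loop, so it’s a cheap first thing to try before reaching for external threading layers.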

Finally, I want to say: we don’t tell people “even if you love Python, move to Julia”; I’m pretty sure we never did. If you’re fully happy with your Python code, I don’t see a reason to use Julia besides curiosity. You’re still very welcome of course, and if you want to optimise something a lot more, using Numba as a proof of concept for the performance is fully OK too, and people will help.

Marked as off-topic. I’ll build a full implementation of the same tool later for a comprehensive benchmark. On threading: TBB allows dynamic scheduling for nested loops with different loads per inner loop; I’ll look into whether Julia has something similar.

When I tested Numba, which was some years ago, small Julia functions were 10 times faster than small Numba functions because Julia aggressively inlines, so there’s no call overhead.
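As an illustration of that inlining (all names here are invented for the example): Julia will normally inline a tiny scalar kernel like this into its caller, which you can inspect with `@code_llvm`:

```julia
# Tiny scalar kernel; the compiler inlines small functions like this into
# callers automatically, so the loop below pays no per-call overhead.
axpy_scalar(a, x, y) = a * x + y

function saxpy!(out, a, x, y)
    @inbounds for i in eachindex(out, x, y)
        out[i] = axpy_scalar(a, x[i], y[i])  # inlined into the loop body
    end
    return out
end
```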

But the main advantage of Julia is the readability and maintainability:

Numba code looks like Python, but behaves very differently. That causes a lot of pain for people who have to read and maintain Python projects that use Numba.

True. And Numba’s compatibility with other tools isn’t great either; building binaries for a tool that uses Numba is so painful that I just chose PyPI after all. Julia probably doesn’t have such problems.

LLVM is only a fraction of any compiler, so this doesn’t really say much about performance. There would technically be a performance ceiling that one language implementation could reach for a given program, but we could easily write an LLVM-based language that is very slow. Numba did a lot of work on the front end to enable LLVM optimizations down the line, even deviating from Python semantics in a few key ways.

But we can still compare the performance of programs, of course. I’d agree with yolhan_mannes that it’s a mixed bag in practice, and part of the mixed bag is that equivalent programs aren’t even easy to write. I once tried to port a Julia program with a few custom nested structs and scalar functions to Numba-accelerated Python, and I could only manage part of it, with significant distortion of the API and the Numba usage. The developers of Python, NumPy, or Numba are not at fault for not providing the same features as Julia, just as Julia is not at fault for not providing inherited classes; they’re just different ways of instructing the machine. In my opinion, Julia has a large advantage in features because it was designed for a JIT, while Numba is retrofitted onto Python and NumPy; but if I needed to accelerate a Python project, I’d still reach for NumPy+Numba before interop with Julia or a port to Julia.
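To make the struct point concrete, here is a hypothetical sketch (all names invented) of the kind of nested structs plus scalar function that are trivial in Julia but awkward to express through Numba’s `jitclass`/`structref` machinery:

```julia
# Nested immutable structs with a scalar function over them; Julia's JIT
# specializes `frequency` on the concrete types with no extra annotations.
struct Spring
    k::Float64       # stiffness
end

struct Oscillator
    spring::Spring   # structs nest naturally
    mass::Float64
end

frequency(o::Oscillator) = sqrt(o.spring.k / o.mass) / (2π)
```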

As for CPU multithreading, I like how base Julia works a lot more than base Python: normal functions are fully integrated into asynchronous tasks on multiple OS threads on multiple CPU cores, and I didn’t need to hear about 3+ different frameworks along the way. But that’s not what Numba does, which is more like bolting C/C++ frameworks onto JIT-compiled Python+NumPy. I think they did a great job there, and your speedup from >1 min to <1 sec is a clear win, unlike my port attempt. I also really like the simplicity of Numba’s high-level parallelization API over loops and array operations. Julia does have high-level parallelization like that, or rather my problem is that there are MANY such third-party libraries (here’s a chapter about just a few) with very little discussion of where the most active development is; it’s similar to my problem with base Python’s parallelism, but worse in some ways. There is a good argument that these options give us more control, but at this point I prefer fewer, longer, single-core tasks (aside: this needs base Python, probably concurrent.futures.ProcessPoolExecutor, for separate no-GIL Numba functions) or the specific finer-grained parallelism built into LinearAlgebra’s BLAS or CUDA.jl when I can, which is quite often because I’m not doing anything particularly sophisticated.
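A sketch of that task model (illustrative names): any ordinary function call can become a concurrent task with `Threads.@spawn`, and `fetch` collects the result.

```julia
using Base.Threads

# Compute extrema of two halves concurrently: one half in a spawned task,
# the other in the current task, then combine the partial results.
function parallel_minmax(xs)
    mid = length(xs) ÷ 2
    task = Threads.@spawn extrema(view(xs, 1:mid))
    hi = extrema(view(xs, mid+1:lastindex(xs)))  # current task does this half
    lo = fetch(task)                             # wait for the spawned task
    return (min(lo[1], hi[1]), max(lo[2], hi[2]))
end
```

The same composition works whether the spawned work is a one-liner or a deep call tree, which is what “normal functions integrated into tasks” means in practice.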

I think this is kind of understating what TBB does. The superficial answer is “yes, Julia’s task scheduler does that”, but TBB is a whole different framework. For just one example, work stealing is a major aspect of TBB, while it hasn’t landed in Julia yet.
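Not TBB, but the closest base-Julia analogue for nested loops with uneven inner loads is to spawn one task per outer iteration and let the scheduler hand tasks to whichever thread is free (a sketch with a made-up workload):

```julia
using Base.Threads

# Inner workload grows quadratically with i, so static chunking would
# balance badly; one task per outer iteration lets idle threads pick up
# the heavy tail as earlier tasks finish.
function uneven_work(n)
    tasks = [Threads.@spawn sum(sin, 1:(i^2)) for i in 1:n]
    return sum(fetch, tasks)
end
```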

Thanks for the comprehensive reply!

Welp, I feel like Julia is Lisp-cursed, in the sense that little centralised effort goes into building mainstream libraries (apart from SciML and a few others); instead it’s spread across (perhaps unmaintained) individual projects. Although some say this aligns with Julia’s philosophy, I don’t feel too optimistic about going down this path.

I checked the Wikipedia page on this one, which says the two main approaches to dynamic scheduling are work stealing and work sharing. Does TBB use the latter?

Not that I’m aware of, but frankly that doesn’t mean much because I don’t really know how multithreading works. It’s probably more work to learn than the Julia ecosystem’s parallelization options. I also can’t really give insight on base Julia’s multithreading. For example, work sharing and work stealing are both broad approaches to balancing tasks across threads and processors; I have no idea where Julia’s thread-migrating tasks fall. It seems to do a good job of keeping my cores busy when I eyeball the Windows Task Manager, so I’m not sure how work stealing could improve that. Maybe it’s just more of an issue for the overheads of managing numerous shorter tasks, which is not something I have to deal with.

I tried implementing Julia vs Numba - #7 by lilachint; unless I made a mistake, we’re 32× faster here on a big enough piece of code:

using Statistics

const T = Float32

struct Ackley
    x_range ::Tuple{T,T}
    y_range ::Tuple{T,T}
    minima ::T
    location ::Tuple{T,T}
end
function Ackley()
    x_range = (T(-5),T(5))
    y_range = (T(-5),T(5))
    minima = T(0)
    location = (T(0),T(0))   
    return Ackley(x_range,y_range,minima,location) 
end

@fastmath function Query(A::Ackley,x::T, y::T) ::T
    z = -20*exp(-T(0.2)*sqrt(T(0.5)*(x*x+y*y))) - exp(T(0.5)*(cospi(2*x)+cospi(2*y))) + T(ℯ) + 20
    return z
end


function Initialise_Population(Population_size,optimiser)
    Population = Matrix{T}(undef,2,Population_size)
    fitness = Vector{T}(undef,Population_size)
    for i in 1:Population_size
        Population[1,i] = (optimiser.x_range[2]-optimiser.x_range[1]) * rand(T) + optimiser.x_range[1]
        Population[2,i] = (optimiser.y_range[2]-optimiser.y_range[1]) * rand(T) + optimiser.y_range[1]
        fitness[i] = Query(optimiser,Population[1,i] , Population[2,i])
    end
    return Population, fitness
end

struct Bat_algorithm
    x_range ::Tuple{T,T}
    y_range ::Tuple{T,T}
    Population_size ::Int32
    Population ::Matrix{T}
    fitness ::Vector{T}
    Loudness ::Vector{T}
    Pulse_Rate ::Vector{T}
    Frequency ::Matrix{T}
    Velocity ::Matrix{T}
    Movements ::Int32 
    Frequency_Range ::Tuple{T,T}
    Loudness_Decay ::T
    Loudness_Limit ::T
    Pulse_Rate_Decay ::T
    Gamma ::T
    Best_Position ::Vector{T}
    Best_fitness ::Base.RefValue{T}
    Best_Bat ::Base.RefValue{Int32}
end
function Bat_algorithm(optimiser; Population_size = Int32(100),Num_Movements=Int32(100))
    x_range = optimiser.x_range
    y_range = optimiser.y_range
    Population, fitness =  Initialise_Population(Population_size, optimiser)
    Loudness = ones(T,Population_size)
    Pulse_Rate = zeros(T,Population_size)
    Frequency = zeros(T,(2,Population_size))
    Velocity = ones(T,(2,Population_size))
    Movements = Num_Movements
    Frequency_Range = (T(0),T(1))
    Loudness_Decay = T(0.5)
    Loudness_Limit = T(0.05)
    Pulse_Rate_Decay = T(0.5)
    Gamma = T(0.5)
    Best_Position = Vector{T}(undef,2)
    Best_fitness = Ref(T(0))
    Best_Bat = Ref(Int32(0))
    return Bat_algorithm(
        x_range,
        y_range,
        Population_size,
        Population,
        fitness,
        Loudness,
        Pulse_Rate,
        Frequency,
        Velocity,
        Movements,
        Frequency_Range,
        Loudness_Decay,
        Loudness_Limit,
        Pulse_Rate_Decay,
        Gamma,
        Best_Position,
        Best_fitness,
        Best_Bat
    )
end

function update_dynamics!(Position,B::Bat_algorithm,bat,best)
    nf = B.Frequency_Range[1] + (B.Frequency_Range[2] - B.Frequency_Range[1])*rand(T)  # draw a frequency inside the range
    for i in 1:2
        nv = (B.Population[i,best] - B.Population[i,bat] ) *nf
        Position[i] = B.Population[i,bat] + nv
        B.Frequency[i,bat] = nf
        B.Velocity[i,bat] = nv
    end
    return Position
end

function run(B::Bat_algorithm, optimiser)
    best_fit, best_index = findmin(B.fitness)
    Position = T[0,0]
    for step in 2:B.Movements+1
        for bat in 1:B.Population_size
             #update the position of the bat 
            update_dynamics!(Position,B,bat,best_index)
            r1 = rand(T)
            r2 = rand(T)*2 - 1
            m = mean(B.Loudness)
            if r1 < B.Loudness[bat]
                for i in eachindex(Position)
                    Position[i] += m + r2 
                end
            end
            if Position[1] < optimiser.x_range[1]
                Position[1] = optimiser.x_range[1]
            elseif Position[1] > optimiser.x_range[2]
                Position[1] = optimiser.x_range[2]
            end
            if Position[2] < optimiser.y_range[1]
                Position[2] = optimiser.y_range[1]
            elseif Position[2] > optimiser.y_range[2]
                Position[2] = optimiser.y_range[2]
            end
            Bat_fitness = Query(optimiser,Position[1],Position[2])
            B.fitness[bat] = Bat_fitness
            for i in 1:2
                B.Population[i,bat] = Position[i]
            end
            @fastmath B.Pulse_Rate[bat] = B.Pulse_Rate_Decay*(1-exp(-B.Gamma*step))
            if B.Loudness[bat] > B.Loudness_Limit
                B.Loudness[bat] *= B.Loudness_Decay
            else
                B.Loudness[bat] = B.Loudness_Limit
            end
            if Bat_fitness < best_fit
                best_fit = Bat_fitness
                best_index = bat 
                B.Best_Position .= Position
                B.Best_fitness[] = Bat_fitness
                B.Best_Bat[] = bat
            end
        end
    end
end

function (@main)(ARGS::Vector{String})::Cint
    fn = Ackley()
    optimiser = Bat_algorithm(fn;Population_size = Int32(100),Num_Movements = Int32(200))
    t1 = time_ns()
    run(optimiser,fn)    
    println(Core.stdout,"time : ",1e-9*(time_ns()-t1))
    return 0
end
julia --check-bounds=no main.jl
time : 0.000682429

but my hardware may be very different from the one in the video, so one should run the Numba version on their own computer too. A 32× difference can’t really be explained by CPU differences alone in single-threaded mode, though.