Julia Beginner (from Python): Numba outperforms Julia in rewrite. Any tips to improve performance?

Thanks for the clarification, I will see if I get things right :slight_smile: . By the way, in this LazyGreedyJulia version, this line:

idxs = findall(==(0), mask)

is not doing anything useful, as idxs is not used anywhere in the loop. Just commenting out that line gives a 2x speedup.
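To see how much that dead line costs, here is a quick allocation check; a minimal sketch, assuming the array size of the dataset used in the MWE later in the thread:

using BenchmarkTools

mask = zeros(Int, 581_012)  # worst case: every element matches

# each call allocates and fills a fresh Vector{Int}; inside the fitting
# loop this work is redone on every iteration for nothing
@btime findall(==(0), $mask);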

edit:

If all is correct now, this is the time for the NaiveGreedy version:

---------------------------------------------
 Fitting time:
 19.153213 seconds (65.21 k allocations: 8.799 GiB, 0.13% gc time)
125819832 9897.307670089183
---------------------------------------------

I get a significant speedup if I use this function for the hot loop, @rmeinl:

function get_gains!(X, current_values, idxs, gains)
    # requires `using LoopVectorization` for @tturbo (threaded + SIMD)
    @tturbo for i in eachindex(idxs)
        s = 0.0
        for j in eachindex(current_values)
            s += sqrt(current_values[j] + X[j, idxs[i]])
        end
        gains[i] = s
    end
end;

(from ~20 s down to about ~13 s on my computer).

All files are here: https://gist.github.com/lmiq/828133335a83088e63f21152416c7f7c

Anyway, thanks for the feedback. I still think that if Numba is effectively doing much better than @tturbo here, we need to localize the difference in the smallest working example possible and bring Chris to the table :-).

Great point, thanks! I was already using the popat! function suggested above, so the speedup was not as drastic, but it still improved a little. The latest times I got were:

# NaiveGreedy
10.766 s (82941 allocations: 15.02 MiB)

# LazyGreedy
5.904 s (201 allocations: 167.93 MiB)

The most up-to-date functions are in julia_reimpl here: https://github.com/rmeinl/apricot-julia

Interesting! I get these:

# NaiveGreedy old 
11.923 s (82912 allocations: 15.02 MiB)

function get_gains!(X, current_values, idxs, gains)
    # Base threading: splits the candidate indices across threads
    Threads.@threads for i in eachindex(idxs)
        s = 0.0
        for j in eachindex(current_values)
            @inbounds s += @fastmath sqrt(current_values[j] + X[j, idxs[i]])
        end
        gains[i] = s
    end
end;

#############################

# NaiveGreedy with @tturbo
12.853 s (27 allocations: 8.89 MiB)

function get_gains!(X, current_values, idxs, gains)
    # requires `using LoopVectorization` for @tturbo (threaded + SIMD)
    @tturbo for i in eachindex(idxs)
        s = 0.0
        for j in eachindex(current_values)
            s += sqrt(current_values[j] + X[j, idxs[i]])
        end
        gains[i] = s
    end
end;

This is probably a MWE for the critical loop, in Julia:

using BenchmarkTools
using LoopVectorization

function get_gains!(x, y, idxs, gains)
    @tturbo for i in eachindex(idxs)
        s = 0.0
        for j in eachindex(y)
            s += sqrt(y[j] + x[j, idxs[i]])
        end
        gains[i] = s
    end
end;

d = 54
n = 581012
x = rand(d, n) * 100;
y = zeros(Float64, d)
idxs = collect(1:n);
gains = zeros(Float64, n);

@benchmark get_gains!($x, $y, $idxs, $gains)

Do you agree this is the bottleneck? I’m too unfamiliar with Python and Numba to be sure I can write something comparable there.

If we can really localize and quantify the performance difference, then we can post an issue somewhere so someone can take a look at it.

ps: what is the Python time for the naive greedy algorithm in this case?


The MWE looks good, and yes, it seems that get_gains! is the bottleneck.
The Python version ran in 13.24439001083374 seconds for the NaiveGreedy, but I slashed a lot of time off by removing allocations, as mentioned above.


I think it’s fine from here though! Thanks so much for your help. All of you!


Thank you for the very well documented example. If you or someone else could provide a MWE of the gains function above in Python/Numba, just to check whether there is effectively a huge difference between the parallelization and SIMD Numba is able to do and what Julia/@tturbo does, that would be nice. Just so we understand exactly what happened overall and, perhaps, document it for improvements.

I have one here: https://github.com/rmeinl/apricot-julia/blob/main/src/calc_gains.py. Not sure if I’m using the right tools for benchmarking though; the output is 690.234 ms compared to 8 ms in Julia.


Thanks, I get the same. I tried shuffling the indexes of idxs, with which the Julia code runs in 30 ms instead, but the Python code still takes 600 ms. Even a plain loop without @tturbo or threading runs in 145 ms; thus either we are not measuring what we think we are measuring, or the original difference between the codes is somewhere else.
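For reference, the shuffled variant is just the MWE above with a randomized access pattern; a minimal sketch, reusing x, y, idxs, and gains from that MWE:

using Random, BenchmarkTools

shuffle!(idxs)  # break the sequential column-access pattern
@benchmark get_gains!($x, $y, $idxs, $gains)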


Any easy example I can copy/paste?
Also, what CPUs are you both using?

Also, the size of current_values, X, and idxs?

I’m not sure if this will really help, but as far as I understand, the original problem was that this Python script runs in 15 s, while this Julia script runs in 26 s.

The Julia script there uses @threads which, if replaced by @tturbo (the two get_gains! functions are there), accelerates the script to 22 s (here). Thus @threads is doing a poor job and @tturbo is much better, something somewhat expected, since the loop does a very cheap calculation.
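To make that last point concrete, here is a rough illustration of the fixed cost @threads pays on every call, independent of the work inside the loop (my own toy example):

using BenchmarkTools

# an essentially empty threaded loop still pays the task spawn/join
# overhead on each call; for very cheap loop bodies this can dominate
f() = Threads.@threads for i in 1:10 end
@btime f();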

Why exactly the Python script is still faster is not clear to me yet, but I am inclined to believe that the get_gains! function and the efficiency of its parallelization/vectorization are no longer responsible for the difference. That is because, when we tried to trim down the performance of that function, here and here, it seems that Julia is doing fine relative to Numba (with or without @tturbo).

The “solution” then consisted of rewriting the algorithm: the improvements proposed above led to this “naive greedy” version, which runs in 13 s but keeps the same hot loop above, and to the improved “lazy greedy” version, which runs in 7 s.

Thus, part of the “original problem” was the poor performance of @threads for a loop that does very cheap calculations, but the greatest improvements came from changing other things not related to that loop. I still find the original problem interesting because, while the path of removing allocations etc. is clearly expected to improve the Julia code, it is not completely evident where the Python script gets that 7-10 s advantage in the original implementations (and it is not in the Numba-jitted part).


On my Windows 10 Python 3.8.5 environment, the execution time of fit(X=X_digits, k=k) in python_reimpl.ipynb is about 14-15 seconds. See https://github.com/genkuroki/public/blob/main/0016/apricot/python_reimpl.ipynb

I have made my own direct translation of python_reimpl.ipynb to Julia: julia_translation_of_python_reimpl.ipynb.

That translation does not use LoopVectorization.jl; it only uses Threads.@threads. It is just a simple, direct translation that contains nothing special.

Screenshot for line-by-line comparison between the Python version and the Julia version:

In my Julia 1.6.2 environment, the execution time of fit(X_digits, k) in that direct translation is about 12-13 seconds.

The plots of the two results are shown below.

Python 3.8.5 Numba: 13~14 sec

↓ straightforward simple translation

Julia 1.6.2: 12~13 sec

In my experience, Julia tends to be a bit faster than Numba when things go well; Numba is also fast enough and quite useful, so I encourage my friends to use it. But Numba requires special limiting conditions to be successful; it can’t be “anything goes” like Julia. Particular attention should be paid to the use of multiple libraries in combination. For example, take a look at the following:

EDIT1: Fix a logical bug of julia_translation_of_python_reimpl.ipynb and update the image.

EDIT2: Add a screenshot for line-by-line comparison.


Can you readily see now what’s so bad about this script? https://gist.github.com/lmiq/828133335a83088e63f21152416c7f7c#file-apricot_naivegreedy_old-jl

As already pointed out, idxs = findall(==(0), mask) causes huge memory allocation.

The only operation needed is to remove the element idxs[idx] from the array idxs. A straightforward Julia translation would be popat!(idxs, idx), even if there may be a better way.

Documentation: Base.popat!
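For anyone unfamiliar with it, a quick REPL demonstration of popat! (my own toy example):

julia> idxs = [10, 20, 30, 40];

julia> popat!(idxs, 2)  # removes and returns the element at index 2, in place
20

julia> idxs
3-element Vector{Int64}:
 10
 30
 40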

Well, I don’t quite agree that that was a straightforward translation. You improved what was being done, at least as far as I can understand. The np.where command in Python returns the array of indices for which mask == 0:

>>> import numpy as np
>>> mask = np.random.randint(0, high=10, size=10)
>>> mask
array([2, 1, 0, 1, 4, 6, 0, 4, 3, 3])
>>> idxs = np.where(mask == 0)[0]
>>> idxs
array([2, 6])

This seems equivalent to using findall in Julia:

julia> mask = rand(0:10,10);

julia> idxs = findall(isequal(0), mask)
2-element Vector{Int64}:
 1
 4

So what is the miracle Python is doing there? Why is that findall slowing down the Julia code so much, but np.where not the Python one?

IMO the “straightforward” translation would be pretty much what the OP posted originally, and while I agree that it can be improved (as you did), I don’t see why it is 50% slower in Julia than in Python.


Because findall(==(0), mask) grows its result incrementally by push!ing new elements, it requires several copies and is probably hard, if not impossible, to SIMD. It’s faster to use findall(mask .== 0), which generates a BitVector that allows for efficient determination of the output size (probably a popcnt on the underlying UInt64 storage). Python is probably doing something similar.
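A minimal sketch of the two-pass idea behind findall(mask .== 0), as I understand it (not the actual Base implementation):

function findall_twopass(mask)
    b = mask .== 0                      # BitVector; the broadcast can SIMD
    out = Vector{Int}(undef, count(b))  # count is a popcount over the packed UInt64 chunks
    k = 0
    for i in eachindex(b)
        if b[i]                         # second pass: fill the presized output
            out[k += 1] = i
        end
    end
    return out
end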


That is a nice insight. Unfortunately, my microbenchmarks indicate that findall is faster than np.where. But in that respect, I must admit that I am not completely confident that I’m benchmarking the Python script correctly:

>>> mask = np.random.randint(0,high=10,size=1_000_000)
>>> 1000*timeit.timeit("np.where(mask==0)", number=1, globals=globals())
17.385129001922905  # milliseconds
julia> mask = rand(0:10,1_000_000);

julia> @btime findall(isequal(0),$mask)
  2.366 ms (18 allocations: 2.00 MiB)

Interesting. Look at this:

jl> @btime findall(==(0), r) setup=(r=rand((0,1), 100_000));
  624.500 μs (17 allocations: 1.00 MiB)

jl> @btime findall(r .== 0) setup=(r=rand((0,1), 100_000));
  68.600 μs (5 allocations: 402.48 KiB)

Here’s a direct comparison to numpy:

In [2]: mask = np.random.randint(0,high=10,size=1_000_000)

In [3]: %timeit np.where(mask==0)
1.78 ms ± 7.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Julia:

jl> @btime findall(==(0), r) setup=(r=rand(0:10, 1_000_000));
  1.770 ms (18 allocations: 2.00 MiB)

jl> @btime findall(r .== 0) setup=(r=rand(0:10, 1_000_000));
  575.500 μs (6 allocations: 831.19 KiB)

Python and Julia have very different GCs that are optimized for different things. I wonder if Python’s GC is just better optimized for this particular allocation pattern. Julia’s GC is a noncompacting mark-and-sweep, probably optimized for high throughput with large allocations (say, big matrices), whereas Python’s is based on reference counting, probably optimized for many small allocations, since that’s the way you are supposed to write Python code.

I say “probably” since I’m not an expert on garbage collection; I’m merely guessing.

Ehhh, guys, %timeit turns off GC during the timing.
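For comparison, BenchmarkTools.jl keeps GC enabled and reports its share of the time separately; a small sketch:

using BenchmarkTools

mask = rand(0:10, 1_000_000)

# the Julia timings above include GC cost; a trial lets us see how much
b = @benchmark findall(==(0), $mask)
t = median(b)
println("median: ", t.time / 1e6, " ms, GC fraction: ", t.gctime / t.time)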
