Julia beginner (from Python): Numba outperforms my Julia rewrite. Any tips to improve performance?

Hey! I’m new to Julia (coming from Python) and was trying to benchmark Julia against a part of the apricot Python package (submodular optimization). Unfortunately, the Python version (using Numba) is still a bit faster (about 1.5x). I read through the performance optimization chapter and have already applied the following:

  • put core components into functions,
  • used views for array slices,
  • used Threads for parallel for-loops,
  • used @fastmath,
  • used column-major (column-first) orientation.

I asked in the Julia Slack and, based on someone’s recommendation, we improved the code by 10x, but the key part (the calculate_gains function) is still slower than Numba.

Here’s the code: https://github.com/rmeinl/apricot-julia (in notebooks/ you’ll find python_reimpl and julia_reimpl). Was wondering if anyone had any recommendations on what I could further improve. :slightly_smiling_face:
9 Likes

Is it possible for you to post a working example of that function alone?

Edit:

# For each candidate column idxs[i], accumulate the sum over j of sqrt(current_values[j] + X[j, idxs[i]])
function get_gains!(X, current_values, idxs, gains)
    @inbounds Threads.@threads for i in eachindex(idxs)
        s = 0.0
        for j in eachindex(current_values)
            s += @fastmath sqrt(current_values[j] + X[j, idxs[i]])
        end
        gains[i] = s
    end
end;

If that is the bottleneck, have you tried using @tturbo on the inner (or outer) loop instead of @threads? Or at least @turbo on the inner loop? It seems to me that what is being computed is too simple for standard threading to be useful. If the number of elements of current_values is large, this is probably also an easy use case for a GPU-accelerated map-reduce.
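
For what it’s worth, a minimal sketch of how one might try @tturbo on the outer loop (assuming LoopVectorization is installed; whether it handles the indirect idxs[i] indexing well may depend on the package version):

using LoopVectorization

function get_gains_tturbo!(X, current_values, idxs, gains)
    @tturbo for i in eachindex(idxs)            # threaded + SIMD
        s = 0.0
        for j in eachindex(current_values)      # walks down one column of X
            s += sqrt(current_values[j] + X[j, idxs[i]])
        end
        gains[i] = s
    end
    return gains
end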

Edit 2:

In [25]:
opt2 = LazyGreedy(X_digits);
res2 = @btime fit(opt2, k);
bench2 = @benchmark fit(opt2, k);
t2 = bench2.times[1]/1e9;
  13.836 s (22202 allocations: 8.95 GiB)

Those allocations, unless you know exactly where they are coming from, are an indication of a type instability. I find it very hard to debug a notebook, but they may be an indication that the program can be much faster.
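
One generic way to hunt these down (a sketch with dummy data, not tied to the notebook) is to inspect the hot function with @code_warntype and check the allocations per call:

using BenchmarkTools

X = rand(54, 1_000); cv = rand(54); idxs = collect(1:1_000); gains = zeros(1_000)

# Union-typed or Any-typed variables highlighted here point to type instabilities
@code_warntype get_gains!(X, cv, idxs, gains)

# A type-stable, in-place kernel should report (close to) zero allocations
@btime get_gains!($X, $cv, $idxs, $gains)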

5 Likes

FYI, @inbounds Threads.@threads for does not remove the bounds checks in the loop body. See: What is the cause of this performance difference between Julia and Cython? - #4 by tkf
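
In other words, move the annotation into the body, roughly like this (a sketch; the gain, if any, depends on whether the checks were being elided already):

function get_gains!(X, current_values, idxs, gains)
    Threads.@threads for i in eachindex(idxs)
        @inbounds begin                       # now applies inside the closure that @threads creates
            s = 0.0
            for j in eachindex(current_values)
                s += @fastmath sqrt(current_values[j] + X[j, idxs[i]])
            end
            gains[i] = s
        end
    end
end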

4 Likes

Thanks for the pointer! I put it inside the loop, but it didn’t give any performance improvement.

I tried using @tturbo instead of Threads.@threads and it was slightly slower. I also tried using @avx and @turbo in the inner loop, and it threw this error:

TaskFailedException

    nested task error: MethodError: no method matching subsetview(::VectorizationBase.FastRange{Int64, Static.StaticInt{0}, Static.StaticInt{1}, Int64}, ::Static.StaticInt{1}, ::Int64)
    Closest candidates are:
      subsetview(::VectorizationBase.StridedPointer{T, N, C, B, R, X, O}, ::Static.StaticInt{I}, ::Integer) where {T, N, C, B, R, X, O, I} at /Users/ricomeinl/.julia/packages/LoopVectorization/yPPLg/src/vectorizationbase_compat/subsetview.jl:2

Sure, I created a little script that only uses that function. It’s here: https://github.com/rmeinl/apricot-julia/blob/main/src/calc_gains.jl

Does that mean if I explicitly add type annotations to the function I could speed it up?

What size is the data, i.e. what are d and n in the script you link? (Without having to solve ScikitLearn installation problems… the times would work as well with rand(d,n), I presume.)

And is get_gains! meant only to work with idxs == 1:n, or will these in general be in some other order? That might make a difference, and would certainly allow simplification if not required.

And finally, does it matter that X is transposed? Your loop (if I’m reading it right) would prefer that j indexes adjacent elements, which would I think be true with X_digits = permutedims(abs.(digits_data["data"]));.

2 Likes

(d, n) = (54, 581012). I changed the script to use a random matrix.

idxs is simply an array of indices, they could be in any order.

The reason I transposed it from (number of elements, dimensions) to (dimensions, number of elements) is that I read that Julia stores matrices in column-major (column-first) order, which should give a speedup if the matrix is transposed.

1 Like

OK, great. Here are times I get, with this size. I think the transpose might not be doing quite what you’re thinking, but especially for random (or shuffled) indices it will matter quite a bit. (Both here and in python.) In that case, fancy packages seem to get me almost another factor of 2.

Xmat = rand(54, 581_000); # approx true size now
Xlazy = transpose(permutedims(Xmat));
Xmat == Xlazy  # same numbers, but different memory:  strides(Xmat), strides(Xlazy)

ind_order = collect(1:581_000);
ind_rand = rand(1:581_000, 581_000);

current = rand(54);
gains = zeros(581_000);
# get_gains_0!(Xmat, current, ind_rand, gains) # version with bounds checks

@btime get_gains!($Xmat, $current, $ind_order, $gains);  #  5.303 ms
@btime get_gains!($Xmat, $current, $ind_rand, $gains);   # 21.661 ms
@btime get_gains!($Xlazy, $current, $ind_order, $gains); #  8.233 ms
@btime get_gains!($Xlazy, $current, $ind_rand, $gains);  # 87.751 ms

using Tullio
get_7!(X, cv, idxs, gains) = @tullio gains[i] = sqrt(cv[j] + X[j, idxs[i]]) avx=false
using LoopVectorization
get_8!(X, cv, idxs, gains) = @tullio gains[i] = sqrt(cv[j] + X[j, idxs[i]])
# Assuming this is the relevant case:
@btime get_7!($Xmat, $current, $ind_rand, $gains);   # 13.965 ms
@btime get_8!($Xmat, $current, $ind_rand, $gains);   # 11.944 ms

# Eager permutedims is quite expensive, but perhaps only once:
@btime permutedims($Xmat);                # 199.528 ms 
@btime permutedims($(permutedims(Xmat))); #  61.824 ms
5 Likes

Those allocations come from idxs = findall(==(0), mask). (See sixth line from the bottom of In [23] of https://github.com/rmeinl/apricot-julia/blob/4ab1e5f397155b116a36769e59d446d4f465c40a/notebooks/julia_benchmark.ipynb)

The idxs will be a vector of Int64 with length of several hundred thousand.

n = 581012
mask = zeros(Int8, n)
mask[1] = 1
@time idxs = findall(==(0), mask);
0.007981 seconds (21 allocations: 9.001 MiB)

These are repeated 1000 times to produce about 2×10⁴ allocations and 9 GiB.

3 Likes

Numba is very strict, and it sometimes even distorts the semantics of Python because it can’t tolerate any type instability. For example:

1 if np.random.rand() < 0.5 else 0.5

will be turned into returning 1.0 or 0.5 (always a float).

All I’m trying to say is that if your functions work nicely in Numba out of the box, you’re basically comparing a C function to a Julia function. Assuming they do similar things algorithmically, I’d hope neither is much slower than the other!
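
For comparison, the literal Julia counterpart keeps both branch types instead of silently promoting, so the instability stays visible (a small illustration, not from the original code):

# Inferred as Union{Int64, Float64}: one branch returns an Int, the other a Float64
unstable() = rand() < 0.5 ? 1 : 0.5

# Making both branches Float64 restores type stability
stable() = rand() < 0.5 ? 1.0 : 0.5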

Btw, in your code X is a non-constant global, which causes type instability everywhere.
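
Two common fixes for that (a generic sketch; the names are illustrative, not taken from the repository):

# Option 1: declare the global constant so the compiler knows its type
const X_const = rand(54, 581_000)

# Option 2 (usually preferable): pass the data as an argument instead of
# reading a non-constant global inside hot code
column_gain(X, cv, i) = sum(j -> sqrt(cv[j] + X[j, i]), eachindex(cv))  # hypothetical helper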


The reason to use Julia instead of Numba is of course that Numba only works with a subset of Python semantics and often doesn’t work well with third-party packages that have their own C/C++ backend (outside of the NumPy ecosystem).

2 Likes

As I commented in another thread, micro-optimizing a single vectorized function that does a single operation (sums the sqrt of array elements in your case) is not really the best way to get performance improvements:

12 Likes

Following the advice above, maybe the best thing would be to avoid creating this intermediate array entirely. It may cost as much as the gains evaluation, or even more.

1 Like

I have tried to remove those allocations.

mask = zeros(Int8, n)
→ deleted

idxs = 1:n
→ idxs = collect(1:n)

current_values += view(optimizer.X, :, best_idx)
→ current_values .+= view(optimizer.X, :, best_idx)

current_concave_values .= sqrt.(current_values)
→ deleted

current_concave_values_sum = sum(current_concave_values)
→ current_concave_values_sum = sum(sqrt, current_values)

mask[best_idx] = 1; idxs = findall(==(0), mask)
→ popat!(idxs, findfirst(==(best_idx), idxs))

The last one is essential.
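
For context, the essence of that last change is to allocate the candidate list once and shrink it in place, rather than rebuilding it with findall on every iteration (a minimal sketch of the idea, not the exact notebook code):

n = 581_012
idxs = collect(1:n)        # candidate indices, allocated once

best_idx = 42              # whichever index the greedy step just selected
# drop it in place instead of `mask[best_idx] = 1; idxs = findall(==(0), mask)`
popat!(idxs, findfirst(==(best_idx), idxs))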

See In [10] of my Jupyter notebook for more details.

Verification of the coincidence of the two results:

@show res1rev.ranking == res1.ranking
@show res1rev.gains ≈ res1.gains;
res1rev.ranking == res1.ranking = true
res1rev.gains ≈ res1.gains = true

Subset of show(to) of the original version:

 Section        ncalls     time   %tot     avg     alloc   %tot      avg
 ────────────────────────────────────────────────────────────────────────
 Tot / % measured:         20.2s / 94.6%           9.19GiB / 96.1%
 calc_gains      1.00k    12.0s  62.9%  12.0ms   13.3MiB  0.15%  13.6KiB
 select_next     1.00k    5.76s  30.2%  5.76ms   8.82GiB   100%  9.03MiB

Subset of show(to) of the revised version:

 Section        ncalls     time   %tot     avg     alloc   %tot      avg
 ────────────────────────────────────────────────────────────────────────
 Tot / % measured:         13.3s / 98.1%            274MiB / 5.31%
 calc_gains      1.00k    11.5s  88.7%  11.5ms   6.10MiB  42.0%  6.25KiB
 select_next     1.00k    184ms  1.41%   184μs   3.93MiB  27.0%  4.02KiB

5.76s, 8.82GiB → 184ms, 3.93MiB

Postscript: Although not essential in this case, it should be noted that, in general, sum(f.(A)) for an array A will produce an allocation for f.(A). If sum(f, A) is used instead, the allocation will be zero; mean(f.(A)), minimum(f.(A)), etc. should likewise be rewritten as mean(f, A), minimum(f, A), etc.
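
A quick way to see the difference (a small illustration with a random vector):

using BenchmarkTools

A = rand(10^6)

@btime sum(sqrt.($A))   # allocates a temporary million-element array
@btime sum(sqrt, $A)    # zero allocations; applies sqrt element by element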

15 Likes

Very nice! Thank you. This slashed the time of both my NaiveGreedy and LazyGreedy by almost 2x.

# NaiveGreedy old
18.885 s (104754 allocations: 8.80 GiB)

# NaiveGreedy new
10.610 s (83658 allocations: 15.04 MiB)

# LazyGreedy old
13.836 s (22202 allocations: 8.95 GiB)

# LazyGreedy new
6.264 s (200 allocations: 167.93 MiB)

Fair point. I was just wondering why the Numba version was still faster. I’d have expected them to be the same, and then the Julia “glue” code to be a lot faster.

Not sure if I understand. Do you recommend using permutedims over transpose? And is Tullio faster than @fastmath? How would you apply your recommendations to this script: https://github.com/rmeinl/apricot-julia/blob/main/src/calc_gains.jl

transpose is lazy, so you don’t get the memory benefits of looping over columns just because you use transpose.

Better to use permutedims from the get-go, and try to read the data in so it has the “long” shape from the very beginning.
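
The difference is easy to check at the REPL (a small illustration):

A = rand(3, 5)

typeof(transpose(A))    # Transpose{Float64, Matrix{Float64}}: a lazy wrapper, no copy
typeof(permutedims(A))  # Matrix{Float64}: a new array, memory physically reordered

strides(transpose(A))   # (3, 1): walking down a "column" of the wrapper jumps in memory
strides(permutedims(A)) # (1, 5): columns of the new matrix are contiguous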

It would be great if you could decompose changes you made into two parts

  1. Reducing allocations, making sure to loop along contiguous memory etc.
  2. The addition of @tullio, @fastmath etc.

I would bet that you can get Numba-level performance with just 1., and then 2. should push it past Numba, but I’m not sure.

3 Likes

I understand that you’re just trying to explain best practices, but having actually benchmarked this problem, you are not correct.

  1. The fact that transpose is lazy doesn’t matter here, because the processing is done on a type wrapping Matrix{Float64}, so convert(Matrix{Float64}, A) is automatically called.
  2. The large majority of time is spent in a single function. Removing allocations does have a large effect, but nowhere enough to get it to Numba speed.
  3. Numba is actually really fast here. Like, really fast. Fast enough that I seriously doubt you can beat it using only Base Julia.

Without having played around with Tullio etc, I believe the major reason Numba is faster than Base Julia here is that it schedules its threads more effectively than Threads.@threads for the hot inner loop.
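
If thread scheduling really is the difference, one thing to try (a sketch under the assumption that the kernel above is the hot loop; the chunking strategy would need tuning) is to hand out static chunks with Threads.@spawn instead of relying on Threads.@threads:

function get_gains_chunked!(X, current_values, idxs, gains; nchunks = Threads.nthreads())
    chunks = Iterators.partition(eachindex(idxs), cld(length(idxs), nchunks))
    tasks = map(chunks) do chunk
        Threads.@spawn @inbounds for i in chunk
            s = 0.0
            for j in eachindex(current_values)
                s += @fastmath sqrt(current_values[j] + X[j, idxs[i]])
            end
            gains[i] = s
        end
    end
    foreach(wait, tasks)   # block until every chunk has been processed
    return gains
end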

All this is, of course, without thinking about the algorithm itself. It may very well be that @stevengj is on point in his comment, and this problem can be solved more effectively by doing more scalar operations instead of broadcasted operations.

6 Likes

Edit: I split the Python and Julia code into separate files (the original one, not the one optimized by @genkuroki). This is what I get:

% python3 apricot.py
-------------------------------------------------
Fitting time:  12.442765712738037
125818832 9897.307670089194
-------------------------------------------------

% julia -t auto -i apricot.jl
  Activating environment at `~/Drive/Work/JuliaPlay/apricot/Project.toml`
---------------------------------------------
 Fitting time:
 12.268594 seconds (22.16 k allocations: 8.951 GiB, 0.22% gc time)
125819832 9897.307670089054
---------------------------------------------

The files are here: https://gist.github.com/lmiq/828133335a83088e63f21152416c7f7c

The “hot loop” takes 2.5 seconds of those:

 ───────────────────────────────────────────────────────────────────
                            Time                   Allocations      
                    ──────────────────────   ───────────────────────
  Tot / % measured:      53.3s / 4.86%           19.9GiB / 0.00%    

 Section    ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────
 hot loop    9.43M    2.59s   100%   275ns     0.00B   - %     0.00B
 ───────────────────────────────────────────────────────────────────


Thus, I am not sure if I’m doing everything wrong, or if we are not really comparing apples to apples in general.

@lmiq you’re comparing Julia’s LazyGreedy with Python’s NaiveGreedy. I should’ve made it more explicit, but the Python fit() method refers to the NaiveGreedy, which is a lot faster in Julia.

2 Likes