Julia Beginner (from Python): Numba outperforms Julia in rewrite. Any tips to improve performance?

That’s a good guess, but it does not explain the difference:

Julia (@tturbo version):
Fitting time:
20.318218 seconds (22.10 k allocations: 8.795 GiB, 0.36% gc time)
125819832 9897.307670089082

(Python:)

Fitting time: 12.01971983909607
125818832 9897.307670089194

(not using timeit in these)

What’s called the “garbage collector” in Python is a separate GC that is only used to break reference cycles. This GC supplements the reference counting scheme that is used for regular memory management. So the disabling of the GC that %timeit does will only stop the cycle collector, regular reference counting will still be active. But this can still have an effect on the timing outcome, of course.
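A minimal CPython demonstration of this distinction (a sketch; `Node` is just an illustrative class): with the cycle collector disabled, reference counting still frees non-cyclic objects immediately, while cyclic garbage lingers until the collector runs again.

```python
import gc
import weakref

class Node:
    pass

gc.disable()  # disables only the cycle collector, not reference counting

# Non-cyclic object: freed immediately when its refcount hits zero
obj = Node()
ref = weakref.ref(obj)
del obj
print(ref() is None)  # True: freed despite gc.disable()

# Cyclic objects: reference counting alone cannot reclaim them
a, b = Node(), Node()
a.other, b.other = b, a
cyc = weakref.ref(a)
del a, b
print(cyc() is None)  # False: the cycle keeps both alive

gc.enable()
gc.collect()          # the cycle collector breaks the cycle
print(cyc() is None)  # True
```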

1 Like

No miracles.

Apart from the Julia version using popat!, I have also made a version that repeatedly creates the index vectors idxs by bitvector masking (with huge allocations). The bitvector masking Julia version is a more straightforward translation than the popat! version.

The bitvector masking version appears to be a completely straightforward translation, line by line, from the Python version. Look at the screenshot of the Jupyter notebooks, Julia bitvector masking version and Python version, below.

The results of execution time are as follows:

  • Julia popat! version: 12-13 seconds (15MB)
  • Julia bitvector masking version: about or under 14 seconds (4.3GB)
  • Python version: 14-15 seconds

If there is a miracle on Python, then there must be a miracle on Julia. In fact there are no miracles anywhere.

EDIT: axes(mask, 1)[mask] → findall(mask)

4 Likes

Ok, here is an MWE where the Julia code is 10x slower than the Python code:

In Julia:

function fit()
    local idxs
    mask = zeros(Int, 500_000)
    for i in 1:1000
       idxs = findall(==(0), mask)
    end
    return length(idxs)
end
println(fit())
@time fit();

Python:

import time
import numpy as np

def fit():
    mask = np.zeros(500_000)
    for i in range(0,1000):
        idxs = np.where(mask == 0)[0]
    return idxs.size

print(fit())
tic = time.time()
fit()
toc0 = time.time() - tic
print(toc0)

If one replaces findall(==(0), mask) with findall(mask .== 0), the Julia code improves a lot, but it is still 50% slower than the Python one.

In the end, it comes down to this:

julia> @btime findall($mask .== 0);
  524.255 μs (6 allocations: 3.88 MiB)

julia> @btime findall(==(0),$mask);
  3.105 ms (20 allocations: 5.00 MiB)

IMO that suggests there is room for improvement in findall.

I appreciate that, but I do not think that translating np.zeros(n) to trues(n), and masking by the bit vector, are straightforward translations. What I wanted to know is where the difference was coming from, not whether a faster version could be written. The “problem” with findall explains most of it, although I think I noticed an important time gain when I removed the call to argmax, something that was not noticeable in the Python version.

2 Likes

Since there is no direct counterpart to Python’s np.where in Julia (findall(==(0), mask) is also not a strictly direct counterpart of np.where(mask == 0)[0]), that level of difference is unavoidable.

If you are dissatisfied, please rewrite the Python side of the code to use a counterpart of Julia’s BitVector.

Also, note that the Julia code is shorter. It would be odd to prohibit something that can be written simply on the Julia side.

The definite fact is that there are no miracles.

If we write the code normally on the Julia side, we can make it faster than the Python Numba version without even reaching for a particularly excellent tool like LoopVectorization.jl.

It would be, however, very interesting if someone could report on the details of the implementation of np.where in NumPy.
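As a starting point for that investigation, NumPy documents single-argument np.where(cond) as equivalent to np.nonzero(cond), so the relevant implementation lives in NumPy’s nonzero machinery. A quick check:

```python
import numpy as np

mask = np.zeros(8)
mask[[2, 5]] = 1.0

# np.where with a single argument is documented to behave like np.nonzero
idx_where = np.where(mask == 0)[0]
idx_nonzero = np.nonzero(mask == 0)[0]

print(idx_where)  # the indices where mask is zero
assert (idx_where == idx_nonzero).all()
```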

3 Likes

Looks like the core method is defined here: https://github.com/numpy/numpy/blob/e69faef03428b82f5651a5fb521610ddea2bd22a/numpy/core/src/multiarray/item_selection.c#L2600-L2633

Linked PR: https://github.com/numpy/numpy/pull/4370/files

It appears to use multithreading beyond some input array size threshold, which the Julia implementation doesn’t yet allow.

1 Like

Note that numpy is BSD-3 licensed.

My point is that for me, and probably for others, it was not at all obvious that findall(==(0), x) could be so much worse than findall(x .== 0). Personally, I would even have expected the opposite, since it is evident that x .== 0 allocates an intermediate vector that could be avoided.

It might well be that, depending on the test function and the size and type of the vectors, the implementation of findall is the best one can get considering the pros and cons. But I, at least, learnt something (about Julia) here.

2 Likes

Interestingly, it appears this slowness has already been fixed in master [EDIT:] and 1.7. In array.jl line 2280:

# Broadcasting is much faster for small testf, and computing
# integer indices from logical index using findall has a negligible cost
findall(testf::Function, A::AbstractArray) = findall(testf.(A))

So this performs the same as findall(mask .== 0). It seems that findall was written generically for any collection, and this specialization for AbstractArray was added recently.

5 Likes

Diving into NumPy, the relevant C code is here.

https://github.com/numpy/numpy/blob/e69faef03428b82f5651a5fb521610ddea2bd22a/numpy/core/src/multiarray/item_selection.c#L2569-L2653

The first thing that happens is that NumPy counts the number of nonzeros. If the number of nonzeros is 0, then it just returns an empty array:
https://github.com/numpy/numpy/blob/e69faef03428b82f5651a5fb521610ddea2bd22a/numpy/core/src/multiarray/item_selection.c#L2596-L2598

We can implement that shortcut in Julia:

julia> function fit()
           local idxs
           mask = zeros(Int, 500_000)
           for i in 1:1000
              mask_nz = Array{Bool}(mask)
              num_nz = count(mask_nz)
              if num_nz == 0
                 idxs = Int[]
              else
                 idxs = findall(mask_nz)
              end
           end
           return length(idxs)
       end
fit (generic function with 1 method)

julia> @time fit()
  0.551720 seconds (3.00 k allocations: 480.835 MiB, 1.18% gc time)
0

In comparison, the Python code takes 0.7 seconds:

In [6]: import time
   ...: import numpy as np
   ...: 
   ...: def fit():
   ...:     mask = np.zeros(500_000)
   ...:     for i in range(0,1000):
   ...:         idxs = np.where(mask == 0)[0]
   ...:     return idxs.size
   ...: 
   ...: print(fit())
   ...: tic = time.time()
   ...: fit()
   ...: toc0 = time.time() - tic
   ...: print(toc0)
   ...: 
500000
0.7331986427307129
3 Likes

The version above has an issue: what happens if mask contains values that are not 0 or 1? In that case, we can use count, which turns out to be even faster.

julia> function fit()
           local idxs
           mask = zeros(Int, 500_000)
           for i in 1:1000
              num_nz = count(!=(0),mask)
              if num_nz == 0
                 idxs = Int[]
              else
                 idxs = findall(!=(0), mask)
              end
           end
           return length(idxs)
       end
fit (generic function with 1 method)

julia> @time fit()
  0.289683 seconds (1.00 k allocations: 3.891 MiB)
0
1 Like

I just realized that I inverted the problem in the process of reverse engineering NumPy; here’s the fix:

julia> function fit()
           local idxs
           mask = zeros(Int, 500_000)
           for i in 1:1000
              num_nz = count(==(0),mask)
              if num_nz == 0
                 idxs = Int[]
              elseif num_nz == length(mask)
                 idxs = eachindex(mask)
              else
                 idxs = findall(==(0), mask)
              end
           end
           return length(idxs)
       end
fit (generic function with 1 method)

julia> @time fit()
  0.288805 seconds (2 allocations: 3.815 MiB)
500000

I don’t think you have to, if you use master. That version of findall(==(0), mask) now ends up calling the version in bitarray.jl, which starts like this:

function findall(B::BitArray)
    nnzB = count(B)
    I = Vector{eltype(keys(B))}(undef, nnzB)
    nnzB == 0 && return I

So I believe it’s already similar to NumPy by counting non-zeros, [EDIT: my order was wrong] pre-allocating, and shortcutting if there are none.
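The same count-then-preallocate strategy can be sketched in plain Python (an illustrative sketch, not NumPy’s or Julia’s actual code; the helper name findall_like is made up):

```python
import numpy as np

def findall_like(mask):
    """Return the indices of the true entries of a boolean array.

    Count-then-preallocate: count the hits first, allocate the result
    exactly once, and shortcut when there are none.
    """
    n_hits = int(np.count_nonzero(mask))
    out = np.empty(n_hits, dtype=np.intp)
    if n_hits == 0:
        return out  # shortcut: nothing to fill
    j = 0
    for i, v in enumerate(mask):
        if v:
            out[j] = i
            j += 1
    return out

print(findall_like(np.array([False, True, False, True])))  # [1 3]
```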

3 Likes

Do you need to count for each iteration?

I’m assuming that the loop is for benchmarking purposes. Otherwise, the whole operation could be just done once.

I really think that benchmarking code should be separated from the code to be benchmarked. When benchmarking loops are mixed into the main code, it makes it hard to understand what could be optimized and improved.
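On the Python side, one way to keep the harness out of the measured code looks like this (a sketch; the function name work is just for illustration):

```python
import timeit
import numpy as np

def work(mask):
    # The operation under test: no benchmark loop inside.
    return np.where(mask == 0)[0]

mask = np.zeros(500_000)

# The repetition count lives in the harness, not in the measured function.
per_call = timeit.timeit(lambda: work(mask), number=100) / 100
print(f"{per_call * 1e3:.3f} ms per call")
```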

4 Likes

Truly straightforward translation! :partying_face:


Jupyter notebook: https://github.com/genkuroki/public/blob/main/0016/apricot/float64_vector_version.ipynb

  • Python mask = np.zeros(n) → Julia mask = zeros(n)
  • Python mask[best_idx] = 1 → Julia mask[best_idx] = 1
  • Python idxs = np.where(mask == 0)[0] → Julia idxs = findall(mask .== 0)

The last one is the essential part again.

The results of the execution time are as below:

  • Python Numba version: 14-15 sec
  • Julia findall(mask .== 0) version: about 14 sec

Explanation:

  • The type of mask is Vector{Float64}. (Normally, masking is done with Bool values and not with Float64 ones.)
  • mask .== 0 becomes of type BitVector.
  • findall(x::BitVector) is very fast. :blush:

Conclusion:

  • Numba does not outperform Julia in the truly straightforward translation.
12 Likes