What’s called the “garbage collector” in Python is a separate GC that is only used to break reference cycles; it supplements the reference counting scheme used for regular memory management. So the disabling of the GC that %timeit does only stops the cycle collector; regular reference counting stays active. But this can still have an effect on the timing outcome, of course.
Apart from the Julia version using popat!, I have also made a version that uses bitvector masking to create the index vectors idxs repeatedly (with huge allocations). The bitvector masking Julia version is a more straightforward translation than the popat! version.
In fact, the bitvector masking version appears to be a completely straightforward, line-by-line translation from the Python version. See the screenshots of the Jupyter notebooks below, showing the Julia bitvector masking version and the Python version.
OK, here is an MWE; the Julia code is 10x slower than the Python code:
In Julia:
function fit()
    local idxs
    mask = zeros(Int, 500_000)
    for i in 1:1000
        idxs = findall(==(0), mask)
    end
    return length(idxs)
end

println(fit())
@time fit();
Python:
import time
import numpy as np

def fit():
    mask = np.zeros(500_000)
    for i in range(0, 1000):
        idxs = np.where(mask == 0)[0]
    return idxs.size

print(fit())
tic = time.time()
fit()
toc0 = time.time() - tic
print(toc0)
If one replaces findall(==(0), mask) with findall(mask .== 0), the Julia code improves a lot, but it is still 50% slower than the Python one.
IMO that suggests there is room for improvement in findall.
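For reference, here is a minimal way to compare the two forms directly (a sketch assuming BenchmarkTools.jl is installed; timings will of course vary with machine and Julia version):

using BenchmarkTools

mask = zeros(Int, 500_000)

# Predicate form: hits the generic findall path on older Julia versions.
@btime findall(==(0), $mask);

# Broadcast form: materializes a BitVector first, then uses the fast BitArray findall.
@btime findall($mask .== 0);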
I appreciate that, but I do not think that translating np.zeros(n) to trues(n) and masking by the bit vector are straightforward translations. What I wanted to know is where the difference was coming from, not whether a faster version could be written. The “problem” with findall explains most of it, although I think I noticed a significant time gain when I removed the call to argmax, something that was not noticeable in the Python version.
Since there is no direct counterpart to Python’s np.where in Julia (findall(==(0), mask) is not a strictly direct counterpart of np.where(mask == 0)[0] either), that level of difference is unavoidable.
If you are dissatisfied with that, rewrite the Python side using a counterpart of Julia’s BitVector instead.
Also, note that the Julia code is actually shorter. It would be odd to forbid something that can be written so simply on the Julia side.
There are no miracles, after all.
If we write the code in an ordinary way on the Julia side, we can make it faster than the Python Numba version without needing a specialized tool like LoopVectorization.jl; see the sketch at the end of this post.
It would, however, be very interesting if someone could report on the details of the implementation of np.where.
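As an illustration of what I mean by writing it in an ordinary way (my own sketch, with an illustrative name fit_reuse): reuse one preallocated index buffer and fill it with a plain loop, so nothing is reallocated inside the hot loop.

function fit_reuse()
    mask = zeros(Int, 500_000)
    idxs = Vector{Int}(undef, length(mask))  # preallocated once, outside the loop
    n = 0
    for _ in 1:1000
        n = 0
        @inbounds for i in eachindex(mask)
            if mask[i] == 0      # same test as the MWE
                n += 1
                idxs[n] = i
            end
        end
    end
    return n                     # number of matches found in the last pass
end

println(fit_reuse())   # 500000
@time fit_reuse();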
My point is that for me, and probably for others, it was not at all obvious that findall(==(0), x) could be so much worse than findall(x .== 0). Personally, I would even have thought the opposite, since it is evident that x .== 0 allocates an intermediate vector that could be avoided.
It might well be that, depending on the test function and on the size and type of the vectors, the implementation of findall is the best one can get, all pros and cons considered. But I, at least, learnt something (about Julia) here.
Interestingly, it appears this slowness has already been fixed on master [EDIT: and in 1.7]. In array.jl, line 2280:
# Broadcasting is much faster for small testf, and computing
# integer indices from logical index using findall has a negligible cost
findall(testf::Function, A::AbstractArray) = findall(testf.(A))
So this performs the same as findall(mask .== 0). It seems that findall was written generically for any collection, and this specialization for AbstractArray was added recently.
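To make the difference concrete, here is a simplified sketch of the two strategies (illustrative helper names, not the actual Base code):

# Generic, iteration-based strategy: collect matching indices one by one.
findall_generic(pred, A) = [i for i in eachindex(A) if pred(A[i])]

# Broadcast-based strategy, like the AbstractArray specialization quoted above:
# materialize the matches as a BitArray, then use the fast BitArray findall.
findall_broadcast(pred, A) = findall(pred.(A))

mask = zeros(Int, 500_000)
@assert findall_generic(==(0), mask) == findall_broadcast(==(0), mask)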
julia> function fit()
           local idxs
           mask = zeros(Int, 500_000)
           for i in 1:1000
               mask_nz = Array{Bool}(mask)
               num_nz = count(mask_nz)
               if num_nz == 0
                   idxs = Int[]
               else
                   idxs = findall(mask_nz)
               end
           end
           return length(idxs)
       end
fit (generic function with 1 method)
julia> @time fit()
0.551720 seconds (3.00 k allocations: 480.835 MiB, 1.18% gc time)
0
In comparison, the Python code takes 0.7 seconds:
In [6]: import time
   ...: import numpy as np
   ...:
   ...: def fit():
   ...:     mask = np.zeros(500_000)
   ...:     for i in range(0, 1000):
   ...:         idxs = np.where(mask == 0)[0]
   ...:     return idxs.size
   ...:
   ...: print(fit())
   ...: tic = time.time()
   ...: fit()
   ...: toc0 = time.time() - tic
   ...: print(toc0)
   ...:
500000
0.7331986427307129
The version above has an issue: what happens if mask contains values other than 0 or 1? In that case, we can use count with a predicate instead, which turns out to be even faster.
julia> function fit()
           local idxs
           mask = zeros(Int, 500_000)
           for i in 1:1000
               num_nz = count(!=(0), mask)
               if num_nz == 0
                   idxs = Int[]
               else
                   idxs = findall(!=(0), mask)
               end
           end
           return length(idxs)
       end
fit (generic function with 1 method)
julia> @time fit()
0.289683 seconds (1.00 k allocations: 3.891 MiB)
0
I just realized that I inverted the problem while reverse-engineering NumPy; here’s the fix:
julia> function fit()
           local idxs
           mask = zeros(Int, 500_000)
           for i in 1:1000
               num_nz = count(==(0), mask)
               if num_nz == 0
                   idxs = Int[]
               elseif num_nz == length(mask)
                   idxs = eachindex(mask)
               else
                   idxs = findall(==(0), mask)
               end
           end
           return length(idxs)
       end
fit (generic function with 1 method)
julia> @time fit()
0.288805 seconds (2 allocations: 3.815 MiB)
500000
I don’t think you have to, if you use master. There, findall(==(0), mask) ends up calling the version in bitarray.jl, which starts like this:
function findall(B::BitArray)
    nnzB = count(B)
    I = Vector{eltype(keys(B))}(undef, nnzB)
    nnzB == 0 && return I
So I believe it’s already similar to NumPy by counting non-zeros, [EDIT: my order was wrong] pre-allocating, and shortcutting if there are none.
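As a sketch, that count-then-fill pattern looks roughly like this for a predicate (illustrative name findall_twopass, not the actual Base implementation):

function findall_twopass(pred, A::AbstractVector)
    n = count(pred, A)               # first pass: count the matches
    I = Vector{Int}(undef, n)        # preallocate exactly n slots
    n == 0 && return I               # shortcut when nothing matches
    j = 0
    @inbounds for i in eachindex(A)  # second pass: fill in the indices
        if pred(A[i])
            j += 1
            I[j] = i
        end
    end
    return I
end

mask = zeros(Int, 500_000)
@assert findall_twopass(==(0), mask) == findall(==(0), mask)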
I really think that benchmarking code should be separated from the code being benchmarked. When benchmarking loops are mixed into the main code, it becomes hard to see what could be optimized and improved.
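For example, with BenchmarkTools.jl one can benchmark just the kernel and leave the repetition to the harness (a sketch; kernel is an illustrative name):

using BenchmarkTools

# The code being measured, free of any timing loop.
kernel(mask) = findall(==(0), mask)

mask = zeros(Int, 500_000)
@btime kernel($mask);   # BenchmarkTools handles repetitions and statistics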