I get a significant speedup if I use this function for the hot loop, @rmeinl:
using LoopVectorization  # provides @tturbo

function get_gains!(X, current_values, idxs, gains)
    @tturbo for i in eachindex(idxs)
        s = 0.0
        for j in eachindex(current_values)
            s += sqrt(current_values[j] + X[j, idxs[i]])
        end
        gains[i] = s
    end
end;
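For reference, here is roughly how I call it; the sizes match the benchmark further down, and the point is that gains is preallocated once and reused, so the hot loop itself does not allocate:

d, n = 54, 581012
X = rand(d, n) * 100
current_values = zeros(d)
idxs = collect(1:n)
gains = zeros(n)   # preallocated output, reused across calls
get_gains!(X, current_values, idxs, gains)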
Anyway, thanks for the feedback. I still think that if Numba is effectively doing much better than @tturbo here, we need to localize the difference in the smallest working example possible and bring Chris to the table :-).
Great point, thanks! I was already using the popat! function suggested above, so the speedup was not as drastic, but it still improved a little. The latest times I got were:
# NaiveGreedy
10.766 s (82941 allocations: 15.02 MiB)
# LazyGreedy
5.904 s (201 allocations: 167.93 MiB)
# NaiveGreedy old
11.923 s (82912 allocations: 15.02 MiB)
function get_gains!(X, current_values, idxs, gains)
    Threads.@threads for i in eachindex(idxs)
        s = 0.0
        for j in eachindex(current_values)
            @inbounds s += @fastmath sqrt(current_values[j] + X[j, idxs[i]])
        end
        gains[i] = s
    end
end;
#############################
# NaiveGreedy with @tturbo
12.853 s (27 allocations: 8.89 MiB)
function get_gains!(X, current_values, idxs, gains)
    @tturbo for i in eachindex(idxs)
        s = 0.0
        for j in eachindex(current_values)
            s += sqrt(current_values[j] + X[j, idxs[i]])
        end
        gains[i] = s
    end
end;
This is probably a MWE for the critical loop, in Julia:
using BenchmarkTools
using LoopVectorization

function get_gains!(x, y, idxs, gains)
    @tturbo for i in eachindex(idxs)
        s = 0.0
        for j in eachindex(y)
            s += sqrt(y[j] + x[j, idxs[i]])
        end
        gains[i] = s
    end
end;

d = 54
n = 581012
x = rand(d, n) * 100;
y = zeros(Float64, d);
idxs = collect(1:n);
gains = zeros(Float64, n);

@benchmark get_gains!($x, $y, $idxs, $gains)
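If a single-threaded baseline helps for comparison, the same loop can be timed without @tturbo (get_gains_serial! is just an illustrative name):

function get_gains_serial!(x, y, idxs, gains)
    @inbounds for i in eachindex(idxs)
        s = 0.0
        for j in eachindex(y)
            s += sqrt(y[j] + x[j, idxs[i]])
        end
        gains[i] = s
    end
end;

@benchmark get_gains_serial!($x, $y, $idxs, $gains)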
Do you agree this is the bottleneck? I am too unfamiliar with Python and Numba to confidently write something comparable there.
If we can really localize and quantify the performance difference, then we can post an issue somewhere so someone can take a look at it.
PS: what is the Python time for the naive greedy algorithm in this case?
The MWE looks good and yes, it seems that get_gains! is the bottleneck.
The Python version ran in 13.24439001083374 s for the NaiveGreedy, but I had already shaved off a lot of time by removing allocations, as mentioned above.
Thank you for the very well documented example. If you or someone else can provide a MWE of the gains function above in Python/Numba, just to check whether there is really a large difference between the parallelization and SIMD that Numba manages and what Julia/@tturbo does, that would be nice. Just so we understand exactly what happened overall and, perhaps, document it for improvements.
Thanks. I get the same. I tried shuffling the indexes of idxs, with which the Julia code runs in 30 ms instead, but still against 600 ms for the Python code. Even a plain loop without @tturbo or threading runs in 145 ms, so either we are not measuring what we expect to measure, or the original difference between the codes lies somewhere else.
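For reference, by shuffling I mean randomizing the access order before re-running the benchmark, so that the columns of x are no longer visited sequentially:

using Random
shuffle!(idxs)   # columns of x are now accessed in random order
@benchmark get_gains!($x, $y, $idxs, $gains)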
I’m not sure if this will really help, but as far as I understand, the original problem was that this Python script runs in 15 s, while this Julia script runs in 26 s.
The Julia script there uses @threads; if that is replaced by @tturbo (both get_gains! functions are there), it accelerates to 22 s (here). Thus @threads is doing a poor job and @tturbo does much better, which is somewhat expected since the loop body does a very cheap calculation.
Why exactly the Python script is still faster is not clear to me yet, but I am inclined to believe that the get_gains! function and the efficiency of that parallelization/vectorization are no longer responsible for the difference. That is because when we tried to pin down the performance of that function, here and here, Julia seems to be doing fine relative to Numba (with or without @turbo).
The “solution” instead consisted of rewriting the algorithm: the improvements proposed above led to this “naive greedy” version, which runs in 13 s but keeps the same hot loop above, and to the improved “lazy greedy” version, which runs in 7 s.
Thus, part of the “original problem” was the poor performance of @threads for a loop that does very cheap calculations, but the greatest improvements came from changing other things unrelated to that loop. I still find the original problem interesting because, while the path of removing allocations etc. is clearly expected to improve the Julia code, it is not completely evident where the Python script gets that 7-10 s advantage in the original implementations (and it is not in the Numba-jitted part).
In my experience, Julia tends to be a bit faster than Numba when things go well in general; Numba is also fast enough and quite useful, so I encourage my friends to use Numba. But Numba requires special limiting conditions to be successful; it can’t be “anything goes” like Julia. Particular attention should be paid to the use of multiple libraries in combination. For example, take a look at the following:
The only operation needed is to remove the element idxs[idx] from the array idxs. A straightforward Julia translation would be popat!(idxs, idx), even if there is a better way.
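A minimal sketch of the difference (toy sizes, just for illustration): the mask-based approach rebuilds the index array, while popat! removes one element in place:

idxs = collect(1:10)
idx = 3

# Mask-based rebuild: allocates a fresh array every iteration.
mask = trues(length(idxs))
mask[idx] = false
idxs_masked = idxs[mask]

# In-place removal: shifts the tail left, no new array allocated.
popat!(idxs, idx)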
Well, I do not quite agree that that was a straightforward translation. You improved what was being done, at least as far as I can understand. The np.where command in Python returns the array of indices at which mask == 0:
So what is the miracle Python is doing there? Why does that findall slow down the Julia code so much, but not the Python one?
IMO the “straightforward” translation would be pretty much what the OP posted originally, and while I agree that it can be improved (as you did), I don’t see why it is 50% slower in Julia than in Python.
Because findall(==(0), mask) grows its output incrementally by push!ing new elements, it requires several copies and is probably hard, if not impossible, to SIMD. It’s faster to use findall(mask .== 0), which generates a BitVector that allows efficient determination of the output size (probably a popcnt on the underlying UInt64 storage). Python is probably doing something similar.
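If someone wants to reproduce this locally, a quick comparison of the two variants could look like the following (array size taken from the MWE above; exact timings will of course vary by machine):

using BenchmarkTools
mask = rand(0:1, 581012);

# Predicate form: output length unknown up front, grown by push!
@btime findall(==(0), $mask);

# Mask form: mask .== 0 builds a BitVector whose bit count gives the
# output size, so the result vector can be allocated in one shot.
@btime findall($mask .== 0);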
That is a nice insight. Unfortunately, my microbenchmarks indicate that findall is faster than np.where. But in that respect I must admit that I am not completely confident that I’m benchmarking the Python script correctly:
Python and Julia have very different GCs that are optimized for different things. I wonder if Python’s GC is just better optimized for this particular allocation pattern. Julia’s GC is non-compacting mark-and-sweep, probably optimized for high throughput with large allocations (say, big matrices), whereas Python’s is based on reference counting, probably optimized for many small allocations, since that’s the way you are supposed to write Python code.
I say “probably” since I’m not an expert in garbage collection; I’m merely guessing.