I noticed a performance regression in Cuba.jl on Julia 0.6, compared to Julia 0.5. It can be reproduced with these simple examples:
Julia 0.5:
julia> using Cuba, BenchmarkTools
julia> @benchmark cuhre((x,f) -> f[1] = x[1])
BenchmarkTools.Trial:
memory estimate: 39.98 KiB
allocs estimate: 982
--------------
minimum time: 25.191 μs (0.00% GC)
median time: 26.890 μs (0.00% GC)
mean time: 31.741 μs (11.10% GC)
maximum time: 2.149 ms (97.00% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
Julia 0.6:
julia> using Cuba, BenchmarkTools
julia> @benchmark cuhre((x,f) -> f[1] = x[1])
BenchmarkTools.Trial:
memory estimate: 39.98 KiB
allocs estimate: 982
--------------
minimum time: 32.348 μs (0.00% GC)
median time: 35.141 μs (0.00% GC)
mean time: 42.440 μs (10.04% GC)
maximum time: 3.181 ms (94.20% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
On Julia 0.6 it’s almost 30% slower than in the previous version. Any clue about what the culprit could be?
Before, Cuba.jl was really competitive with equivalent C/Fortran code (see https://github.com/giordano/Cuba.jl#performance), but now it’s noticeably slower.
Edit: contrary to what was written in the original version of the post, it’s not necessary to check out master to reproduce the issue.
Update: I bisected some 600 commits, with a possible culprit in mind. On commit f27c6f3ae50b45e0e6ff2305dd5031d07c8665a7 (one of the two parents of the merge commit of https://github.com/JuliaLang/julia/issues/17057) the performance of Cuba.jl was still fine, the same as in Julia 0.5. After that revision it was impossible to use the package for some time, because of the world-age problem. The first revision on which Cuba.jl became usable again is bfd9c7ab805f38298a04f6fd74e6c62fadb2494c (part of https://github.com/JuliaLang/julia/pull/20167). So the problem seems to lie somewhere between those two commits, bounds included.
If it was slower after the world-age change, maybe it’s due to the cfunction change. I tested the case I care about to make sure it doesn’t slow down, but maybe you are hitting the slow path. Can you do a @profile with C=true in the printing?
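For reference, a minimal way to collect such a profile on Julia 0.6 could look like the sketch below (the loop count is arbitrary; on 0.6 both @profile and Profile are available from Base, no extra using needed):

using Cuba
cuhre((x, f) -> f[1] = x[1])        # warm up, so compilation doesn't end up in the profile
Profile.clear()
@profile for _ in 1:1000
    cuhre((x, f) -> f[1] = x[1])
end
Profile.print(C = true)             # C = true includes the C frames in the report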
I just took a quick look at the differences in the number of hits on each line with @profile. One line that stood out was .../Cuba/src/Cuba.jl:101; generic_integrand!(::Int32, ::Ptr{Float64}, ::Int32, ::Ptr{Float64}, ::Ptr{Void}). On 0.5, that line accounts for 7% of the profiling hits, whereas it accounts for nearly 20% on 0.6.
I faithfully implemented the callback as suggested here: Passing Julia Callback Functions to C. Maybe now something better should be done?
Uhm, removing ::Function from that line makes the regression go away. Does it make sense?
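For context, a rough sketch of the kind of callback pattern being discussed follows; the names and signature are illustrative, not Cuba.jl’s exact code, and the only difference between the two versions is the ::Function assertion on the object recovered from the userdata pointer:

# Illustrative only: recover the integrand passed through the opaque pointer
# and assert its type, which triggers a subtyping check on every call.
function callback_with_assert(userdata::Ptr{Void})
    func = unsafe_pointer_to_objref(userdata)::Function
    func()
    return nothing
end

# Same callback without the annotation: no isa-Function check in the hot path.
function callback_without_assert(userdata::Ptr{Void})
    func = unsafe_pointer_to_objref(userdata)
    func()
    return nothing
end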
That’s beyond my pay-grade, but I do know that the subtyping calculation got a little more expensive with the new type system. I think that landed within the commit range you’ve identified.
Ok, for the time being I simply removed the type annotation of the function, and performance is now really exciting (see the updated performance section of the README.md). Maybe the blog post on julialang.org should be updated (cc @stevengj)? The type annotation is also there.
I saw that PR (thinking that it could have been useful for Cuba.jl as well), but honestly I had a hard time understanding how it achieves what’s described in the commit message. Could you please give me a hint?
Instead of using
function callback(ptr::Ptr{Void})
unsafe_pointer_to_objref(ptr)()
end
ccall(cfunction(callback, Void, (Ptr{Void},)), Void, (Any,), f)
Use
function callback(f)
f()
end
ccall(cfunction(callback, Void, (Ref{typeof(f)},)), Void, (Any,), f)
instead.
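For concreteness, a tiny self-contained run of the second pattern could look like this (Julia 0.6 syntax; f is just a toy callback here, not Cuba.jl’s integrand):

f() = println("called back")   # toy callback

callback(g) = (g(); nothing)   # what the C side would invoke

# Build the C-callable wrapper and call it directly through ccall:
# the object is passed as Any and received as Ref{typeof(f)}.
ptr = cfunction(callback, Void, (Ref{typeof(f)},))
ccall(ptr, Void, (Any,), f)    # prints "called back"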
You mean this: https://github.com/giordano/Cuba.jl/commit/1552d06ea55b626b5493f07353681034cc85a938 ?
That’s amazing, now the cuhre benchmark is ~10% faster than Fortran!
That’s about right. You can also replace the ::Function in the function signature by a type parameter to force specialization on that. It should improve the performance of cfunction construction, but that shouldn’t be called in the loop, so it may not matter much.
I should replace
integrand_ptr(integrand::Function) = ...
with
integrand_ptr{T}(integrand::T) = ...
right? Only this? I confirm it doesn’t change much.
Also on the function that calls integrand_ptr. Assuming this is constructed once and then the same pointer is used in the loop, it shouldn’t matter too much.
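For reference, a hedged sketch of the difference between the two signatures (0.6 syntax; the function names and the cfunction arguments are made up for illustration, not Cuba.jl’s actual code):

# Not specialized: the callback is only passed along, never called, so Julia
# compiles a single method body for the abstract Function type and
# typeof(callback) is only known at run time.
make_cfun_generic(callback::Function) =
    cfunction(callback, Void, (Ref{typeof(callback)},))

# Specialized: the type parameter forces a compilation per concrete callback
# type, so T is a compile-time constant inside the body.
make_cfun_specialized{T}(callback::T) =
    cfunction(callback, Void, (Ref{T},))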