Performance regression of Cuba.jl on Julia 0.6

I noticed a performance regression of Cuba.jl on Julia 0.6 compared to Julia 0.5. It can be reproduced with these simple examples:

Julia 0.5:

julia> using Cuba, BenchmarkTools

julia> @benchmark cuhre((x,f) -> f[1] = x[1])
BenchmarkTools.Trial: 
  memory estimate:  39.98 KiB
  allocs estimate:  982
  --------------
  minimum time:     25.191 μs (0.00% GC)
  median time:      26.890 μs (0.00% GC)
  mean time:        31.741 μs (11.10% GC)
  maximum time:     2.149 ms (97.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Julia 0.6:

julia> using Cuba, BenchmarkTools

julia> @benchmark cuhre((x,f) -> f[1] = x[1])
BenchmarkTools.Trial: 
  memory estimate:  39.98 KiB
  allocs estimate:  982
  --------------
  minimum time:     32.348 μs (0.00% GC)
  median time:      35.141 μs (0.00% GC)
  mean time:        42.440 μs (10.04% GC)
  maximum time:     3.181 ms (94.20% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

On Julia 0.6 it’s almost 30% slower than on the previous version (minimum and median times). Any clue as to what the culprit could be?
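For reference, the relative slowdown can be worked out directly from the two benchmark outputs above. This is only a quick arithmetic sketch; the helper name `slowdown_percent` is made up for illustration, and the timings are copied from the two trials.

```julia
# Timings in microseconds, copied from the two BenchmarkTools trials above.
min_05, median_05, mean_05 = 25.191, 26.890, 31.741   # Julia 0.5
min_06, median_06, mean_06 = 32.348, 35.141, 42.440   # Julia 0.6

# Hypothetical helper: relative slowdown of `new` with respect to `old`, in percent.
slowdown_percent(old, new) = (new / old - 1) * 100

slowdown_min    = slowdown_percent(min_05, min_06)        # roughly 28%
slowdown_median = slowdown_percent(median_05, median_06)  # roughly 31%
slowdown_mean   = slowdown_percent(mean_05, mean_06)      # roughly 34%

println("min: ", round(slowdown_min), "%  median: ", round(slowdown_median),
        "%  mean: ", round(slowdown_mean), "%")
```

The minimum and median times are the most robust figures here, since the mean and maximum are affected by garbage collection pauses.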

Before, Cuba.jl was really competitive with equivalent C/Fortran code (see https://github.com/giordano/Cuba.jl#performance), but now it’s noticeably slower.

Edit: contrary to what was written in the original version of the post, it’s not necessary to check out master to reproduce the issue.

Update: I bisected some 600 commits, with a possible culprit in mind.

On commit f27c6f3ae50b45e0e6ff2305dd5031d07c8665a7 (one of the two parents of the merge commit for https://github.com/JuliaLang/julia/issues/17057), the performance of Cuba.jl was still fine, the same as on Julia 0.5. After that revision, the package was unusable for some time because of the world age problem. The first revision on which Cuba.jl became usable again is bfd9c7ab805f38298a04f6fd74e6c62fadb2494c (part of https://github.com/JuliaLang/julia/pull/20167).

So the problem seems to lie somewhere between these two revisions, bounds included.

If it was slower after the world age change, maybe it’s due to the cfunction change. I tested the case I care about to make sure it doesn’t slow down but maybe you are hitting the slow path. Can you do a @profile with C=true in the printing?
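A minimal sketch of what such a profiling run could look like. It is written for Julia 1.x, where `Profile` is a standard library (on Julia 0.6 it lives in `Base`, so the `using` line would not be needed there). The `workload` function is a hypothetical stand-in, not Cuba.jl code; the real run would profile the `cuhre` call instead.

```julia
using Profile  # standard library on Julia 1.x; an assumption about the Julia version in use

# Hypothetical stand-in workload; replace with the call under investigation.
function workload(n)
    s = 0.0
    for i in 1:n
        s += sqrt(i)
    end
    return s
end

workload(10)              # run once first so that compilation is not profiled
Profile.clear()           # discard samples from any earlier run
@profile workload(10^7)   # collect samples while the workload runs
Profile.print(C = true)   # C = true also shows frames from C and Fortran code
```

With `C = true` the printed tree includes frames from the C side, which is what makes it possible to see whether time is spent in the callback machinery rather than in the Julia integrand itself.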

I just took a quick look at the differences in the number of hits on each line with @profile. One line that stood out was .../Cuba/src/Cuba.jl:101; generic_integrand!(::Int32, ::Ptr{Float64}, ::Int32, ::Ptr{Float64}, ::Ptr{Void}). On 0.5, that line accounts for 7% of the profiling hits, whereas it accounts for nearly 20% on 0.6.


I faithfully implemented the callback as suggested here: Passing Julia Callback Functions to C. Maybe now something better should be done?

Uhm, removing ::Function from that line makes the regression go away. Does it make sense?

That’s beyond my pay-grade, but I do know that the subtyping calculation got a little more expensive with the new type system. I think that landed within the commit range you’ve identified.

You can also use https://github.com/stevengj/Cubature.jl/pull/23/files#diff-b34c7d94237ad3d15683f8eccabb72b4R116 to speed it up.

Ok, for the time being I simply removed the type annotation from the function, and performance is now really exciting (see the updated performance section of the README.md). Maybe the blog post on julialang.org should be updated (cc @stevengj)? The type annotation is there as well.

I saw that PR (thinking that it could have been useful for Cuba.jl as well), but honestly I had a hard time understanding how it achieves what’s described in the commit message. Could you please give me a hint?

Instead of using

function callback(ptr::Ptr{Void})
    unsafe_pointer_to_objref(ptr)()
end
ccall(cfunction(callback, Void, (Ptr{Void},)), Void, (Any,), f)

Use

function callback(f)
    f()
end
ccall(cfunction(callback, Void, (Ref{typeof(f)},)), Void, (Any,), f)


You mean this: https://github.com/giordano/Cuba.jl/commit/1552d06ea55b626b5493f07353681034cc85a938 ?

That’s amazing, now the cuhre benchmark is ~10% faster than Fortran!


That’s about right. You can also replace the ::Function in the function signature by a type parameter to force specialization on that. It should improve performance on cfunction construction but that shouldn’t be called in the loop so it may not matter much.

I should replace

integrand_ptr(integrand::Function) = ...

with

integrand_ptr{T}(integrand::T) = ...

right? Only this? I confirm it doesn’t change much.

Also on the function that calls integrand_ptr. Assuming this is constructed once and then the same pointer is used in the loop, it shouldn’t matter too much.
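To illustrate the difference between the `::Function` annotation and a type parameter discussed above, here is a small sketch. All names in it (`call_integrand`, `evaluate_annotated`, `evaluate_parametric`, `square`) are hypothetical and are not Cuba.jl's definitions. It uses the `where` syntax, which is available from Julia 0.6 onwards and is equivalent to the `integrand_ptr{T}(integrand::T)` form.

```julia
# Hypothetical helpers for illustration only; this is not Cuba.jl's code.
call_integrand(f, x) = f(x)

# Annotated with the abstract type Function. Here f is only passed through
# to another function, a case where Julia may choose not to specialize the
# method on the concrete type of f.
evaluate_annotated(f::Function, x) = call_integrand(f, x)

# With a type parameter, the method is specialized for each concrete typeof(f).
evaluate_parametric(f::F, x) where {F} = call_integrand(f, x)

square(x) = x^2

result_annotated  = evaluate_annotated(square, 3.0)
result_parametric = evaluate_parametric(square, 3.0)
```

Both versions return the same value; the only difference is how much the compiler knows about `f` when it compiles the outer method. As noted above, this mostly matters when the outer function is itself called in a hot loop.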