Why is closure slower than handmade callable struct?

I’m having a hard time understanding why, in the following code, a closure is slower than an equivalent hand-made callable struct.

using BenchmarkTools
using Interpolations
using LinearAlgebra: norm
using StaticArrays
using QuadGK

speed(spline, t) = norm(Interpolations.gradient1(spline, t))

# closure version

length_closure(spline) = quadgk(t -> speed(spline, t), 0, length(spline))

# hand-made struct version

struct LenIntegrand{S}
    spline::S
end

(li::LenIntegrand)(t) = speed(li.spline, t)

length_struct(spline) = quadgk(LenIntegrand(spline), 0, length(spline))
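For reference, a closure is itself lowered to a hidden callable struct whose fields are the captured variables, which is why one would expect the two versions to behave the same. A minimal sketch (independent of the spline code above):

```julia
# A closure lowers to a hidden callable struct; each captured variable
# becomes a field of that struct.
capture_demo(x) = t -> (x, t)   # closure capturing `x`

f = capture_demo(42)
fieldnames(typeof(f))           # the captured variable shows up as a field: (:x,)
isstructtype(typeof(f))         # true: the closure's type is an ordinary struct type
```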

# benchmarking code

θs = range(0, 2π, length=25)[1:end-1]
xs, ys = 2cos.(θs), 0.5sin.(θs)
vec = [SA[x,y] for (x,y) in zip(xs, ys)]
spl = extrapolate(interpolate(vec, BSpline(Cubic(Periodic(OnCell())))), Periodic())

@benchmark length_closure($spl) # ~ 65 μs
@benchmark length_struct($spl)  # ~ 45 μs

I can observe the same behavior in other similar examples: a hand-made struct always beats a closure when used to fix some arguments, whereas I expected the two implementations to be pretty much equivalent. `@code_warntype` does not seem to help me understand the underlying issue here.


It could be related to the Performance Tips section of the Julia manual (“Be aware of when Julia avoids specializing”).

Changing some instances of `quadgk(f, segs...; kws...)` to `quadgk(f::F, segs...; kws...) where {F}` would confirm that.
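For context, the heuristic at play, in a minimal sketch with hypothetical function names: Julia may avoid specializing on a function-typed argument that is only passed through to another call (never called directly), while a `where {F}` annotation forces a fresh specialization per concrete function type.

```julia
# Hypothetical names for illustration. `f` is only passed along, never
# called in this method body, so Julia's heuristic may compile a single
# generic instance for it:
pass_through(f, x) = apply(f, x)

# The `where {F}` annotation forces specialization on the concrete
# function type, mirroring the suggested change to quadgk's signature:
pass_through_forced(f::F, x) where {F} = apply(f, x)

apply(f, x) = f(x)

pass_through(sin, 1.0)
pass_through_forced(sin, 1.0)

# The compiled instances can be inspected as done later in this thread:
# methods(pass_through)[1].specializations
```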

Ah, great catch! I always try to watch out for missed `::Function` specialization in my code, but it didn’t occur to me in this case.

I’m left with a doubt though. After running both versions, methods(quadgk)[1].specializations shows that quadgk got specialized three times:

  • for ::LenIntegrand{Interpolations.Extrapolation{SVector{2, Float64}, 1, Interpolations.BSplineInterpolation{SVector{2, Float64}, 1, Vector{SVector{2, Float64}}, BSpline{Cubic{Periodic{OnCell}}}, Tuple{Base.OneTo{Int64}}}, BSpline{Cubic{Periodic{OnCell}}}, Periodic{Nothing}}} (the hand-made callable struct)
  • for ::Function
  • for ::var"#1#2"{Interpolations.Extrapolation{SVector{2, Float64}, 1, Interpolations.BSplineInterpolation{SVector{2, Float64}, 1, Vector{SVector{2, Float64}}, BSpline{Cubic{Periodic{OnCell}}}, Tuple{Base.OneTo{Int64}}}, BSpline{Cubic{Periodic{OnCell}}}, Periodic{Nothing}}} (the closure).

Why did quadgk get specialized at all, in apparent contradiction to its signature and https://docs.julialang.org/en/v1/manual/performance-tips/#Be-aware-of-when-Julia-avoids-specializing?
Given that quadgk got specialized anyway, why is there still a performance difference between the two versions?

Actually, I think this has already been fixed on the master branch of QuadGK. Probably by “Allow passing preallocated segsbuf for alloc reuse” (JuliaMath/QuadGK.jl#59, commit 298f76e), which added some `where F` annotations.

Using the master branch I get the same performance with both an anonymous function and a callable struct.
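For anyone wanting to reproduce this before the fix is released, the development branch can be installed with the standard Pkg API (the URL is the repository linked above):

```julia
# Install the master branch of QuadGK (sketch; standard Pkg.add keywords):
using Pkg
Pkg.add(url="https://github.com/JuliaMath/QuadGK.jl", rev="master")
```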