Broadcast call for closures vs callable objects

Hi,

I’ve got a problem that can be solved nicely with closures. I’ve also explored the alternative, which is to make a custom callable object. Since I would like to dispatch on both as functions, I’ve additionally tried making that callable object a subtype of Function.

Both the closure and the Function subtype run about 3x slower than the plain callable object in a broadcast call. This is a rather massive performance hit for a step that dominates my runtime.
Traceur.@trace finds nothing to bark at. For the broadcast calls, the output of @code_warntype is beyond my ability to parse; all cases contain an odd T::Any that doesn’t seem to be used anywhere.
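
For reference, this is roughly how I checked for type instabilities (just a sketch, using the definitions given below; Traceur is the Traceur.jl package):

using Traceur
f_closure = closure(1.0, zeros(Float64, 1000))   # `closure` is defined below
ts = linspace(0, 2000, 2001)
@trace f_closure(3.0)                            # scalar call: nothing flagged
@code_warntype broadcast(f_closure, ts)          # f_closure.(ts) lowers to this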

Here are examples (not exactly minimal, sadly). The object precomputes the result and holds it in a lookup table; values beyond the table’s range are extrapolated (with dt = 1.0 and a 1000-entry table, t = 3.0 maps to index 4 and is looked up, while t = 1500.0 falls past the table and returns t*Im). The three implementations follow, then the @benchmark results, then the benchmarking boilerplate (each case is run inside a function to avoid global-scope issues).

# closure
function closure(dt::Real, grid::Array{<:Number,1})
    I1 = cumsum_kbn(grid)        # compensated (KBN) cumulative sum
    Im = I1[end]
    lut = cumsum_kbn(I1)         # precomputed lookup table
    t -> begin
        i = Int(t/dt) + 1                     # grid index for time t
        i <= length(lut) ? lut[i] : t*Im      # in-range lookup, else extrapolate
    end
end

# Callable object
immutable Gridded{Tg<:Real, Tcf<:Number}
    dt::Tg
    Im::Tcf
    lut::Array{Tcf,1}
end
function Gridded(dt, grid)  # constructor
    I1 = cumsum_kbn(grid)
    Gridded(dt, I1[end], cumsum_kbn(I1))
end
function (cf::Gridded)(t) # callable
    i = Int(t/cf.dt)+1
    i <= length(cf.lut) ? cf.lut[i] : t*cf.Im
end

# Subtype of Function
immutable GriddedFunc{Tg<:Real, Tcf<:Number} <: Function
    dt::Tg
    Im::Tcf
    lut::Array{Tcf,1}
end
function GriddedFunc(dt, grid)  # constructor
    I1 = cumsum_kbn(grid)
    GriddedFunc(dt, I1[end], cumsum_kbn(I1))
end
function (cf::GriddedFunc)(t) # callable
    i = Int(t/cf.dt)+1
    i <= length(cf.lut) ? cf.lut[i] : t*cf.Im
end
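
For concreteness, all three variants are called the same way; a quick sanity check (just a sketch, assuming the definitions above) looks like:

a = zeros(Float64, 1000); a[1] = 1.0
f = closure(1.0, a)
g = Gridded(1.0, a)
h = GriddedFunc(1.0, a)
f(3.0) == g(3.0) == h(3.0)        # true: identical scalar results
f.(linspace(0, 2000, 2001))       # the broadcast call benchmarked below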

Benchmark results:

  • closure
BenchmarkTools.Trial: 
  memory estimate:  4.47 KiB
  allocs estimate:  28
  --------------
  minimum time:     12.763 μs (0.00% GC)
  median time:      13.858 μs (0.00% GC)
  mean time:        18.482 μs (4.79% GC)
  maximum time:     4.703 ms (98.44% GC)
  --------------
  samples:          10000
  evals/sample:     1
  • callable object
BenchmarkTools.Trial: 
  memory estimate:  3.38 KiB
  allocs estimate:  2
  --------------
  minimum time:     4.480 μs (0.00% GC)
  median time:      4.793 μs (0.00% GC)
  mean time:        7.001 μs (5.59% GC)
  maximum time:     702.671 μs (98.24% GC)
  --------------
  samples:          10000
  evals/sample:     7
  • callable <: Function
BenchmarkTools.Trial: 
  memory estimate:  4.47 KiB
  allocs estimate:  28
  --------------
  minimum time:     12.763 μs (0.00% GC)
  median time:      13.493 μs (0.00% GC)
  mean time:        17.480 μs (4.67% GC)
  maximum time:     4.692 ms (97.87% GC)
  --------------
  samples:          10000
  evals/sample:     1

Benchmarking code:

function bench_closure()
    a = zeros(Float64, 1000)
    a[1] = 1.0
    dt = 1.0
    f_closure = closure(dt, a)
    ts = linspace(0, 2000, 2001)
    @benchmark $f_closure.($ts)
end
function bench_callable()
    a = zeros(Float64, 1000)
    a[1] = 1.0
    dt = 1.0
    gridded = Gridded(dt, a)
    ts = linspace(0, 2000, 2001)
    @benchmark $gridded.($ts)
end
function bench_callable_function()
    a = zeros(Float64, 1000)
    a[1] = 1.0
    dt = 1.0
    gridfun = GriddedFunc(dt, a)
    ts = linspace(0, 2000, 2001)
    @benchmark $gridfun.($ts)
end

Most likely it’s BenchmarkTools being wrong: https://github.com/JuliaCI/BenchmarkTools.jl/issues/71
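
If that is indeed what is happening here, the extra time and allocations for the closure are a benchmarking artifact rather than genuine call overhead. A minimal cross-check (a sketch) is to benchmark the same callable bound to a plain global versus a const one:

julia> f_plain = closure(1.0, zeros(Float64, 1000));        # non-const global, interpolated with $

julia> @benchmark $f_plain(3.0)

julia> const f_const = closure(1.0, zeros(Float64, 1000));  # const global, referenced directly

julia> @benchmark f_const(3.0)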

I think declaring the functions constant avoids the problem, and I get equal run times without allocations:

julia> a = zeros(Float64, 1000);

julia> a[1] = 1.0;

julia> dt = 1.0;

julia> const ts = linspace(0, 2000, 2001);

julia> const cf_closure = closure(dt, a);

julia> res = cf_closure.(ts);

julia> @benchmark $res .= cf_closure.(ts)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.186 μs (0.00% GC)
  median time:      7.449 μs (0.00% GC)
  mean time:        7.555 μs (0.00% GC)
  maximum time:     17.413 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> const gridded = Gridded(dt, a);

julia> @benchmark $res .= gridded.(ts)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.211 μs (0.00% GC)
  median time:      7.452 μs (0.00% GC)
  mean time:        7.604 μs (0.00% GC)
  maximum time:     17.475 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> const gridfun = GriddedFunc(dt, a);

julia> @benchmark $res .= gridfun.(ts)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.213 μs (0.00% GC)
  median time:      7.436 μs (0.00% GC)
  mean time:        7.500 μs (0.00% GC)
  maximum time:     13.248 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

Although the fact that his Gridded is nearly twice as fast as mine is suspicious; I’m on a system with a 3.4 GHz processor, so such a speed difference seems unexpected:

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40* (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT NO_LAPACK NO_LAPACKE ZEN)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, generic)

Repeating this now with a clear head yields less drastic results (I must have been sloppy with the workspace). Runtime is still measurably shorter for the callable object, with a reduced number of allocations.

Results are equivalent under manual benchmarking. Maybe I’m hitting an issue similar to the one linked by @yuyichao.
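
By manual benchmarking I mean a crude @time loop, roughly like this sketch (the non-const globals make it noisy, but the ranking comes out the same):

julia> a = zeros(Float64, 1000); a[1] = 1.0;

julia> ts = linspace(0, 2000, 2001);

julia> f = closure(1.0, a); f.(ts);          # warm up so compilation isn't timed

julia> @time for i in 1:1000; f.(ts); end    # repeat with Gridded(dt, a) and GriddedFunc(dt, a)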

julia> function bench(thing)
           ts = linspace(0, 2000, 2001)
           res = thing.(ts)
           display(@benchmark $res = $thing.($ts))
           res
       end;

julia> a = zeros(Float64, 1000); a[1] = 1.0; dt = 1.0;

julia> bench(closure(dt, a));
BenchmarkTools.Trial:
  memory estimate:  17.03 KiB
  allocs estimate:  29
  --------------
  minimum time:     29.903 μs (0.00% GC)
  median time:      34.644 μs (0.00% GC)
  mean time:        42.864 μs (6.38% GC)
  maximum time:     4.759 ms (98.50% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> bench(Gridded(dt, a));
BenchmarkTools.Trial:
  memory estimate:  15.81 KiB
  allocs estimate:  1
  --------------
  minimum time:     22.974 μs (0.00% GC)
  median time:      26.621 μs (0.00% GC)
  mean time:        29.779 μs (0.00% GC)
  maximum time:     167.749 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> bench(GriddedFunc(dt, a));
BenchmarkTools.Trial:
  memory estimate:  17.03 KiB
  allocs estimate:  29
  --------------
  minimum time:     29.173 μs (0.00% GC)
  median time:      30.998 μs (0.00% GC)
  mean time:        36.635 μs (2.01% GC)
  maximum time:     1.388 ms (94.74% GC)
  --------------
  samples:          10000
  evals/sample:     1

I have yet to upgrade to 0.6.2; I’ll revisit once I have, in case that fixes it.

julia> versioninfo()
Julia Version 0.6.0
Commit 903644385b* (2017-06-19 13:05 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Prescott)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)

Thanks