Why are my higher-order functions a million times slower than they should be?

While this example looks contrived, I ran into this performance issue in real code and reduced it to this MWE. I expect a slight penalty for calling a function dynamically (i.e. _slow instead of the constant slow), but I don't expect a million-fold slowdown. What really confuses me, though, is that LLVM can optimize _slow into nothing, @code_warntype shows no problems, and replacing f with y -> f(y) fixes it.

apply(f, x) = f(x)

function slow(f, x)
    y = apply(f, x)
    for _ in eachindex(y)
        
    end
end
_slow = slow

function fast(f, x)
    y = apply(y -> f(y), x)
    for _ in eachindex(y)
        
    end
end
_fast = fast

display(@benchmark _slow(identity, x) setup=(x=rand(10^6)))
display(@benchmark _fast(identity, x) setup=(x=rand(10^6)))
@code_llvm _slow(identity, rand(10^6))
BenchmarkTools.Trial: 75 samples with 1 evaluation.
 Range (min … max):  61.835 ms … 75.942 ms  ┊ GC (min … max): 3.69% … 3.74%
 Time  (median):     64.182 ms              ┊ GC (median):    4.50%
 Time  (mean ± σ):   64.818 ms ±  2.562 ms  ┊ GC (mean ± σ):  4.27% ± 0.78%

   ▂  █▆   ▂     ▂
  ██▁███▆▄▆█▆█▄█▄█▄▆▄▄▄▁▆▄▄▆██▁▁█▄▆▁▁▁▄▁▄▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  61.8 ms         Histogram: frequency by time        72.9 ms <

 Memory estimate: 61.02 MiB, allocs estimate: 2998978.
BenchmarkTools.Trial: 1769 samples with 997 evaluations.
 Range (min … max):  20.740 ns …  1.767 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     24.945 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   27.029 ns ± 42.861 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▂█▇▄▁
  ▂▂▃▂▅▅▅▇▅███████▅▄▃▄▃▂▂▂▂▂▂▂▁▂▁▂▁▁▁▂▂▁▁▂▁▁▁▂▁▁▁▂▁▁▁▁▂▁▁▂▂▂▂ ▃
  20.7 ns         Histogram: frequency by time        42.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
;  @ /Users/x/.julia/dev/slow_HOF_demo.jl:3 within `slow`
define void @julia_slow_5013({}* nonnull align 16 dereferenceable(40) %0) #0 {
top:
;  @ /Users/x/.julia/dev/slow_HOF_demo.jl:7 within `slow`
  ret void
}

Can someone please explain what is going on here?

Performance Tips · The Julia Language.

Care to elaborate? Do you have any idea why replacing f with y -> f(y) fixes it?

Also, note that this issue emerged in my real-world use case without using any global variables at all. I only need global variables to communicate with BenchmarkTools.

I think that is just a benchmarking artifact: _slow is a non-constant global, and you are not interpolating it into the benchmark expression.

julia> apply(f, x) = f(x)
apply (generic function with 1 method)

julia> function slow(f, x)
           y = apply(f, x)
           for _ in eachindex(y)
               
           end
       end
slow (generic function with 1 method)

julia> _slow = slow
slow (generic function with 1 method)

julia> @btime _slow(identity, x) setup=(x=rand(10^6))
  52.838 ms (2998978 allocations: 61.02 MiB)

julia> @btime $_slow($identity, x) setup=(x=rand(10^6))
  1.604 ns (0 allocations: 0 bytes)

julia> const _slow2 = slow
slow (generic function with 1 method)

julia> @btime _slow2(identity, x) setup=(x=rand(10^6))
  1.319 ns (0 allocations: 0 bytes)

This issue can be reproduced without any global variables:

function run()
    for f in [slow, fast]
        x=rand(10^8)
        @time f(identity, x)
    end
end

run()
run()
 10.807997 seconds (300.00 M allocations: 5.961 GiB, 4.94% gc time, 0.22% compilation time)
  0.051797 seconds (3.34 k allocations: 233.651 KiB, 99.24% compilation time)
  8.231645 seconds (300.00 M allocations: 5.960 GiB, 6.11% gc time)
  0.000018 seconds

Ah, ok. The thing is that Julia does not specialize on a function argument's type if that argument is never called from within the method body. Thus, the fix is:

julia> function slow(f::F, x) where F # annotate F type here !
           y = apply(f, x)
           for _ in eachindex(y)
               
           end
       end
slow (generic function with 1 method)

julia> function run()
           for f in [slow, fast]
               x = rand(10^6)
               @time f(identity, x)
           end
       end
run (generic function with 1 method)

julia> run()
  0.003013 seconds (3.17 k allocations: 170.994 KiB, 99.51% compilation time)
  0.000002 seconds

julia> run()
  0.000009 seconds
  0.000005 seconds

This is actually the performance tip to be considered here (a tricky one, I learnt this the hard way as well):

https://docs.julialang.org/en/v1/manual/performance-tips/#Be-aware-of-when-Julia-avoids-specializing
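The linked tip can be sketched as follows. This is a minimal illustration with hypothetical function names: by its documented heuristic, Julia avoids specializing on arguments of type Function (and Type and Vararg) when they are only passed through, and either a type-parameter annotation or a calling closure forces specialization:

```julia
# Sketch of the specialization heuristic (function names are illustrative).
apply(f, x) = f(x)

# `f` is only forwarded, never called here, so Julia may not specialize
# this method on the concrete function type -> runtime dispatch inside.
passthrough(f, x) = apply(f, x)

# Annotating with a type parameter forces specialization on F.
forced(f::F, x) where {F} = apply(f, x)

# Wrapping `f` in a closure that calls it also forces specialization
# (this is why the thread's `fast` variant is fast).
wrapped(f, x) = apply(y -> f(y), x)
```

All three return the same result; they differ only in whether the outer method gets compiled for the concrete function type.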


Thanks!

This is why I had trouble finding the performance bug:

Note that @code_typed and friends will always show you specialized code, even if Julia would not normally specialize that method call. You need to check the method internals if you want to see whether specializations are generated when argument types are changed, i.e., if (@which f(...)).specializations contains specializations for the argument in question.
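As a concrete sketch of that check (the method name here is illustrative; Base.specializations is the supported accessor on Julia ≥ 1.10, while older versions expose the raw m.specializations field):

```julia
using InteractiveUtils  # for @which outside the REPL

# Sketch: inspecting which specializations a method actually has.
g(f, x) = f(x)              # illustrative stand-in method
g(identity, 1)              # call once so a compiled instance exists

m = @which g(identity, 1)   # the Method object backing this call
for mi in Base.specializations(m)
    println(mi.specTypes)   # argument-type tuple each instance was compiled for
end
```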

And the approach used in fast (wrapping f in a closure that calls it) also forces specialization. This explains everything.
