While this example looks highly constructed, I ran into this performance issue in real code and reduced it to this MWE. I expect a slight penalty for calling a function dynamically (i.e. _slow instead of the constant slow), but I don't expect a million-fold slowdown. What really confuses me, though, is that LLVM can optimize _slow into nothing, @code_warntype shows no problems, and replacing f with y -> f(y) fixes it.
using BenchmarkTools

apply(f, x) = f(x)

function slow(f, x)
    y = apply(f, x)
    for _ in eachindex(y)
    end
end
_slow = slow

function fast(f, x)
    y = apply(y -> f(y), x)
    for _ in eachindex(y)
    end
end
_fast = fast

display(@benchmark _slow(identity, x) setup=(x=rand(10^6)))
display(@benchmark _fast(identity, x) setup=(x=rand(10^6)))
@code_llvm _slow(identity, rand(10^6))
BenchmarkTools.Trial: 75 samples with 1 evaluation.
 Range (min … max):  61.835 ms … 75.942 ms  ┊ GC (min … max): 3.69% … 3.74%
 Time  (median):     64.182 ms              ┊ GC (median):    4.50%
 Time  (mean ± σ):   64.818 ms ±  2.562 ms  ┊ GC (mean ± σ):  4.27% ± 0.78%
 [histogram bars omitted]
 61.8 ms          Histogram: frequency by time          72.9 ms <
 Memory estimate: 61.02 MiB, allocs estimate: 2998978.
BenchmarkTools.Trial: 1769 samples with 997 evaluations.
 Range (min … max):  20.740 ns … 1.767 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     24.945 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   27.029 ns ± 42.861 ns ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram bars omitted]
 20.7 ns          Histogram: frequency by time          42.2 ns <
 Memory estimate: 0 bytes, allocs estimate: 0.
; @ /Users/x/.julia/dev/slow_HOF_demo.jl:3 within `slow`
define void @julia_slow_5013({}* nonnull align 16 dereferenceable(40) %0) #0 {
top:
; @ /Users/x/.julia/dev/slow_HOF_demo.jl:7 within `slow`
ret void
}
Care to elaborate? Do you have any idea why replacing f with y -> f(y) fixes it?
Also, note that this issue emerged in my real-world use case without using any global variables at all. I only need global variables to communicate with BenchmarkTools.
I think that is just a benchmarking artifact: _slow is a non-constant global, and you are not interpolating it into the benchmark call.
julia> apply(f, x) = f(x)
apply (generic function with 1 method)

julia> function slow(f, x)
           y = apply(f, x)
           for _ in eachindex(y)
           end
       end
slow (generic function with 1 method)

julia> _slow = slow
slow (generic function with 1 method)

julia> @btime _slow(identity, x) setup=(x=rand(10^6))
  52.838 ms (2998978 allocations: 61.02 MiB)

julia> @btime $_slow($identity, x) setup=(x=rand(10^6))
  1.604 ns (0 allocations: 0 bytes)

julia> const _slow2 = slow
slow (generic function with 1 method)

julia> @btime _slow2(identity, x) setup=(x=rand(10^6))
  1.319 ns (0 allocations: 0 bytes)
Ah, ok. The thing is that Julia does not specialize on the type of a function argument if that function is never called inside the method body, only passed through. Thus, the fix is:
julia> function slow(f::F, x) where F  # annotate the F type here!
           y = apply(f, x)
           for _ in eachindex(y)
           end
       end
slow (generic function with 1 method)
julia> function run()
           for f in [slow, fast]
               x = rand(10^6)
               @time f(identity, x)
           end
       end
run (generic function with 1 method)
julia> run()
0.003013 seconds (3.17 k allocations: 170.994 KiB, 99.51% compilation time)
0.000002 seconds
julia> run()
0.000009 seconds
0.000005 seconds
This is actually the performance tip to be considered here (a tricky one, I learnt this the hard way as well): "Be aware of when Julia avoids specializing" (https://docs.julialang.org/en/v1/manual/performance-tips/#Be-aware-of-when-Julia-avoids-specializing). As a heuristic, Julia avoids automatically specializing on argument types Type, Function, and Vararg; it will specialize when the argument is used (e.g. called) inside the method, but not when it is merely passed through to another function.
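A minimal sketch of that heuristic, reusing the apply helper from above (the names passes, calls, and forced are hypothetical, introduced only for illustration):

passes(f, x) = apply(f, x)                # f only passed through: passes is NOT specialized on typeof(f)
calls(f, x)  = f(x)                       # f is called in the body: calls IS specialized
forced(f::F, x) where {F} = apply(f, x)   # the type parameter F forces specialization despite the pass-through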
This is why I had trouble finding the performance bug:
Note that @code_typed and friends will always show you specialized code, even if Julia would not normally specialize that method call. You need to check the method internals if you want to see whether specializations are generated when argument types are changed, i.e., if (@which f(...)).specializations contains specializations for the argument in question.
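For instance, a quick way to run that check on the annotated slow above (a sketch; the layout of .specializations differs across Julia versions, and recent versions also provide Base.specializations(m)):

slow(identity, rand(10^6))             # call once so a specialization is compiled
m = @which slow(identity, rand(10^6))  # fetch the Method object
m.specializations                      # lists MethodInstances, one per generated specialization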