The language is a bit loose; it’s more that the impact is small and vastly outweighed by other performance variations. Looking at specifics, when you don’t specialize on a function f, you 1) have to dynamically dispatch a higher-order function call g(f, args) and 2) cannot infer its return type at compile time, so subsequent code needs to incorporate runtime type checks and dispatches.
In OP’s case, the 2nd part really hurt performance because the subsequent code loops a lot. This can be helped by using a function barrier, in other words putting the subsequent code in a function that can be dispatched to at runtime and compiled for the specific types (no runtime checks and dispatches).
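A minimal sketch of such a function barrier, not OP’s actual code: the names `outer` and `accumulate_steps` are hypothetical, and `@nospecialize` is used here only to imitate a method that is not specialized on f.

```julia
# Hypothetical names for illustration; this is not OP's code.

# The "subsequent code" lives in its own function. It is compiled for the
# concrete type of `step`, so the loop has no runtime type checks.
function accumulate_steps(step, n::Int)
    total = zero(step)
    for _ in 1:n
        total += step
    end
    return total
end

# `@nospecialize` asks the compiler not to specialize on `f`, which stands in
# for the unspecialized case: the return type of `f(x)` is unknown here.
function outer(@nospecialize(f), x, n::Int)
    step = f(x)                       # dynamic call, uninferred result
    return accumulate_steps(step, n)  # one runtime dispatch across the barrier
end

outer(x -> 2x, 1.5, 4)  # 12.0
```

The cost that remains is the dynamic call to f and the single dispatch into `accumulate_steps`, paid once per call of `outer` instead of once per loop iteration.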
The 1st part can hurt performance if the overall function is called frequently or in a loop, again because of uninferred return types and dispatches. Even that 1 runtime dispatch g(f, args) can add up to a performance loss in some scenarios.
But not all code is loops and not all functions are frequently called, so slowing those down has an insignificant impact on the program’s performance. And it’s worth avoiding the costs of compiling more code (excerpt from a link earlier in the thread):
Generating a new type for every function has potentially serious consequences for compiler resource use when combined with Julia’s “specialize on all arguments by default” design. Indeed, the initial implementation of this design suffered from much longer build and test times, higher memory use, and a system image nearly 2x larger than the baseline. In a naive implementation, the problem is bad enough to make the system nearly unusable. Several significant optimizations were needed to make the design practical.
The link goes on to more precisely explain what “using” an argument means (I really dislike that language in the Performance Tips):
Many functions simply “pass through” an argument to somewhere else, e.g. to another function or to a storage location. Such functions do not need to be specialized for every closure that might be passed in. Fortunately this case is easy to distinguish by simply considering whether a function calls one of its arguments (i.e. the argument appears in “head position” somewhere). Performance-critical higher-order functions like map certainly call their argument function and so will still be specialized as expected.
That last part does not mention that inlining of small specialized functions will often propagate across unspecialized calls and remove the performance cost. In fact, map itself does not call the argument function in its body; the call happens in one of several internal functions that get inlined up several layers of calls.
julia> begin
           foo1(x) = x + 1
           bar(f, x) = f(x)         # specializes on f, so bar(foo1, x) inlines to x+1
           baz(f, x) = bar(f, x)    # not specialized on f, so only inlines to f(x)
           foo1_2(x) = baz(foo1, x) # inlines to bar(foo1, x) to foo1(x) to x+1
           using BenchmarkTools
       end;
julia> @which(bar(foo1, 1)).specializations # specialized on foo1
svec(MethodInstance for bar(::typeof(foo1), ::Int64), nothing, nothing, nothing, nothing, nothing, nothing, nothing)
julia> @which(baz(foo1, 1)).specializations # not specialized on foo1
svec(MethodInstance for baz(::Function, ::Int64), nothing, nothing, nothing, nothing, nothing, nothing, nothing)
julia> @btime @noinline foo1($1); # loop has extra cost of function call
4.319 ns (0 allocations: 0 bytes)
julia> @btime @noinline bar($foo1, $1);
4.320 ns (0 allocations: 0 bytes)
julia> @btime @noinline baz($foo1, $1); # forces unspecialized call, otherwise inlined
20.943 ns (0 allocations: 0 bytes)
julia> @btime @noinline foo1_2($1);
4.319 ns (0 allocations: 0 bytes)
My guess is this reflects the inlining of process_matrix_redux(_f, ...) into the benchmark loop. You can see this in my example above, before doing the benchmarking:
julia> baz(foo1, 1); (@which baz(foo1, 1)).specializations
svec(MethodInstance for baz(::Function, ::Int64), nothing, nothing, nothing, nothing, nothing, nothing, nothing)
julia> foo1_2(1); (@which baz(foo1, 1)).specializations
svec(MethodInstance for baz(::Function, ::Int64), MethodInstance for baz(::typeof(foo1), ::Int64), nothing, nothing, nothing, nothing, nothing, nothing)
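If you would rather not rely on inlining to get a specialized pass-through method, the Performance Tips describe forcing specialization by giving the function argument a type parameter. A small sketch under that assumption, where `baz_forced` is a hypothetical name added for illustration and the other definitions repeat my example:

```julia
foo1(x) = x + 1
bar(f, x) = f(x)       # calls f, so it is specialized on typeof(f)
baz(f, x) = bar(f, x)  # only passes f through, so it may compile for ::Function

# Hypothetical variant: the type parameter F requests a specialization
# on typeof(f) even though f is only passed through.
baz_forced(f::F, x) where {F} = bar(f, x)

baz(foo1, 1)         # 2
baz_forced(foo1, 1)  # 2
```

Both calls return the same value; the difference is only in which method instances get compiled, at the price of compiling one more specialization for every function passed in.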