Great question! You are running up against the specialization heuristics of the compiler: basically, the question is “should Julia compile a specialized version of every single function for every single argument combination?” In some cases that’s counterproductive, so at a certain point Julia’s compiler decides to punt and use runtime dispatch.
Here you have two challenges to specialization:
- splatting
- the `f` argument (which is an arbitrary function)
To be clear, to get ideal performance the compiler has to generate a separate version for every combination of `args` types and for every different `f`. In this case either obstacle is enough on its own to prevent specialization (especially because `f` is being called with varargs; that might not happen in other cases). Consequently, you have to “solve” both problems to see great performance.
The following modifications suffice for me:
```julia
function apply4!(f::F, y, args::Vararg{T,N}) where {F,T,N}
    for i in eachindex(args[1])
        y[i*2 - i%2] = _map_i(f, i, args...)
    end
    return y
end
@inline _map_i(f::F, i, x1) where F = f(x1[i])
@inline _map_i(f::F, i, x1, x2) where F = f(x1[i], x2[i])
@inline _map_i(f::F, i, x1, x2, x3) where F = f(x1[i], x2[i], x3[i])
@inline _map_i(f::F, i, x1, x2, x3, x4, args...) where F = f(x1[i], x2[i], x3[i], x4[i], getindex.(args, i)...)
```
I did two things here:
- the `f::F` syntax seems useless, but it turns out to force specialization on the function argument
- the `::Vararg{T,N}` forces it to specialize the function for all different numbers of input arguments.
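To see the two tricks in isolation, here is a minimal sketch. All four function names are hypothetical and exist only for illustration; each pair differs only in its signature, which is the part that affects whether the compiler is forced to specialize.

```julia
# Hypothetical helper names, for illustration only.

# `f` is only passed through to `ntuple`, so Julia may decide
# not to specialize this method on the concrete function type.
pass_through(f, n) = ntuple(f, n)

# Adding a type parameter for `f` forces one specialization per function type.
pass_through_spec(f::F, n) where {F} = ntuple(f, n)

# Slurped arguments: Julia may decide not to specialize on the argument count.
collect_args(xs::Int...) = tuple(xs...)

# An explicit `Vararg{Int,N}` forces one specialization per argument count `N`.
collect_args_spec(xs::Vararg{Int,N}) where {N} = tuple(xs...)

# Both variants return the same values; only the generated code differs.
@assert pass_through(abs2, 3) == pass_through_spec(abs2, 3) == (1, 4, 9)
@assert collect_args(1, 2, 3) == collect_args_spec(1, 2, 3) == (1, 2, 3)
```

The annotations do not change what the functions compute, so they are safe to add to an existing method when profiling points at runtime dispatch.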
With these two changes I get the same performance from `apply4!` that I get from `apply1!` and `apply2!`.
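If you want to try this yourself, the following is a rough sketch of a sanity check. The input data and the `measure` helper are assumptions made up for this example, and the definitions are repeated (only the one- and two-array `_map_i` methods) so that the snippet runs on its own. Note that `Vararg{T,N}` as written requires every array in `args` to have the same type `T`, which is why both inputs are `Vector{Float64}` here.

```julia
# Definitions repeated from the answer so this snippet is self-contained.
function apply4!(f::F, y, args::Vararg{T,N}) where {F,T,N}
    for i in eachindex(args[1])
        y[i*2 - i%2] = _map_i(f, i, args...)
    end
    return y
end
@inline _map_i(f::F, i, x1) where F = f(x1[i])
@inline _map_i(f::F, i, x1, x2) where F = f(x1[i], x2[i])

# Made-up test data. The loop writes to index i*2 - i%2, so `y` needs
# at least twice as many slots as the inputs have elements.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
y = zeros(2 * length(a))

apply4!(+, y, a, b)

# For i = 1, 2, 3, 4 the written indices are 1, 4, 5, 8.
@assert y == [11.0, 0.0, 0.0, 22.0, 33.0, 0.0, 0.0, 44.0]

# Hypothetical helper: measures allocations inside a function, away from global scope.
measure(f::F, y, a, b) where {F} = @allocated apply4!(f, y, a, b)
measure(+, y, a, b)  # warm-up call so compilation is not counted
println("bytes allocated: ", measure(+, y, a, b))
```

A large allocation count that grows with the input length is usually a hint that runtime dispatch is still happening inside the loop. For actual timings, a benchmarking package is a better tool than a single measurement.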
This might be more intuitive if we had a `@specialize` macro, so that one could write

```julia
function apply4!(@specialize(f), y, @specialize(args...))
    ...
end
```