Splatting arguments causes ~30x slowdown

Great question! You are running up against the specialization heuristics of the compiler: basically, the question is “should Julia compile a specialized version of every single function for every single argument combination?” In some cases that’s counterproductive, so at a certain point Julia’s compiler decides to punt and use runtime dispatch.

Here you have two challenges to specialization:

  • splatting
  • the f argument (which is an arbitrary function)

To be clear, to get ideal performance the compiler has to compile a separate version for every combination of argument types in args and for every different f. In this case either obstacle is enough on its own to prevent specialization (especially because f is called with splatted arguments; that might not happen in other cases). Consequently, you have to “solve” both problems to see great performance.

The following modifications suffice for me:

function apply4!(f::F, y, args::Vararg{T,N}) where {F,T,N}
    for i in eachindex(args[1])
        y[i*2 - i%2] = _map_i(f, i, args...)
    end
    return y
end

@inline _map_i(f::F, i, x1) where F = f(x1[i])
@inline _map_i(f::F, i, x1, x2) where F = f(x1[i], x2[i])
@inline _map_i(f::F, i, x1, x2, x3) where F = f(x1[i], x2[i], x3[i])
@inline _map_i(f::F, i, x1, x2, x3, x4, args...) where F = f(x1[i], x2[i], x3[i], x4[i], getindex.(args, i)...)

I did two things here:

  • the f::F syntax seems useless, but it turns out to force specialization on the function argument
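  • the ::Vararg{T,N} annotation forces it to specialize the function for each different number of input arguments. (Note that Vararg{T,N} also requires all the arguments to share a single type T; Vararg{Any,N} lifts that restriction.)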
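
To see the two annotations in isolation, here is a minimal sketch. The helper names (passthrough_generic, passthrough_special, collect_generic, collect_special) are hypothetical and not part of the original code. Julia's performance tips describe the same heuristic: arguments that are a Function or a Vararg and are only passed through to another call may not get their own specialization unless a type parameter is attached.

```julia
# Hypothetical helper names, for illustration only.

# `f` is only passed through to `map`, never called directly in this body,
# so the compiler's heuristic may skip specializing on its concrete type.
passthrough_generic(f, xs) = map(f, xs)

# Attaching a type parameter to `f` requests one specialization per function type.
passthrough_special(f::F, xs) where {F} = map(f, xs)

# Same idea for varargs: a plain slurp that is only re-splatted
# may not be specialized on the number of arguments...
collect_generic(xs...) = tuple(xs...)

# ...while `Vararg{Any,N}` with a bound `N` requests one specialization per count.
collect_special(xs::Vararg{Any,N}) where {N} = tuple(xs...)

xs = [1.0, 2.0, 3.0]

# Both variants compute the same thing; only the compilation strategy differs.
@assert passthrough_generic(abs2, xs) == passthrough_special(abs2, xs)
@assert collect_generic(1, 2.0, "a") == collect_special(1, 2.0, "a")
```

The annotated and unannotated versions are semantically identical, so the change is safe to apply wherever a profile points at dynamic dispatch.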

With these two changes I get the same performance from apply4! that I get from apply1! and apply2!.
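
If you want to check this on your own machine, something like the following rough sketch should do. The baseline apply_unspecialized! is my own stand-in (an assumption, not the function from the original question), and the array sizes are arbitrary. Timings will vary by Julia version and hardware, so I am not quoting numbers.

```julia
# Definitions copied from the post above.
function apply4!(f::F, y, args::Vararg{T,N}) where {F,T,N}
    for i in eachindex(args[1])
        y[i*2 - i%2] = _map_i(f, i, args...)
    end
    return y
end

@inline _map_i(f::F, i, x1) where F = f(x1[i])
@inline _map_i(f::F, i, x1, x2) where F = f(x1[i], x2[i])
@inline _map_i(f::F, i, x1, x2, x3) where F = f(x1[i], x2[i], x3[i])
@inline _map_i(f::F, i, x1, x2, x3, x4, args...) where F = f(x1[i], x2[i], x3[i], x4[i], getindex.(args, i)...)

# Hypothetical baseline (not from the post): the same loop,
# but without the type parameters on `f` and `args`.
function apply_unspecialized!(f, y, args...)
    for i in eachindex(args[1])
        y[i*2 - i%2] = f(getindex.(args, i)...)
    end
    return y
end

# Arbitrary test data; both inputs have the same type, as Vararg{T,N} requires.
a, b = rand(1000), rand(1000)
y_fast = zeros(2 * length(a))
y_slow = zeros(2 * length(a))

# Warm-up calls so that compilation time is excluded from the measurements.
apply4!(+, y_fast, a, b)
apply_unspecialized!(+, y_slow, a, b)
@assert y_fast == y_slow  # both versions must produce identical output

t_fast = @elapsed apply4!(+, y_fast, a, b)
t_slow = @elapsed apply_unspecialized!(+, y_slow, a, b)
println("specialized: ", t_fast, " s, baseline: ", t_slow, " s")
```

A single @elapsed call is noisy; for anything you intend to report, a dedicated benchmarking package will give more trustworthy numbers.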

This might be more intuitive if we had a @specialize macro, so that one could write

function apply4!(@specialize(f), y, @specialize(args...))
    ...
end