Improving GPU performance for symbolic regression

Alright, so that big red bar in the flame graph turned out to be this call in CUDA.jl's `execution.jl`:

```julia
function (kernel::HostKernel)(args...; threads::CuDim=1, blocks::CuDim=1, kwargs...)
    call(kernel, map(cudaconvert, args)...; threads, blocks, kwargs...)
end
```

If I simply apply this change to it:

```diff
-function (kernel::HostKernel)(args...; threads::CuDim=1, blocks::CuDim=1, kwargs...)
+function (kernel::HostKernel)(args::Vararg{Any,M}; threads::CuDim=1, blocks::CuDim=1, kwargs...) where {M}
```

then the method specializes on the argument types, and I get another ~20% improvement!

This is because Julia doesn't automatically specialize on `Vararg` arguments that are only passed through to another function (see "Be aware of when Julia avoids specializing" in the Performance Tips section of the Julia manual) – so the `map(cudaconvert, args)...` call causes a bit of damage.
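To illustrate the rule in isolation (a standalone sketch with made-up function names, not CUDA.jl code): the first definition below only passes `args` through to `map`, so Julia may compile a single generic method instance, while the second uses the `Vararg{Any,N}` + `where {N}` pattern to force a fresh specialization for each argument-type tuple.

```julia
# Hypothetical functions illustrating the Vararg specialization rule.
# `launch_slow` only forwards `args` to another function, so Julia
# avoids specializing on the argument types; `launch_fast` forces
# specialization by making the Vararg length a type parameter.
launch_slow(args...) = map(identity, args)
launch_fast(args::Vararg{Any,N}) where {N} = map(identity, args)
```

Both behave identically at the call site; only the compiled code differs.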

@maleadt, are you okay with upstreaming this change?