Alright, so that big red bar in the flame graph turned out to be this call in CUDA.jl's `execution.jl`:
```julia
function (kernel::HostKernel)(args...; threads::CuDim=1, blocks::CuDim=1, kwargs...)
    call(kernel, map(cudaconvert, args)...; threads, blocks, kwargs...)
end
```
If I simply apply this change:
```diff
- function (kernel::HostKernel)(args...; threads::CuDim=1, blocks::CuDim=1, kwargs...)
+ function (kernel::HostKernel)(args::Vararg{Any,M}; threads::CuDim=1, blocks::CuDim=1, kwargs...) where {M}
```
the method specializes on the argument types and I get another 20% improvement!
This is because Julia doesn't automatically specialize on `Vararg` arguments that are only passed through to another function (see "Be aware of when Julia avoids specializing" in the Performance Tips section of the Julia manual), so the `map(cudaconvert, args)...` call goes through non-specialized code and causes a bit of damage.
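The heuristic can be seen outside CUDA.jl with a small sketch (the function names here are hypothetical, just for illustration):

```julia
# Julia's heuristic: a `Vararg` argument that is only forwarded to another
# call (here, to `map`) may not trigger specialization on the argument types.
passthrough(args...) = map(string, args)

# Spelling the signature as `Vararg{Any,N} where {N}` forces specialization,
# which is exactly the change proposed above.
forced(args::Vararg{Any,N}) where {N} = map(string, args)

# Both behave identically; the difference is only in what gets compiled.
@assert passthrough(1, 2.0) == forced(1, 2.0)
```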
@maleadt are you okay if we upstream this change?