Delays shown in Nsight Systems between HtoD memcopy and kernel launch when using CUDA.jl

For the CUDA.jl minimal example shown below, I see a considerable delay between the HtoD memcopy and the kernel launch, and I’m not able to figure out what causes it. For larger arrays, this delay is sometimes much smaller. Is this part of the CUDA API calls, and is it something to be expected?

using CUDA
using Random

@inline function kernel!(a)
    i = threadIdx().x + blockDim().x*(blockIdx().x-1)  # global (1-based) thread index
    a[i] = CUDA.cos(a[i]) + i^0.6 + CUDA.tan(a[i])
    return
end

function main()
    n = 2^12
    a = rand(n)
    
    threads = 256
    blocks = cld(n, threads)
    a_d = CuArray(a)                                  # HtoD copy
    @cuda threads=threads blocks=blocks kernel!(a_d)  # kernel launch
    a = Array(a_d)                                    # DtoH copy

    return
end

main()
CUDA.@profile main()

The times reported by Nsight Systems are:
HtoD memcopy: ~3 microseconds
Delay: ~157 microseconds
Kernel: ~18 microseconds
DtoH memcopy: ~3 microseconds

Additionally, for debugging scenarios like these, is there a way to make Nsight Systems record the CPU operations and function calls too?


In addition, this makes it possible to figure out whether this is a GC stall (NVTX.jl), although I don’t see any free operations being queued in that gap, so it seems unlikely.
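For example, here is a rough sketch of what that could look like for the reproducer above (the range labels are just illustrative, and kernel! is the one from the top post); make sure NVTX tracing is enabled in Nsight Systems (e.g. nsys profile --trace=cuda,nvtx):

using CUDA, NVTX, Random

function main()
    n = 2^12
    a = rand(n)

    threads = 256
    blocks = cld(n, threads)

    # pre-declare so the assignment inside the range block stays visible afterwards
    local a_d

    # each NVTX.@range shows up as a labelled span on the NVTX row of the timeline,
    # which makes it easy to see which host-side step the ~157 us gap is attached to
    NVTX.@range "HtoD copy" begin
        a_d = CuArray(a)
    end
    NVTX.@range "kernel launch" begin
        @cuda threads=threads blocks=blocks kernel!(a_d)
    end
    NVTX.@range "DtoH copy" begin
        a = Array(a_d)
    end

    return
end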


The NVTX annotations were helpful in determining exactly where the kernel launch begins and where the HtoD copy ends. I also did not see any GC blocks in the Nsight profiles, and this is a fairly small case, so I’m ruling that out for now.
I still find that there’s a substantial delay between the memcopies (both HtoD and DtoH) and the kernel execution. I’m tempted to think the time is being spent somewhere on the Julia side of the API.

If I wanted to try to figure this out, do you have any suggestions on how I could go about doing that? Would it be worth looking into the LLVM code?

I would recommend adding NVTX.@annotate annotations to functions, both in your application and in CUDA.jl (after ]dev-ing the package). In combination with interactive profiling and Revise.jl, this should make it possible to quickly find where the delay is coming from.
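For example, a rough sketch of what that looks like on a function in your own code (the launch_step! name is purely illustrative, and kernel! is the one from the reproducer above); every call to the annotated function then shows up as a range on the NVTX row:

using CUDA, NVTX

# every call to this wrapper appears as an NVTX range named after the function
NVTX.@annotate function launch_step!(a_d, threads, blocks)
    @cuda threads=threads blocks=blocks kernel!(a_d)
    return
end

The same pattern works inside a ]dev-ed CUDA.jl checkout, and Revise.jl should pick up those edits without restarting Julia.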


I tried going through and annotating cufunction to see why it was taking so long to launch the GPU kernel, and it seems like the calls to methodinstance and cached_compilation take quite some time. At this point I’m realizing I’m way too deep down this rabbit hole, so I’m going to chalk it up to compilation overhead and move on. Thanks!

function cufunction(f::F, tt::TT=Tuple{}; kwargs...) where {F,TT}
    cuda = active_state()

    Base.@lock cufunction_lock begin
        # compile the function
        cache = compiler_cache(cuda.context)
        NVTX.@mark "compiler_cache"
        source = methodinstance(F, tt)
        NVTX.@mark "methodinstance"
        config = compiler_config(cuda.device; kwargs...)::CUDACompilerConfig
        NVTX.@mark "compiler_config"
        fun = GPUCompiler.cached_compilation(cache, source, config, compile, link)
        
        NVTX.@mark "cached_compilation"
        # create a callable object that captures the function instance. we don't need to think
        # about world age here, as GPUCompiler already does and will return a different object
        key = (objectid(source), hash(fun), f)
        NVTX.@mark "hash"
        kernel = get(_kernel_instances, key, nothing)
        NVTX.@mark "get"
        if kernel === nothing
            # create the kernel state object
            state = KernelState(create_exceptions!(fun.mod), UInt32(0))
            NVTX.@mark "kernelstate"
            
            kernel = HostKernel{F,tt}(f, fun, state)
            NVTX.@mark "hostkernel"
            _kernel_instances[key] = kernel
        end 
        NVTX.@mark "after if condition"
        return kernel::HostKernel{F,tt}
    end 
end 

Hmm, that shouldn’t be happening. Which version of Julia are you using? On 1.10+, those lookups should be fast.
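One way to check that in isolation is a rough sketch like the following (assuming BenchmarkTools is installed; the launch=false form makes @cuda return the compiled kernel object without launching it, so after a warm-up call you are mostly timing the cached lookups):

using CUDA, BenchmarkTools

a_d = CuArray(rand(2^12))

# the first call pays the full compilation cost
@cuda launch=false kernel!(a_d)

# later calls should only hit the caches (methodinstance, cached_compilation,
# _kernel_instances), which is expected to be fast on 1.10+
@btime @cuda launch=false kernel!($a_d)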

I’m working with Julia 1.10.4.

In that case, can you file an issue on CUDA.jl with the reproducer from the top, your findings on which functions behave badly, and what versions of packages you were using exactly (i.e., a Manifest)?

GitHub issue #2456 opened in CUDA.jl.
