Delays shown in Nsight Systems between HtoD memcopy and kernel launch when using CUDA.jl

For the CUDA.jl minimal example shown below, I see a considerable delay between the HtoD memcopy and the kernel launch, and I’m not able to figure out what causes it. For larger arrays, this delay is sometimes much smaller. Is this part of the CUDA API calls, and is it something to be expected?

using CUDA
using Random

@inline function kernel!(a)
    i = threadIdx().x + blockDim().x*(blockIdx().x-1)  # global (1-based) thread index
    a[i] = CUDA.cos(a[i]) + i^0.6 + CUDA.tan(a[i])
    return
end

function main()
    n = 2^12
    a = rand(n)
    
    threads = 256
    blocks = cld(n, threads)
    a_d = CuArray(a)                                  # HtoD copy
    @cuda threads=threads blocks=blocks kernel!(a_d)  # kernel launch
    a = Array(a_d)                                    # DtoH copy

    return
end

main()
CUDA.@profile main()

The times reported by Nsight Systems are:
HtoD memcopy: ~3 microseconds
Delay: ~157 microseconds
Kernel: ~18 microseconds
DtoH memcopy: ~3 microseconds

Additionally, for debugging scenarios like these, is there a way to make Nsight Systems record the CPU operations and function calls too?


In addition, this makes it possible to figure out whether this is a GC stall (NVTX.jl), although I don’t see any free operations being queued in that gap, so it seems unlikely.
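For example, here is a rough sketch of what that could look like for the reproducer above (the range labels are just illustrative, and kernel! is the one from the top post); make sure NVTX tracing is enabled in Nsight Systems (e.g. nsys profile --trace=cuda,nvtx):

using CUDA, NVTX, Random

function main()
    n = 2^12
    a = rand(n)

    threads = 256
    blocks = cld(n, threads)

    # pre-declare so the assignment inside the range block stays visible afterwards
    local a_d

    # each NVTX.@range shows up as a labelled span on the NVTX row of the timeline,
    # which makes it easy to see which host-side step the ~157 us gap is attached to
    NVTX.@range "HtoD copy" begin
        a_d = CuArray(a)
    end
    NVTX.@range "kernel launch" begin
        @cuda threads=threads blocks=blocks kernel!(a_d)
    end
    NVTX.@range "DtoH copy" begin
        a = Array(a_d)
    end

    return
end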


The NVTX annotations were helpful in determining exactly where the kernel launch begins and where the HtoD copy ends. I also did not see any GC blocks in the Nsight profiles, and this is a fairly small case, so I’m ruling that out for now.
I still find that there’s a substantial delay between the memcopies (both HtoD and DtoH) and the kernel execution. I’m tempted to think the time is being spent somewhere on the Julia side of the API.

If I wanted to try to figure this out, do you have any suggestions on how I could go about doing that? Would it be worth looking into the LLVM code?

I would recommend adding NVTX.@annotate annotations to functions, both in your application and in CUDA.jl (after ]dev-ing the package). In combination with interactive profiling and Revise.jl, this should make it possible to quickly find where the delay is coming from.
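For example, a rough sketch of what that looks like on a function in your own code (the launch_step! name is purely illustrative, and kernel! is the one from the reproducer above); every call to the annotated function then shows up as a range on the NVTX row:

using CUDA, NVTX

# every call to this wrapper appears as an NVTX range named after the function
NVTX.@annotate function launch_step!(a_d, threads, blocks)
    @cuda threads=threads blocks=blocks kernel!(a_d)
    return
end

The same pattern works inside a ]dev-ed CUDA.jl checkout, and Revise.jl should pick up those edits without restarting Julia.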


I tried going through and annotating cufunction to see why it was taking so long to launch the GPU kernel, and it seems like the calls to methodinstance and cached_compilation take quite some time. At this point I’m realizing I’m way too deep down this rabbit hole, so I’m going to chalk it up to compilation overhead and move on. Thanks!

function cufunction(f::F, tt::TT=Tuple{}; kwargs...) where {F,TT}
    cuda = active_state()

    Base.@lock cufunction_lock begin
        # compile the function
        cache = compiler_cache(cuda.context)
        NVTX.@mark "compiler_cache"
        source = methodinstance(F, tt)
        NVTX.@mark "methodinstance"
        config = compiler_config(cuda.device; kwargs...)::CUDACompilerConfig
        NVTX.@mark "compiler_config"
        fun = GPUCompiler.cached_compilation(cache, source, config, compile, link)
        
        NVTX.@mark "cached_compilation"
        # create a callable object that captures the function instance. we don't need to think
        # about world age here, as GPUCompiler already does and will return a different object
        key = (objectid(source), hash(fun), f)
        NVTX.@mark "hash"
        kernel = get(_kernel_instances, key, nothing)
        NVTX.@mark "get"
        if kernel === nothing
            # create the kernel state object
            state = KernelState(create_exceptions!(fun.mod), UInt32(0))
            NVTX.@mark "kernelstate"
            
            kernel = HostKernel{F,tt}(f, fun, state)
            NVTX.@mark "hostkernel"
            _kernel_instances[key] = kernel
        end 
        NVTX.@mark "after if condition"
        return kernel::HostKernel{F,tt}
    end 
end 

Hmm, that shouldn’t be happening. Which version of Julia are you using? On 1.10+, those lookups should be fast.
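One way to check that in isolation is a rough sketch like the following (assuming BenchmarkTools is installed; the launch=false form makes @cuda return the compiled kernel object without launching it, so after a warm-up call you are mostly timing the cached lookups):

using CUDA, BenchmarkTools

a_d = CuArray(rand(2^12))

# the first call pays the full compilation cost
@cuda launch=false kernel!(a_d)

# later calls should only hit the caches (methodinstance, cached_compilation,
# _kernel_instances), which is expected to be fast on 1.10+
@btime @cuda launch=false kernel!($a_d)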

I’m working with Julia 1.10.4.

In that case, can you file an issue on CUDA.jl with the reproducer from the top, your findings on which functions behave badly, and what versions of packages you were using exactly (i.e., a Manifest)?

GitHub issue #2456 opened in CUDA.jl.
