For the minimal CUDA.jl example shown below, I see a considerable delay between the HtoD memcopy and the kernel launch, and I'm not able to figure out what causes it. For a larger array, the delay is sometimes much smaller. Is this part of the CUDA API calls, and something to be expected?
using CUDA
using Random

@inline function kernel!(a)
    i = threadIdx().x + blockDim().x * (blockIdx().x - 1)
    a[i] = CUDA.cos(a[i]) + i^0.6 + CUDA.tan(a[i])
    return
end

function main()
    n = 2^12
    a = rand(n)
    threads = 256
    blocks = cld(n, threads)
    a_d = CuArray(a)
    @cuda threads=threads blocks=blocks kernel!(a_d)
    a = Array(a_d)
    return
end

main()
CUDA.@profile main()
The times reported by Nsight Systems are:
HtoD memcopy: ~3 microseconds
Delay: ~157 microseconds
Kernel: ~18 microseconds
DtoH memcopy: ~3 microseconds
Additionally, for debugging scenarios like these, is there a way to make Nsight Systems record the CPU operations and function calls too?
Additionally, NVTX.jl makes it possible to figure out whether this is a GC stall, although I don't see any free operations being queued in that gap, so it seems unlikely.
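To the question about recording CPU activity: Nsight Systems can sample CPU call stacks alongside the CUDA and NVTX traces. A hedged sketch of an invocation (flag names taken from the `nsys` CLI; double-check them against your version with `nsys profile --help`):

```shell
# Sketch: sample CPU call stacks in addition to the CUDA/NVTX/OS-runtime
# traces, writing the report next to the script (verify flags locally).
nsys profile --trace=cuda,nvtx,osrt --sample=cpu \
    --output=report julia --project script.jl
```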
The NVTX annotations were helpful in determining exactly where the kernel launch begins and where the HtoD copy ends. I also did not see any blocks marked as GC in the Nsight profiles; this is a fairly small case, so I'm ruling that out for now.
I still find a substantial delay between the memcopies (both HtoD and DtoH) and the kernel execution. I'm tempted to think the Julia API is stuck on something.
If I wanted to figure this out, do you have any suggestions on how I could go about it? Would it be worth looking into the LLVM code?
I would recommend adding NVTX.@annotate annotations to functions, both in your application and in CUDA.jl (after `]dev`ing the package). In combination with interactive profiling and Revise.jl, this should make it possible to quickly find where the delay is coming from.
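As a sketch of what that annotation looks like (the `launch!` helper is hypothetical, standing in for the launch code from the reproducer; a GPU and an attached profiler are needed for the ranges to show up):

```julia
using CUDA, NVTX

# NVTX.@annotate wraps the whole function body in an NVTX range that
# Nsight Systems displays on the CPU timeline, so you can see exactly
# how long the host-side launch path takes.
NVTX.@annotate function launch!(a_d; threads=256)
    @cuda threads=threads blocks=cld(length(a_d), threads) kernel!(a_d)
    return
end

# Point markers are also available for pinpointing individual statements:
# NVTX.@mark "after launch"
```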
I tried going through and annotating cufunction to see why it was taking so long to launch the GPU kernel, and it seems the calls to methodinstance and cached_compilation take quite some time. At this point, I'm realizing I'm way too deep down this rabbit hole. So, I'm going to chalk it up to compilation overheads and move on. Thanks!
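For what it's worth, that conclusion matches how Julia's JIT behaves in general; here is a pure-Julia sketch (no GPU needed, function name invented) showing first-call compilation latency versus the cached second call:

```julia
# The first call to a function with a new argument type triggers JIT
# compilation; subsequent calls reuse the cached native code, analogous
# to what GPUCompiler.cached_compilation does for device code.
g(x) = sum(abs2, x)

t_first  = @elapsed g(rand(1000))   # includes compile time
t_cached = @elapsed g(rand(1000))   # cache hit, just the call itself

@show t_first t_cached              # t_first is typically orders of magnitude larger
```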
function cufunction(f::F, tt::TT=Tuple{}; kwargs...) where {F,TT}
    cuda = active_state()
    Base.@lock cufunction_lock begin
        # compile the function
        cache = compiler_cache(cuda.context)
        NVTX.@mark "compiler_cache"
        source = methodinstance(F, tt)
        NVTX.@mark "methodinstance"
        config = compiler_config(cuda.device; kwargs...)::CUDACompilerConfig
        NVTX.@mark "compiler_config"
        fun = GPUCompiler.cached_compilation(cache, source, config, compile, link)
        NVTX.@mark "cached_compilation"

        # create a callable object that captures the function instance. we don't need to think
        # about world age here, as GPUCompiler already does and will return a different object
        key = (objectid(source), hash(fun), f)
        NVTX.@mark "hash"
        kernel = get(_kernel_instances, key, nothing)
        NVTX.@mark "get"
        if kernel === nothing
            # create the kernel state object
            state = KernelState(create_exceptions!(fun.mod), UInt32(0))
            NVTX.@mark "kernelstate"
            kernel = HostKernel{F,tt}(f, fun, state)
            NVTX.@mark "hostkernel"
            _kernel_instances[key] = kernel
        end
        NVTX.@mark "after if condition"
        return kernel::HostKernel{F,tt}
    end
end
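The kernel-instance lookup at the end of that function is a plain dictionary cache; a simplified pure-Julia sketch of the same pattern (all names invented for illustration, with a tuple standing in for the HostKernel):

```julia
# Simplified sketch of the _kernel_instances caching above: construct
# the (expensive) object once per key, then reuse it on later lookups.
const _instances = Dict{Any,Any}()

function cached_instance(f, source, fun)
    key = (objectid(source), hash(fun), f)
    return get!(_instances, key) do
        (f, fun)   # stands in for constructing a HostKernel
    end
end

a = cached_instance(sin, :src, :fun)
b = cached_instance(sin, :src, :fun)   # cache hit: returns the stored value
```

`get!` with a do-block is the idiomatic one-step equivalent of the `get`-then-store sequence in `cufunction`; the constructor body only runs on a cache miss.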
Hmm, that shouldn’t be happening. Which version of Julia are you using? On 1.10+, those lookups should be fast.
I’m working on Julia 1.10.4
In that case, can you file an issue on CUDA.jl with the reproducer from the top, your findings on which functions behave badly, and what versions of packages you were using exactly (i.e., a Manifest)?
GitHub issue #2456 opened in CUDA.jl.