I have code written with CUDA.jl and I am trying to improve its performance. Running CUDA.@profile, I learned that more than 60% of the host-side activity is spent in cudaMemcpyAsync:
julia> CUDA.@profile myfun(...)
...
Host-side activity: calling CUDA APIs took 397.28 ms (79.03% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────────────┬──────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼─────────────────────────────────────────────┼──────────────────────────┤
│ 60.98% │ 306.54 ms │ 200 │ 1.53 ms ± 1.53 ( 0.01 ‥ 3.22) │ cudaMemcpyAsync │
│ 6.55% │ 32.92 ms │ 8502 │ 3.87 µs ± 9.99 ( 2.62 ‥ 921.49) │ cuLaunchKernel │
│ 2.21% │ 11.1 ms │ 81120 │ 136.82 ns ± 4582.86 ( 0.0 ‥ 1.30391121e6) │ cuStreamGetCaptureInfo │
│ 0.50% │ 2.49 ms │ 800 │ 3.11 µs ± 2.03 ( 1.43 ‥ 22.65) │ cuMemAllocFromPoolAsync │
│ 0.46% │ 2.3 ms │ 600 │ 3.83 µs ± 1.01 ( 3.1 ‥ 23.37) │ cudaLaunchKernel │
│ 0.08% │ 413.42 µs │ 800 │ 516.77 ns ± 224.04 (238.42 ‥ 2622.6) │ cudaStreamIsCapturing │
...
I would like to reduce the number of cudaMemcpyAsync calls, but I am struggling to find out where it is called from. The CUDA documentation says the function “[c]opies data between host and device.” I am not using any custom CUDA kernels, so I think copying between the host and device occurs only through copyto!() in my code. However, when I profile copyto!() between the host and device in either direction, I don’t see cudaMemcpyAsync being used:
julia> d = CUDA.rand(1024,1024,1024);
julia> h = rand(Float32, 1024,1024,1024);
julia> CUDA.@profile copyto!(d, h)
Profiler ran for 895.97 ms, capturing 23 events.
Host-side activity: calling CUDA APIs took 895.87 ms (99.99% of the trace)
┌──────────┬────────────┬───────┬────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name │
├──────────┼────────────┼───────┼────────────────────────┤
│ 99.99% │ 895.86 ms │ 1 │ cuMemcpyHtoDAsync │
│ 0.00% │ 2.38 µs │ 1 │ cuStreamSynchronize │
│ 0.00% │ 238.42 ns │ 1 │ cuStreamGetCaptureInfo │
└──────────┴────────────┴───────┴────────────────────────┘
Device-side activity: GPU was busy for 894.8 ms (99.87% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name │
├──────────┼────────────┼───────┼──────────────────────────────────┤
│ 99.87% │ 894.8 ms │ 1 │ [copy pageable to device memory] │
└──────────┴────────────┴───────┴──────────────────────────────────┘
julia> CUDA.@profile copyto!(h, d)
Profiler ran for 835.24 ms, capturing 30 events.
Host-side activity: calling CUDA APIs took 797.95 ms (95.54% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼────────────────────────────────────┼────────────────────────┤
│ 95.53% │ 797.92 ms │ 1 │ │ cuMemcpyDtoHAsync │
│ 0.00% │ 7.63 µs │ 2 │ 3.81 µs ± 0.0 ( 3.81 ‥ 3.81) │ cuStreamSynchronize │
│ 0.00% │ 238.42 ns │ 1 │ │ cuStreamGetCaptureInfo │
└──────────┴────────────┴───────┴────────────────────────────────────┴────────────────────────┘
Device-side activity: GPU was busy for 796.72 ms (95.39% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name │
├──────────┼────────────┼───────┼──────────────────────────────────┤
│ 95.39% │ 796.72 ms │ 1 │ [copy device to pageable memory] │
└──────────┴────────────┴───────┴──────────────────────────────────┘
Where is cudaMemcpyAsync typically used in CUDA.jl? I dug through the CUDA.jl repository but could not find any explicit use of the function.
More generally, are there ways to identify Julia functions in one’s code that call CUDA functions like cudaMemcpyAsync?
We don’t call cudaMemcpyAsync ourselves; CUDA.jl goes through the driver API (the cuMemcpy*Async calls you see in your copyto! profiles above), so it is probably being called by a CUDA library that uses the runtime API. Some GPU-unfriendly operations actually fall back to computing on the CPU (like SVD, IIRC), which can require copies like that. Maybe you can identify the culprit by looking at the number of calls, or by adding NVTX ranges to your application, as in the sketch below.
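For example, something along these lines (a minimal sketch using the NVTX.jl package; step!, its body, and the range names are placeholders for your own code):

using CUDA, LinearAlgebra
using NVTX  # NVTX.jl provides the annotation macros

# Wrap each candidate region in an NVTX range so the profiler can attribute
# API calls (like cudaMemcpyAsync) to a region of your own code.
function step!(d)                    # placeholder for your real loop body
    NVTX.@range "normalize" begin
        normalize!(d)
    end
    NVTX.@range "update" begin
        d .+= 1f0
    end
    return d
end

d = CUDA.rand(1024);
CUDA.@profile step!(d)   # the ranges also show up on the NSight Systems timeline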
I was able to identify one place where cudaMemcpyAsync is used: normalize!:
julia> d = CUDA.rand(1024);
julia> CUDA.@profile normalize!(d)
Profiler ran for 1.98 s, capturing 39 events.
Host-side activity: calling CUDA APIs took 223.64 µs (0.01% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────┤
│ 0.01% │ 99.18 µs │ 1 │ │ cudaFuncGetAttributes │
│ 0.00% │ 57.94 µs │ 3 │ 19.31 µs ± 11.62 ( 5.96 ‥ 27.18) │ cudaLaunchKernel │
│ 0.00% │ 40.53 µs │ 2 │ 20.27 µs ± 3.71 ( 17.64 ‥ 22.89) │ cudaMemcpyAsync │
...
Is this expected? If so, is there a way to perform normalize! on a CUDA array without copying data between the host and device? I call normalize! in a loop, so this is quite costly.
UPDATE: I found that norm also uses cudaMemcpyAsync:
julia> CUDA.@profile norm(d)
Profiler ran for 801.43 ms, capturing 28 events.
Host-side activity: calling CUDA APIs took 170.23 µs (0.02% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────┤
│ 0.01% │ 68.9 µs │ 1 │ │ cudaFuncGetAttributes │
│ 0.00% │ 39.58 µs │ 2 │ 19.79 µs ± 0.34 ( 19.55 ‥ 20.03) │ cudaMemcpyAsync │
│ 0.00% │ 36.0 µs │ 2 │ 18.0 µs ± 16.35 ( 6.44 ‥ 29.56) │ cudaLaunchKernel │
...
Ah right, I wasn’t familiar with it. Well, in that case the reason is simple: CUBLAS.nrm2 (used by norm, which is in turn used by normalize!) reduces to a scalar, so the operation needs to wait for the GPU to finish before it can copy the scalar result back to the host. All that wait is “hidden” behind the time it takes cudaMemcpyAsync to complete, but that doesn’t mean the copy itself takes that long. This is clear when looking at the device-side results, which show that the copy only takes a fraction of the time it takes to execute the two kernels:
julia> CUDA.@profile CUBLAS.nrm2(d)
Profiler ran for 5.03 ms, capturing 29 events.
Host-side activity: calling CUDA APIs took 5.02 ms (99.66% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────────┤
│ 98.38% │ 4.95 ms │ 2 │ 2.48 ms ± 3.49 ( 0.01 ‥ 4.94) │ cudaMemcpyAsync │
│ 0.75% │ 37.67 µs │ 1 │ │ cudaFuncGetAttributes │
│ 0.45% │ 22.65 µs │ 2 │ 11.32 µs ± 8.94 ( 5.01 ‥ 17.64) │ cudaLaunchKernel │
│ 0.04% │ 2.15 µs │ 1 │ │ cudaEventRecord │
│ 0.02% │ 953.67 ns │ 5 │ 190.73 ns ± 199.48 ( 0.0 ‥ 476.84) │ cudaStreamGetCaptureInfo_v2 │
│ 0.02% │ 953.67 ns │ 1 │ │ cudaStreamSynchronize │
│ 0.00% │ 238.42 ns │ 4 │ 59.6 ns ± 119.21 ( 0.0 ‥ 238.42) │ cudaGetLastError │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────────┘
Device-side activity: GPU was busy for 4.94 ms (98.14% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼────────────────────────────────────┼────────────────────────────────────────────────
│ 98.09% │ 4.94 ms │ 2 │ 2.47 ms ± 3.49 ( 0.0 ‥ 4.93) │ _Z18nrm2_64addr_kernelIfffLi1ELi0ELi128EEv16c ⋯
│ 0.04% │ 1.91 µs │ 1 │ │ [copy device to pageable memory] ⋯
│ 0.01% │ 715.26 ns │ 1 │ │ [copy pageable to device memory] ⋯
└──────────┴────────────┴───────┴────────────────────────────────────┴────────────────────────────────────────────────
The 4.95 ms it “takes” to complete the cudaMemcpyAsync API call from the CPU’s point of view is actually 4.94 ms of kernel execution time + 1.91 µs of copy time.
The takeaway here is that you want to avoid synchronizing operations like that when possible. A “better” implementation of normalize! would compute the norm without synchronizing and use that unmaterialized scalar as input to the normalization operation. These are the kinds of things you may have to take into account when optimizing code for the GPU.
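A rough sketch of that idea (this is not the CUDA.jl or LinearAlgebra implementation; normalize_nosync! is a hypothetical helper, and it skips the rescaling that nrm2 does to avoid overflow):

using CUDA

function normalize_nosync!(x::CuVector)
    # Keep the norm as a 1-element CuArray so it never has to be copied back
    # to the host; everything stays asynchronous on the GPU stream.
    nrm = sqrt.(sum(abs2, x; dims=1))   # 1-element CuArray, no host copy
    x ./= nrm                           # broadcasted divide, still asynchronous
    return x
end

d = CUDA.rand(1024);
CUDA.@profile normalize_nosync!(d)   # the profile should no longer show cudaMemcpyAsync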
Thanks for the explanation! That makes sense. The only problem is that, unlike yours, my result does not show matching host- and device-side timings:
Here, the host-side cudaMemcpyAsync took 904.08 µs, which is ~50X longer than the sum of the device-side times (14.31 + 1.91 + 1.19) µs. I am curious if you have any insights as to why this happens on my machine…
EDIT: I executed the code multiple times but got almost the same results, so I don’t think this is a time-to-first-plot (TTFP) issue.
All these operations happen as part of the CUBLAS nrm2 implementation, so it cannot be a Julia TTFP issue.
I’m not sure what’s happening here. At this point, I’d recommend moving to a more powerful profiler (NSight Systems), and hopefully the timeline reveals what’s up. The next / alternative step is to try this with a CUDA C reproducer, which should be easy enough (calling nrm2 on a large array is a simple sequence of operations that ChatGPT or whatever could generate for you).
Also, could you post the output of CUDA.versioninfo()? I want to make sure you’re using the latest version of the CUDA toolkit.