These two percentages cannot simply be added together, as they are measured independently: on the CPU, 2.3% of the trace is spent doing GPU-related work – which isn’t necessarily bad, if the CPU is expected to do other work concurrently with the GPU – while, separately, the GPU is busy only 6% of the time. If the application is expected to do nothing but GPU work, that is a low utilization.
One possible, and typical, reason is that unrelated CPU work is performed in between submitting GPU operations, failing to keep the GPU saturated. This can be mitigated by batching work into larger operations, or by minimizing and optimizing the work done on the CPU. The NVIDIA Nsight Systems visual profiler can help here (see the CUDA.jl documentation), because a timeline is much easier to interpret than a textual report. In addition, you can annotate your application with NVTX ranges (from NVTX.jl) to visualize on the timeline where time is spent.
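As a minimal sketch of such annotation, the host code can be wrapped in `NVTX.@range` blocks, which then show up as named, colored spans on the Nsight Systems timeline (the function and range names below are illustrative, not part of any API):

```julia
using CUDA, NVTX

function step!(a, b)
    # this CPU-side preparation appears as its own range on the timeline,
    # making it easy to spot time spent between GPU operations
    NVTX.@range "prepare inputs" begin
        a .+= 1
    end
    # the GPU-heavy part gets a separate range
    NVTX.@range "matmul" begin
        a * b
    end
end

a = CUDA.rand(1024, 1024)
b = CUDA.rand(1024, 1024)
c = step!(a, b)
```

Running this under Nsight Systems (e.g. `nsys profile julia script.jl`) overlays the "prepare inputs" and "matmul" ranges on the CUDA API and kernel rows, so gaps where the GPU sits idle between submissions become immediately visible.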