These two percentages cannot simply be added together, as they are measured independently: on the CPU, 2.3% of the trace is spent doing GPU-related work – which isn’t necessarily bad, if the CPU is expected to do other work concurrently with the GPU – while, separately, the GPU is busy only 6% of the time. If the application is expected to do nothing but GPU work, that is a low utilization.
One possible, and typical, reason is that unrelated CPU work is performed in between submitting GPU operations, failing to keep the GPU saturated. This can be mitigated by batching work into larger operations, or by minimizing and optimizing the work done on the CPU. The NVIDIA Nsight Systems visual profiler can help here (see the CUDA.jl documentation), because a timeline is much easier to interpret than a textual report. In addition, you can annotate your application with NVTX ranges (from NVTX.jl) to visualize on the timeline where time is spent.
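As a minimal sketch of such annotation, the host code can be wrapped in `NVTX.@range` blocks, which then show up as named, colored spans on the Nsight Systems timeline (the function and range names below are illustrative, not part of any API):

```julia
using CUDA, NVTX

function step!(a, b)
    # this CPU-side preparation appears as its own range on the timeline,
    # making it easy to spot time spent between GPU operations
    NVTX.@range "prepare inputs" begin
        a .+= 1
    end
    # the GPU-heavy part gets a separate range
    NVTX.@range "matmul" begin
        a * b
    end
end

a = CUDA.rand(1024, 1024)
b = CUDA.rand(1024, 1024)
c = step!(a, b)
```

Running this under Nsight Systems (e.g. `nsys profile julia script.jl`) overlays the "prepare inputs" and "matmul" ranges on the CUDA API and kernel rows, so gaps where the GPU sits idle between submissions become immediately visible.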