CUDA Profiler

I was running CUDA.@profile on one of my functions and noticed that a large portion of the time is missing from the report. I don't have an MWE, but I would still like to ask: in general, what are the usual suspects in cases like this?

As you can see from the output, less than 10% of the total time is accounted for. Are there any tools I can use to get a better idea of what is going wrong?

julia> CUDA.@profile BcdiCore.loss(state, true, true, false)
Profiler ran for 47.09 ms, capturing 3748 events.

Host-side activity: calling CUDA APIs took 1.08 ms (2.29% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                    │
├──────────┼────────────┼───────┼───────────────────────────────────────┼─────────────────────────┤
│    0.69% │  323.06 µs │    44 │   7.34 µs ± 4.96   (  4.05 ‥ 29.8)    │ cuLaunchKernel          │
│    0.19% │   91.55 µs │     3 │  30.52 µs ± 10.62  ( 18.84 ‥ 39.58)   │ cuMemcpyDtoHAsync       │
│    0.11% │   53.64 µs │     8 │   6.71 µs ± 2.59   (  4.29 ‥ 11.44)   │ cudaLaunchKernel        │
│    0.11% │   50.78 µs │     4 │   12.7 µs ± 6.65   (  5.96 ‥ 20.5)    │ cudaMemsetAsync         │
│    0.08% │   36.95 µs │     6 │   6.16 µs ± 3.54   (  3.58 ‥ 13.11)   │ cuMemAllocFromPoolAsync │
│    0.06% │   26.23 µs │     6 │   4.37 µs ± 2.73   (  2.15 ‥ 8.11)    │ cuStreamSynchronize     │
│    0.03% │   12.16 µs │     8 │   1.52 µs ± 1.9    (  0.24 ‥ 5.25)    │ cudaSetDevice           │
│    0.01% │    3.81 µs │     4 │ 953.67 ns ± 825.91 (238.42 ‥ 1668.93) │ cudaGetDevice           │
│    0.00% │  715.26 ns │     4 │ 178.81 ns ± 119.21 (   0.0 ‥ 238.42)  │ cuStreamIsCapturing     │
│    0.00% │  476.84 ns │     7 │  68.12 ns ± 116.34 (   0.0 ‥ 238.42)  │ cudaGetLastError        │
└──────────┴────────────┴───────┴───────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 2.85 ms (6.04% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                   │ Name                                                                                                                                                                                          ⋯
├──────────┼────────────┼───────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│    2.86% │    1.35 ms │     1 │                                     │ _ZN9cufinufft12spreadinterp22spread_3d_nupts_drivenIdLi1EEEvPKT_S4_S4_PKN17cuda_complex_implIS2_E4typeEPS7_iiiiiS2_S2_S2_PKii                                                                 ⋯
│    2.21% │    1.04 ms │     3 │ 346.18 µs ± 0.72   (345.47 ‥ 346.9) │ _ZN9cufinufft12spreadinterp22interp_3d_nupts_drivenIdLi1EEEvPKT_S4_S4_PN17cuda_complex_implIS2_E4typeEPKS7_iiiiiS2_S2_S2_Pii                                                                  ⋯
│    0.29% │  136.61 µs │     8 │  17.08 µs ± 0.25   ( 16.93 ‥ 17.64) │ _Z11regular_fftILj100E3EPTIJLj10EEELj16ELj6EL9padding_t6EL9twiddle_t0EL20loadstore_modifier_t2EL8layout_t1EjdEv18kernel_arguments_tIT7_E                                                      ⋯
│    0.15% │   69.86 µs │     4 │  17.46 µs ± 0.12   (  17.4 ‥ 17.64) │ _Z11regular_fftILj100E3EPTIJLj10EEELj16ELj5EL9padding_t6EL9twiddle_t0EL20loadstore_modifier_t2EL8layout_t0EjdEv18kernel_arguments_tIT7_E                                                      ⋯
│    0.09% │   43.15 µs │     4 │  10.79 µs ± 0.53   ( 10.25 ‥ 11.44) │ [set device memory]                                                                                                                                                                           ⋯
│    0.09% │   41.01 µs │     7 │   5.86 µs ± 0.47   (  5.25 ‥ 6.44)  │ _Z3_3515CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE11BroadcastedI12CuArrayStyleILi3E12DeviceMemoryE5TupleI5OneToI5Int64ES7_IS8_ES7_IS8_EE8identityS6_I8ExtrudedIS0_IS1_IS2_ELi ⋯
│    0.05% │   22.41 µs │     3 │   7.47 µs ± 0.14   (  7.39 ‥ 7.63)  │ _Z3_3515CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE11BroadcastedI12CuArrayStyleILi3E12DeviceMemoryE5TupleI5OneToI5Int64ES7_IS8_ES7_IS8_EE1_S6_I8ExtrudedIS0_IS1_IS2_ELi3ELi1EE ⋯
│    0.03% │   14.78 µs │     3 │   4.93 µs ± 0.14   (  4.77 ‥ 5.01)  │ _ZN9cufinufft10deconvolve10amplify_3dIdEEviiiiiiPN17cuda_complex_implIT_E4typeES6_PS3_S7_S7_                                                                                                  ⋯
│    0.03% │   13.83 µs │     1 │                                     │ _Z3_3515CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE11BroadcastedI12CuArrayStyleILi3E12DeviceMemoryE5TupleI5OneToI5Int64ES7_IS8_ES7_IS8_EE1_S6_IS3_IS4_ILi3ES5_EvS9_S6_IS8_8Ext ⋯
│    0.03% │   13.83 µs │     1 │                                     │ _Z22partial_mapreduce_grid8identity1_7Float6416CartesianIndicesILi3E5TupleI5OneToI5Int64ES4_IS5_ES4_IS5_EEES2_ILi3ES3_IS4_IS5_ES4_IS5_ES4_IS5_EEE3ValILitrueEE13CuDeviceArrayIS1_Li4ELi1EE11B ⋯
│    0.03% │   12.87 µs │     3 │   4.29 µs ± 0.0    (  4.29 ‥ 4.29)  │ _Z3_3415CuKernelContext13CuDeviceArrayI7Float64Li1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceMemoryE5TupleI5OneToI5Int64EE1_S5_I8ExtrudedIS0_IS1_Li1ELi1EES5_I4BoolES5_IS7_EES2_IS3_ILi1E ⋯
│    0.02% │   11.21 µs │     3 │   3.74 µs ± 0.14   (  3.58 ‥ 3.81)  │ _Z3_3415CuKernelContext13CuDeviceArrayI7Float64Li1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceMemoryE5TupleI5OneToI5Int64EE1_S5_I8ExtrudedIS0_IS1_Li1ELi1EES5_I4BoolES5_IS7_EES2_IS3_ILi1E ⋯
│    0.02% │   10.01 µs │     2 │   5.01 µs ± 0.0    (  5.01 ‥ 5.01)  │ _Z22partial_mapreduce_grid8identity1_7Float6416CartesianIndicesILi4E5TupleI5OneToI5Int64ES4_IS5_ES4_IS5_ES4_IS5_EEES2_ILi4ES3_IS4_IS5_ES4_IS5_ES4_IS5_ES4_IS5_EEE3ValILitrueEE13CuDeviceArray ⋯
│    0.02% │    9.54 µs │     1 │                                     │ _Z22partial_mapreduce_grid8identity1_5Int6416CartesianIndicesILi3E5TupleI5OneToIS1_ES4_IS1_ES4_IS1_EEES2_ILi3ES3_IS4_IS1_ES4_IS1_ES4_IS1_EEE3ValILitrueEE13CuDeviceArrayIS1_Li4ELi1EE11Broadc ⋯
│    0.02% │    9.54 µs │     3 │   3.18 µs ± 0.14   (   3.1 ‥ 3.34)  │ _Z3_3415CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceMemoryE5TupleI5OneToI5Int64EE1_S6_I8ExtrudedIS0_IS1_IS2_ELi1ELi1EES6_I4BoolES6_IS8 ⋯
│    0.02% │    8.82 µs │     1 │                                     │ _Z22partial_mapreduce_grid8identity1_7Float6416CartesianIndicesILi3E5TupleI5OneToI5Int64ES4_IS5_ES4_IS5_EEES2_ILi3ES3_IS4_IS5_ES4_IS5_ES4_IS5_EEE3ValILitrueEE13CuDeviceArrayIS1_Li4ELi1EE11B ⋯
│    0.02% │    8.58 µs │     1 │                                     │ _Z3_3515CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE11BroadcastedI12CuArrayStyleILi3E12DeviceMemoryE5TupleI5OneToI5Int64ES7_IS8_ES7_IS8_EE1_S6_I8ExtrudedIS0_IS1_IS2_ELi3ELi1EE ⋯
│    0.02% │    7.63 µs │     3 │   2.54 µs ± 0.14   (  2.38 ‥ 2.62)  │ _Z3_3415CuKernelContext13CuDeviceArrayI7Float64Li1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceMemoryE5TupleI5OneToI5Int64EE8identityS5_IS7_EES7_                                           ⋯
│    0.02% │    7.15 µs │     3 │   2.38 µs ± 0.24   (  2.15 ‥ 2.62)  │ [copy device to pageable memory]                                                                                                                                                              ⋯
│    0.01% │    5.72 µs │     1 │                                     │ _Z3_3515CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE11BroadcastedI12CuArrayStyleILi3E12DeviceMemoryE5TupleI5OneToI5Int64ES7_IS8_ES7_IS8_EE1_S6_IS2_8ExtrudedIS0_IS1_IS2_ELi3ELi ⋯
│    0.01% │    4.77 µs │     1 │                                     │ _ZN9cufinufft10deconvolve13deconvolve_3dIdEEviiiiiiPN17cuda_complex_implIT_E4typeES6_PS3_S7_S7_                                                                                               ⋯
│    0.01% │    4.77 µs │     1 │                                     │ _Z22partial_mapreduce_grid8identity1_5Int6416CartesianIndicesILi4E5TupleI5OneToIS1_ES4_IS1_ES4_IS1_ES4_IS1_EEES2_ILi4ES3_IS4_IS1_ES4_IS1_ES4_IS1_ES4_IS1_EEE3ValILitrueEE13CuDeviceArrayIS1_L ⋯
│    0.01% │    3.34 µs │     1 │                                     │ _Z3_3415CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceMemoryE5TupleI5OneToI5Int64EE8identityS6_I8ExtrudedIS0_IS1_IS2_ELi1ELi1EES6_I4Bool ⋯
└──────────┴────────────┴───────┴─────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                                                                                                                                                                                                       1 column omitted

Note that you can't simply add the two percentages together; they measure different things. On the host, 2.3% of the trace is spent calling CUDA APIs, which isn't necessarily bad if the CPU is expected to do other work concurrently with the GPU. Separately, the GPU itself is busy only 6% of the time. If you expect the operation to consist almost entirely of GPU work, that is a low number.

One common reason is that you're performing unrelated CPU work in between submitting GPU operations, failing to fully saturate the GPU. This can be worked around by increasing the size of the operations, or by minimizing / optimizing the amount of work done on the CPU. The Nsight Systems visual profiler can help here (check the CUDA.jl docs), because a timeline is much easier to interpret than a textual report. In addition, you can annotate your application with NVTX ranges (from NVTX.jl) to visualize on the timeline where time is spent, along the lines of the sketch below.
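
A minimal sketch of what that annotation could look like (assuming NVTX.jl is installed; `prepare_inputs` and `run_kernels` are hypothetical stand-ins for your own host- and device-side work):

using CUDA, NVTX

function annotated_loss(state)
    # Each range shows up as a named bar on the Nsight Systems timeline,
    # making it easy to see which phase leaves the GPU idle.
    NVTX.@range "prepare inputs" begin
        prepare_inputs(state)   # hypothetical CPU-side preprocessing
    end
    NVTX.@range "launch kernels" begin
        run_kernels(state)      # hypothetical GPU-side work
    end
end

Running Julia under the external profiler (e.g. `nsys launch julia`, then `CUDA.@profile external=true annotated_loss(state)` in the REPL) shows these ranges on the timeline; gaps between kernels point at CPU work that is keeping the GPU idle.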

Awesome! Using Nsight Systems really cleared things up for me. I had a mapreduce in the middle of my function that, I think, caused CUDA to synchronize, blocking the GPU from continuing until it was done.

On a side note, I couldn't get NVTX ranges to work. Do they create their own scope? I just got undefined variables when I added them.

Ah, you were probably reducing to a scalar then. If you instead reduce to a one-element array, the operation remains asynchronous, as sketched below. Also see Introduce `AsyncNumber` to lazily copy numeric `mapreduce` results to the host by pxl-th · Pull Request #550 · JuliaGPU/GPUArrays.jl · GitHub
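
For instance, a minimal sketch of the difference (with a hypothetical array `x`):

using CUDA

x = CUDA.rand(Float64, 256, 256, 256)

# Reducing to a scalar copies the result to the host, which forces a
# synchronization of the stream:
s = mapreduce(abs2, +, x)

# Reducing over all dims instead yields a 1×1×1 CuArray: the result stays
# on the device and the operation remains asynchronous:
t = mapreduce(abs2, +, x; dims=(1, 2, 3))

The device-side result can then be consumed by later GPU operations, and only fetched at the very end (e.g. with `Array(t)`).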

Yes, that is exactly right. I have updated it now, and the output of @profile is looking much better.