I was running CUDA.@profile
on one of my functions and noticed a large portion of time is missing from the report. I don’t have an MWE, but I would like to still ask. In general, what are the usual sources or suspects in cases like these.
As you can see from the output, less than 10% of the total time is accounted for. Are there any tools I can use to get a better idea for what is going wrong?
julia> CUDA.@profile BcdiCore.loss(state, true, true, false)
Profiler ran for 47.09 ms, capturing 3748 events.
Host-side activity: calling CUDA APIs took 1.08 ms (2.29% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼───────────────────────────────────────┼─────────────────────────┤
│ 0.69% │ 323.06 µs │ 44 │ 7.34 µs ± 4.96 ( 4.05 ‥ 29.8) │ cuLaunchKernel │
│ 0.19% │ 91.55 µs │ 3 │ 30.52 µs ± 10.62 ( 18.84 ‥ 39.58) │ cuMemcpyDtoHAsync │
│ 0.11% │ 53.64 µs │ 8 │ 6.71 µs ± 2.59 ( 4.29 ‥ 11.44) │ cudaLaunchKernel │
│ 0.11% │ 50.78 µs │ 4 │ 12.7 µs ± 6.65 ( 5.96 ‥ 20.5) │ cudaMemsetAsync │
│ 0.08% │ 36.95 µs │ 6 │ 6.16 µs ± 3.54 ( 3.58 ‥ 13.11) │ cuMemAllocFromPoolAsync │
│ 0.06% │ 26.23 µs │ 6 │ 4.37 µs ± 2.73 ( 2.15 ‥ 8.11) │ cuStreamSynchronize │
│ 0.03% │ 12.16 µs │ 8 │ 1.52 µs ± 1.9 ( 0.24 ‥ 5.25) │ cudaSetDevice │
│ 0.01% │ 3.81 µs │ 4 │ 953.67 ns ± 825.91 (238.42 ‥ 1668.93) │ cudaGetDevice │
│ 0.00% │ 715.26 ns │ 4 │ 178.81 ns ± 119.21 ( 0.0 ‥ 238.42) │ cuStreamIsCapturing │
│ 0.00% │ 476.84 ns │ 7 │ 68.12 ns ± 116.34 ( 0.0 ‥ 238.42) │ cudaGetLastError │
└──────────┴────────────┴───────┴───────────────────────────────────────┴─────────────────────────┘
Device-side activity: GPU was busy for 2.85 ms (6.04% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ 2.86% │ 1.35 ms │ 1 │ │ _ZN9cufinufft12spreadinterp22spread_3d_nupts_drivenIdLi1EEEvPKT_S4_S4_PKN17cuda_complex_implIS2_E4typeEPS7_iiiiiS2_S2_S2_PKii ⋯
│ 2.21% │ 1.04 ms │ 3 │ 346.18 µs ± 0.72 (345.47 ‥ 346.9) │ _ZN9cufinufft12spreadinterp22interp_3d_nupts_drivenIdLi1EEEvPKT_S4_S4_PN17cuda_complex_implIS2_E4typeEPKS7_iiiiiS2_S2_S2_Pii ⋯
│ 0.29% │ 136.61 µs │ 8 │ 17.08 µs ± 0.25 ( 16.93 ‥ 17.64) │ _Z11regular_fftILj100E3EPTIJLj10EEELj16ELj6EL9padding_t6EL9twiddle_t0EL20loadstore_modifier_t2EL8layout_t1EjdEv18kernel_arguments_tIT7_E ⋯
│ 0.15% │ 69.86 µs │ 4 │ 17.46 µs ± 0.12 ( 17.4 ‥ 17.64) │ _Z11regular_fftILj100E3EPTIJLj10EEELj16ELj5EL9padding_t6EL9twiddle_t0EL20loadstore_modifier_t2EL8layout_t0EjdEv18kernel_arguments_tIT7_E ⋯
│ 0.09% │ 43.15 µs │ 4 │ 10.79 µs ± 0.53 ( 10.25 ‥ 11.44) │ [set device memory] ⋯
│ 0.09% │ 41.01 µs │ 7 │ 5.86 µs ± 0.47 ( 5.25 ‥ 6.44) │ _Z3_3515CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE11BroadcastedI12CuArrayStyleILi3E12DeviceMemoryE5TupleI5OneToI5Int64ES7_IS8_ES7_IS8_EE8identityS6_I8ExtrudedIS0_IS1_IS2_ELi ⋯
│ 0.05% │ 22.41 µs │ 3 │ 7.47 µs ± 0.14 ( 7.39 ‥ 7.63) │ _Z3_3515CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE11BroadcastedI12CuArrayStyleILi3E12DeviceMemoryE5TupleI5OneToI5Int64ES7_IS8_ES7_IS8_EE1_S6_I8ExtrudedIS0_IS1_IS2_ELi3ELi1EE ⋯
│ 0.03% │ 14.78 µs │ 3 │ 4.93 µs ± 0.14 ( 4.77 ‥ 5.01) │ _ZN9cufinufft10deconvolve10amplify_3dIdEEviiiiiiPN17cuda_complex_implIT_E4typeES6_PS3_S7_S7_ ⋯
│ 0.03% │ 13.83 µs │ 1 │ │ _Z3_3515CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE11BroadcastedI12CuArrayStyleILi3E12DeviceMemoryE5TupleI5OneToI5Int64ES7_IS8_ES7_IS8_EE1_S6_IS3_IS4_ILi3ES5_EvS9_S6_IS8_8Ext ⋯
│ 0.03% │ 13.83 µs │ 1 │ │ _Z22partial_mapreduce_grid8identity1_7Float6416CartesianIndicesILi3E5TupleI5OneToI5Int64ES4_IS5_ES4_IS5_EEES2_ILi3ES3_IS4_IS5_ES4_IS5_ES4_IS5_EEE3ValILitrueEE13CuDeviceArrayIS1_Li4ELi1EE11B ⋯
│ 0.03% │ 12.87 µs │ 3 │ 4.29 µs ± 0.0 ( 4.29 ‥ 4.29) │ _Z3_3415CuKernelContext13CuDeviceArrayI7Float64Li1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceMemoryE5TupleI5OneToI5Int64EE1_S5_I8ExtrudedIS0_IS1_Li1ELi1EES5_I4BoolES5_IS7_EES2_IS3_ILi1E ⋯
│ 0.02% │ 11.21 µs │ 3 │ 3.74 µs ± 0.14 ( 3.58 ‥ 3.81) │ _Z3_3415CuKernelContext13CuDeviceArrayI7Float64Li1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceMemoryE5TupleI5OneToI5Int64EE1_S5_I8ExtrudedIS0_IS1_Li1ELi1EES5_I4BoolES5_IS7_EES2_IS3_ILi1E ⋯
│ 0.02% │ 10.01 µs │ 2 │ 5.01 µs ± 0.0 ( 5.01 ‥ 5.01) │ _Z22partial_mapreduce_grid8identity1_7Float6416CartesianIndicesILi4E5TupleI5OneToI5Int64ES4_IS5_ES4_IS5_ES4_IS5_EEES2_ILi4ES3_IS4_IS5_ES4_IS5_ES4_IS5_ES4_IS5_EEE3ValILitrueEE13CuDeviceArray ⋯
│ 0.02% │ 9.54 µs │ 1 │ │ _Z22partial_mapreduce_grid8identity1_5Int6416CartesianIndicesILi3E5TupleI5OneToIS1_ES4_IS1_ES4_IS1_EEES2_ILi3ES3_IS4_IS1_ES4_IS1_ES4_IS1_EEE3ValILitrueEE13CuDeviceArrayIS1_Li4ELi1EE11Broadc ⋯
│ 0.02% │ 9.54 µs │ 3 │ 3.18 µs ± 0.14 ( 3.1 ‥ 3.34) │ _Z3_3415CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceMemoryE5TupleI5OneToI5Int64EE1_S6_I8ExtrudedIS0_IS1_IS2_ELi1ELi1EES6_I4BoolES6_IS8 ⋯
│ 0.02% │ 8.82 µs │ 1 │ │ _Z22partial_mapreduce_grid8identity1_7Float6416CartesianIndicesILi3E5TupleI5OneToI5Int64ES4_IS5_ES4_IS5_EEES2_ILi3ES3_IS4_IS5_ES4_IS5_ES4_IS5_EEE3ValILitrueEE13CuDeviceArrayIS1_Li4ELi1EE11B ⋯
│ 0.02% │ 8.58 µs │ 1 │ │ _Z3_3515CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE11BroadcastedI12CuArrayStyleILi3E12DeviceMemoryE5TupleI5OneToI5Int64ES7_IS8_ES7_IS8_EE1_S6_I8ExtrudedIS0_IS1_IS2_ELi3ELi1EE ⋯
│ 0.02% │ 7.63 µs │ 3 │ 2.54 µs ± 0.14 ( 2.38 ‥ 2.62) │ _Z3_3415CuKernelContext13CuDeviceArrayI7Float64Li1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceMemoryE5TupleI5OneToI5Int64EE8identityS5_IS7_EES7_ ⋯
│ 0.02% │ 7.15 µs │ 3 │ 2.38 µs ± 0.24 ( 2.15 ‥ 2.62) │ [copy device to pageable memory] ⋯
│ 0.01% │ 5.72 µs │ 1 │ │ _Z3_3515CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE11BroadcastedI12CuArrayStyleILi3E12DeviceMemoryE5TupleI5OneToI5Int64ES7_IS8_ES7_IS8_EE1_S6_IS2_8ExtrudedIS0_IS1_IS2_ELi3ELi ⋯
│ 0.01% │ 4.77 µs │ 1 │ │ _ZN9cufinufft10deconvolve13deconvolve_3dIdEEviiiiiiPN17cuda_complex_implIT_E4typeES6_PS3_S7_S7_ ⋯
│ 0.01% │ 4.77 µs │ 1 │ │ _Z22partial_mapreduce_grid8identity1_5Int6416CartesianIndicesILi4E5TupleI5OneToIS1_ES4_IS1_ES4_IS1_ES4_IS1_EEES2_ILi4ES3_IS4_IS1_ES4_IS1_ES4_IS1_ES4_IS1_EEE3ValILitrueEE13CuDeviceArrayIS1_L ⋯
│ 0.01% │ 3.34 µs │ 1 │ │ _Z3_3415CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi1ELi1EE11BroadcastedI12CuArrayStyleILi1E12DeviceMemoryE5TupleI5OneToI5Int64EE8identityS6_I8ExtrudedIS0_IS1_IS2_ELi1ELi1EES6_I4Bool ⋯
└──────────┴────────────┴───────┴─────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 column omitted