GPU-Kernel Profiling

Hello, I’m currently profiling a Julia deep learning project that runs mainly on the GPU. Using nvprof, I get the following profiling result:

==7880== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   44.95%  674.56ms    200000  3.3720us  2.6560us  13.408us  julia_getindex_kernel_6568(CuKernelContext, CuDeviceArray<Float32, int=1, int=1>, CuDeviceArray<Float32, int=2, int=1>, Tuple<Int64, CuDeviceArray<Float32, int=2, int=1>>, Slice<OneTo<CuDeviceArray<Float32, int=2, int=1>>>, CuDeviceArray<Float32, int=2, int=1>)
                   30.07%  451.22ms    300011  1.5040us  1.3440us  10.848us  [CUDA memcpy DtoH]
                   22.75%  341.41ms    100000  3.4140us  2.6560us  8.1270us  julia_getindex_kernel_6263(CuKernelContext, CuDeviceArray<Float64, int=1, int=1>, CuDeviceArray<Float64, int=2, int=1>, Tuple<Int64, CuDeviceArray<Float64, int=2, int=1>>, Slice<OneTo<CuDeviceArray<Float64, int=2, int=1>>>, CuDeviceArray<Float64, int=2, int=1>)
                    1.85%  27.809ms      1000  27.808us  22.048us  32.896us  julia_kernel_test_brain_step_5260(CuDeviceArray<Float32, int=2, int=1>, CuDeviceArray<Float64, int=2, int=1>, GatedRecurrentUnitNN<CuDeviceArray<Float32, int=3, int=1>, CuDeviceArray<Float32, int=3, int=1>, CuDeviceArray<Float32, int=2, int=1>, CuDeviceArray<Float32, int=3, int=1>, CuDeviceArray<Float32, int=2, int=1>>)
                    0.37%  5.5779ms      2001  2.7870us  2.2070us  81.407us  [CUDA memcpy HtoD]
                    0.00%  57.248us         1  57.248us  57.248us  57.248us  julia_kernel_test_brain_initialize_2989(CuDeviceArray<Float64, int=2, int=1>, GatedRecurrentUnitNN<CuDeviceArray<Float32, int=3, int=1>, CuDeviceArray<GatedRecurrentUnitNN, int=3, int=1>, CuDeviceArray<GatedRecurrentUnitNN, int=2, int=1>, CuDeviceArray<GatedRecurrentUnitNN, int=3, int=1>, CuDeviceArray<GatedRecurrentUnitNN, int=2, int=1>>)
                    0.00%  26.143us        12  2.1780us  1.1200us  4.1920us  [CUDA memset]
      API calls:   36.85%  11.1219s    300011  37.071us  27.100us  2.5703ms  cuMemcpyDtoHAsync
                   35.68%  10.7701s  12089100     890ns     100ns  1.1863ms  cuStreamQuery
                    9.86%  2.97444s    301001  9.8810us  7.9000us  2.3312ms  cuLaunchKernel
                    6.73%  2.03219s    301012  6.7510us  5.8000us  420.70us  cuStreamSynchronize
                    6.03%  1.82059s  14200232     128ns       0ns  1.1837ms  cuCtxGetCurrent
                    2.35%  709.92ms    302013  2.3500us     800ns  4.3831ms  cuMemAllocAsync
                    0.89%  269.16ms    302013     891ns     500ns  637.80us  cuMemFreeAsync
                    0.57%  171.06ms    302012     566ns     200ns  489.70us  cuPointerGetAttribute
                    0.49%  147.00ms    300000     489ns     300ns  362.90us  cuOccupancyMaxPotentialBlockSize
                    0.42%  127.71ms         1  127.71ms  127.71ms  127.71ms  cuDevicePrimaryCtxRetain
                    0.07%  20.049ms      1020  19.655us  5.5000us  302.10us  cuLaunchHostFunc
                    0.03%  9.8587ms      2001  4.9260us  3.2000us  122.50us  cuMemcpyHtoDAsync
                    0.02%  6.0101ms         4  1.5025ms  292.60us  5.0332ms  cuModuleLoadDataEx
                    0.00%  573.00us         4  143.25us  6.9000us  545.90us  cuMemHostAlloc
                    0.00%  410.50us         4  102.63us  23.000us  283.60us  cuModuleUnload
                    0.00%  111.30us        12  9.2750us  1.5000us  49.600us  cuMemsetD32Async
                    0.00%  52.200us         5  10.440us  6.1000us  22.600us  cuCtxPopCurrent
                    0.00%  39.100us         1  39.100us  39.100us  39.100us  cuStreamDestroy
                    0.00%  36.100us         8  4.5120us     600ns  10.900us  cuCtxSynchronize
                    0.00%  34.700us         1  34.700us  34.700us  34.700us  cuStreamCreate
                    0.00%  26.100us         1  26.100us  26.100us  26.100us  cuDeviceGetMemPool
                    0.00%  16.200us        11  1.4720us     100ns  4.0000us  cuDeviceGetCount
                    0.00%  14.200us        30     473ns     100ns  5.3000us  cuDeviceGetAttribute
                    0.00%  9.4000us         4  2.3500us     500ns  7.7000us  cuMemHostGetDevicePointer
                    0.00%  6.4000us        11     581ns     100ns  2.1000us  cuDriverGetVersion
                    0.00%  5.2000us         4  1.3000us     900ns  2.2000us  cuModuleGetFunction
                    0.00%  5.2000us         5  1.0400us     100ns  2.6000us  cuCtxPushCurrent
                    0.00%  3.0000us         4     750ns     200ns  2.0000us  cuDeviceGet
                    0.00%  2.0000us         1  2.0000us  2.0000us  2.0000us  cuCtxSetCurrent

More than half of the GPU activity is spent in two “julia_getindex_kernel_xxxx” kernels. I guess this is caused by using CPU memory inside the kernel function, but I have little experience with kernel programming. If anyone is experienced with GPU profiling, any insight into where these kernels come from would be appreciated.

You are not using CPU memory in that kernel. The problem is more likely that you’re launching this kernel 200,000 times and copying memory from the GPU roughly 300,000 times (the [CUDA memcpy DtoH] line). You should try to fuse these operations into fewer kernel launches.
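
For illustration, here is a minimal sketch (with made-up array names, not your actual code) of the kind of pattern that shows up as hundreds of thousands of getindex kernels and DtoH copies, plus a fused alternative. In CUDA.jl, every slice like W[:, i] launches its own small kernel, and pulling a scalar result back to the CPU on each iteration triggers a device-to-host memcpy:

using CUDA

W = CUDA.rand(Float32, 128, 128)
x = CUDA.rand(Float32, 128)

# One getindex kernel plus one DtoH copy per loop iteration:
function stepwise(W, x)
    acc = 0.0f0
    for i in 1:size(W, 2)
        col = W[:, i]            # separate getindex kernel launch each time
        acc += sum(col .* x)     # scalar result copied back to the host each time
    end
    return acc
end

# Fused alternative: one matrix-vector product and one reduction,
# i.e. a handful of kernel launches and a single copy at the end.
fused(W, x) = sum(W' * x)

The exact fix depends on what your loop does, but the general idea is the same: express the whole computation as array/broadcast operations (or one hand-written kernel) instead of indexing into device arrays slice by slice inside a host-side loop.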