Hello, I’m currently profiling a Julia deep learning project that runs mainly on the GPU. I profile with nvprof, invoked roughly like this (the script name and project flag are placeholders for my actual setup):
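```
nvprof julia --project=. my_script.jl
```

This gives the following profiling result: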
==7880== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 44.95% 674.56ms 200000 3.3720us 2.6560us 13.408us julia_getindex_kernel_6568(CuKernelContext, CuDeviceArray<Float32, int=1, int=1>, CuDeviceArray<Float32, int=2, int=1>, Tuple<Int64, CuDeviceArray<Float32, int=2, int=1>>, Slice<OneTo<CuDeviceArray<Float32, int=2, int=1>>>, CuDeviceArray<Float32, int=2, int=1>)
30.07% 451.22ms 300011 1.5040us 1.3440us 10.848us [CUDA memcpy DtoH]
22.75% 341.41ms 100000 3.4140us 2.6560us 8.1270us julia_getindex_kernel_6263(CuKernelContext, CuDeviceArray<Float64, int=1, int=1>, CuDeviceArray<Float64, int=2, int=1>, Tuple<Int64, CuDeviceArray<Float64, int=2, int=1>>, Slice<OneTo<CuDeviceArray<Float64, int=2, int=1>>>, CuDeviceArray<Float64, int=2, int=1>)
1.85% 27.809ms 1000 27.808us 22.048us 32.896us julia_kernel_test_brain_step_5260(CuDeviceArray<Float32, int=2, int=1>, CuDeviceArray<Float64, int=2, int=1>, GatedRecurrentUnitNN<CuDeviceArray<Float32, int=3, int=1>, CuDeviceArray<Float32, int=3, int=1>, CuDeviceArray<Float32, int=2, int=1>, CuDeviceArray<Float32, int=3, int=1>, CuDeviceArray<Float32, int=2, int=1>>)
0.37% 5.5779ms 2001 2.7870us 2.2070us 81.407us [CUDA memcpy HtoD]
0.00% 57.248us 1 57.248us 57.248us 57.248us julia_kernel_test_brain_initialize_2989(CuDeviceArray<Float64, int=2, int=1>, GatedRecurrentUnitNN<CuDeviceArray<Float32, int=3, int=1>, CuDeviceArray<GatedRecurrentUnitNN, int=3, int=1>, CuDeviceArray<GatedRecurrentUnitNN, int=2, int=1>, CuDeviceArray<GatedRecurrentUnitNN, int=3, int=1>, CuDeviceArray<GatedRecurrentUnitNN, int=2, int=1>>)
0.00% 26.143us 12 2.1780us 1.1200us 4.1920us [CUDA memset]
API calls: 36.85% 11.1219s 300011 37.071us 27.100us 2.5703ms cuMemcpyDtoHAsync
35.68% 10.7701s 12089100 890ns 100ns 1.1863ms cuStreamQuery
9.86% 2.97444s 301001 9.8810us 7.9000us 2.3312ms cuLaunchKernel
6.73% 2.03219s 301012 6.7510us 5.8000us 420.70us cuStreamSynchronize
6.03% 1.82059s 14200232 128ns 0ns 1.1837ms cuCtxGetCurrent
2.35% 709.92ms 302013 2.3500us 800ns 4.3831ms cuMemAllocAsync
0.89% 269.16ms 302013 891ns 500ns 637.80us cuMemFreeAsync
0.57% 171.06ms 302012 566ns 200ns 489.70us cuPointerGetAttribute
0.49% 147.00ms 300000 489ns 300ns 362.90us cuOccupancyMaxPotentialBlockSize
0.42% 127.71ms 1 127.71ms 127.71ms 127.71ms cuDevicePrimaryCtxRetain
0.07% 20.049ms 1020 19.655us 5.5000us 302.10us cuLaunchHostFunc
0.03% 9.8587ms 2001 4.9260us 3.2000us 122.50us cuMemcpyHtoDAsync
0.02% 6.0101ms 4 1.5025ms 292.60us 5.0332ms cuModuleLoadDataEx
0.00% 573.00us 4 143.25us 6.9000us 545.90us cuMemHostAlloc
0.00% 410.50us 4 102.63us 23.000us 283.60us cuModuleUnload
0.00% 111.30us 12 9.2750us 1.5000us 49.600us cuMemsetD32Async
0.00% 52.200us 5 10.440us 6.1000us 22.600us cuCtxPopCurrent
0.00% 39.100us 1 39.100us 39.100us 39.100us cuStreamDestroy
0.00% 36.100us 8 4.5120us 600ns 10.900us cuCtxSynchronize
0.00% 34.700us 1 34.700us 34.700us 34.700us cuStreamCreate
0.00% 26.100us 1 26.100us 26.100us 26.100us cuDeviceGetMemPool
0.00% 16.200us 11 1.4720us 100ns 4.0000us cuDeviceGetCount
0.00% 14.200us 30 473ns 100ns 5.3000us cuDeviceGetAttribute
0.00% 9.4000us 4 2.3500us 500ns 7.7000us cuMemHostGetDevicePointer
0.00% 6.4000us 11 581ns 100ns 2.1000us cuDriverGetVersion
0.00% 5.2000us 4 1.3000us 900ns 2.2000us cuModuleGetFunction
0.00% 5.2000us 5 1.0400us 100ns 2.6000us cuCtxPushCurrent
0.00% 3.0000us 4 750ns 200ns 2.0000us cuDeviceGet
0.00% 2.0000us 1 2.0000us 2.0000us 2.0000us cuCtxSetCurrent
More than half of the GPU activity is spent in two “julia_getindex_kernel_xxxx” kernels. I suspect this is caused by using CPU memory in the kernel function, but I have little experience with kernel programming. If anyone is experienced with GPU profiling, any details on where these kernels come from would be appreciated.
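For context, here is a minimal sketch of the kind of host-side array access I suspect is responsible (the array name, sizes, and loop bounds are made up for illustration; my real code sits inside a larger simulation loop):

```julia
using CUDA

# Hypothetical stand-in for the per-step state access in my project.
states = CUDA.rand(Float32, 100, 64)   # Float32 state matrix on the GPU

for step in 1:1000, agent in 1:100
    row = states[agent, :]    # non-scalar indexing of a CuArray; I assume this
                              # is what shows up as julia_getindex_kernel_xxxx
    val = Array(row)[1]       # copying the slice back to the host, which would
                              # explain the large number of DtoH memcpys
end
```

If this is indeed the pattern that generates these kernels, I would also like to know whether the slicing itself or the copies back to the host are the expensive part.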