I gave the profiler a try (it's the first time I've used it).
Here are the trimmed-down results for the forward pass on a batch of 512 images:
==11960== Profiling application: julia
==11960== Profiling result:
           Type  Time(%)      Time  Calls       Avg       Min       Max  Name
GPU activities:   34.19%  2.03996s    232  8.7929ms  1.4400us  127.79ms  [CUDA memcpy HtoD]
                  16.30%  972.30ms     67  14.512ms  2.0727ms  134.15ms  ptxcall_anonymous25_1
                  11.07%  660.39ms     30  22.013ms  14.045ms  27.541ms  void cudnn::detail::implicit_convolve_sgemm
                  10.03%  598.66ms     12  49.888ms  49.718ms  50.754ms  void cudnn::detail::implicit_convolve_sgemm
                   8.66%  516.77ms     60  8.6128ms  1.7418ms  55.664ms  ptxcall_anonymous25_4
                   6.30%  376.12ms     15  25.075ms  16.165ms  31.079ms  void cudnn::detail::implicit_convolve_sgemm
                   5.91%  352.84ms      5  70.568ms  49.790ms  99.667ms  void cudnn::detail::implicit_convolve_sgemm
     API calls:   38.19%  6.88944s    563  12.237ms  5.8000us  232.92ms  cuMemAlloc
                  25.29%  4.56288s    262  17.416ms  9.9000us  317.13ms  cuMemFree
                  15.71%  2.83392s      8  354.24ms  1.0000us  2.83391s  cudaStreamCreateWithFlags
                  11.00%  1.98506s    230  8.6307ms  20.200us  32.151ms  cuMemcpyHtoD
                   4.68%  844.44ms      7  120.63ms     600ns  608.43ms  cudaFree
                   3.95%  712.55ms     10  71.255ms  932.20us  123.56ms  cuModuleLoadDataEx
                   1.08%  194.43ms      1  194.43ms  194.43ms  194.43ms  cuDevicePrimaryCtxRetain