An implementation of ResNet-18 uses a lot of GPU memory

I gave the profiler a try (it’s the first time I’ve used it).
Here are the “trimmed down” results for the forward pass on a batch of 512 images:

==11960== Profiling application: julia
==11960== Profiling result:
 Type             Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   34.19%  2.03996s       232  8.7929ms  1.4400us  127.79ms  [CUDA memcpy HtoD]
                   16.30%  972.30ms        67  14.512ms  2.0727ms  134.15ms  ptxcall_anonymous25_1
                   11.07%  660.39ms        30  22.013ms  14.045ms  27.541ms  void cudnn::detail::implicit_convolve_sgemm
                   10.03%  598.66ms        12  49.888ms  49.718ms  50.754ms  void cudnn::detail::implicit_convolve_sgemm
                    8.66%  516.77ms        60  8.6128ms  1.7418ms  55.664ms  ptxcall_anonymous25_4
                    6.30%  376.12ms        15  25.075ms  16.165ms  31.079ms  void cudnn::detail::implicit_convolve_sgemm
                    5.91%  352.84ms         5  70.568ms  49.790ms  99.667ms  void cudnn::detail::implicit_convolve_sgemm
      API calls:   38.19%  6.88944s       563  12.237ms  5.8000us  232.92ms  cuMemAlloc
                   25.29%  4.56288s       262  17.416ms  9.9000us  317.13ms  cuMemFree
                   15.71%  2.83392s         8  354.24ms  1.0000us  2.83391s  cudaStreamCreateWithFlags
                   11.00%  1.98506s       230  8.6307ms  20.200us  32.151ms  cuMemcpyHtoD
                    4.68%  844.44ms         7  120.63ms     600ns  608.43ms  cudaFree
                    3.95%  712.55ms        10  71.255ms  932.20us  123.56ms  cuModuleLoadDataEx
                    1.08%  194.43ms         1  194.43ms  194.43ms  194.43ms  cuDevicePrimaryCtxRetain
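
To get a rough feel for where the API time goes, the rows above can be tallied with a few lines. This is only arithmetic on the numbers copied from the "API calls" section of the output; Python is used purely as a calculator, and the `api_calls` and `summarize` names are made up for this sketch. Since the listing is trimmed, the share is relative to the listed rows only.

```python
# Figures copied from the "API calls" rows of the nvprof output above.
# name -> (time in seconds, number of calls)
api_calls = {
    "cuMemAlloc": (6.88944, 563),
    "cuMemFree": (4.56288, 262),
    "cudaStreamCreateWithFlags": (2.83392, 8),
    "cuMemcpyHtoD": (1.98506, 230),
    "cudaFree": (0.84444, 7),
    "cuModuleLoadDataEx": (0.71255, 10),
    "cuDevicePrimaryCtxRetain": (0.19443, 1),
}


def summarize(rows, names):
    """Return (total seconds, total calls) for the selected API names."""
    seconds = sum(rows[n][0] for n in names)
    calls = sum(rows[n][1] for n in names)
    return seconds, calls


# Time spent allocating and freeing device memory.
alloc_free_s, alloc_free_calls = summarize(api_calls, ["cuMemAlloc", "cuMemFree"])

# Total over the rows that survived the trimming (not the full profile).
listed_total_s = sum(t for t, _ in api_calls.values())

print(f"cuMemAlloc + cuMemFree: {alloc_free_s:.2f} s over {alloc_free_calls} calls")
print(f"share of the listed API time: {alloc_free_s / listed_total_s:.1%}")
```

By this tally, allocating and freeing device memory accounts for well over half of the listed API time, which matches the 38.19% and 25.29% reported in the table.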