Yes, I was talking about host memory.
So after several weeks of experiments, I’ve observed the following:
- Increasing the batch size seems to alleviate the problem for smaller networks. For example, after increasing the batch size from 16 to 256 for ResNet18, memory usage periodically drops back down after several epochs of gradual growth. Maybe large batches put more pressure on the GC? Unfortunately, this won't work for larger models like ResNet50, where the largest batch size I can use is 32 before hitting an OOM error.
- After training for 50 epochs on ImageNet, I found that transformers are affected too, but to a much smaller degree than convolutional models. For example, memory consumption with ViT-B only increases by a few GB over 10 epochs, compared to ResNet50, which grows by over 40 GB in the same period.
- VRAM usage remains constant from start to finish, while host memory grows steadily. At a certain point, GPU utilization drops significantly and CUDA.@time shows that memory management is taking up a significant chunk of each call (see the timings below). Unless I use a very large batch size, host memory keeps growing until the system is forced into swap. Note that the slowdown occurs before that point, so it isn't due to slow disk reads.
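For context, this is roughly what that measurement looks like inside the training loop. It's only a sketch: `model` and `loader` are placeholders for my actual model and data pipeline, not the exact code I ran.

```julia
using CUDA

# Only the forward pass sits inside CUDA.@time, so the gc time / memmgmt time
# percentages refer to that call alone. `model` and `loader` are placeholders
# for my actual model and data pipeline.
for (x, _) in loader
    x = cu(x)              # move the batch to the GPU
    CUDA.@time model(x)    # forward pass only
end
```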
Here are the results of calling CUDA.@time on a forward pass at the start of training:
```
0.039036 seconds (50.43 k CPU allocations: 4.101 MiB, 10.63% gc time) (633 GPU allocations: 4.283 GiB, 5.60% memmgmt time)
0.041121 seconds (50.43 k CPU allocations: 4.098 MiB, 16.88% gc time) (633 GPU allocations: 4.283 GiB, 5.21% memmgmt time)
0.041562 seconds (50.43 k CPU allocations: 4.095 MiB, 16.48% gc time) (633 GPU allocations: 4.283 GiB, 5.11% memmgmt time)
0.039033 seconds (50.43 k CPU allocations: 4.089 MiB, 9.66% gc time) (633 GPU allocations: 4.283 GiB, 5.12% memmgmt time)
0.041621 seconds (50.43 k CPU allocations: 4.101 MiB, 16.08% gc time) (633 GPU allocations: 4.283 GiB, 5.13% memmgmt time)
0.041593 seconds (50.43 k CPU allocations: 4.098 MiB, 16.70% gc time) (633 GPU allocations: 4.283 GiB, 4.79% memmgmt time)
0.038278 seconds (50.43 k CPU allocations: 4.095 MiB, 9.60% gc time) (633 GPU allocations: 4.283 GiB, 5.24% memmgmt time)
0.041335 seconds (50.43 k CPU allocations: 4.083 MiB, 16.41% gc time) (633 GPU allocations: 4.283 GiB, 4.77% memmgmt time)
```
And this is after 10 epochs:
```
0.511448 seconds (50.63 k CPU allocations: 4.139 MiB, 91.59% gc time) (633 GPU allocations: 4.282 GiB, 94.36% memmgmt time)
0.200058 seconds (50.43 k CPU allocations: 4.123 MiB, 79.52% gc time) (633 GPU allocations: 4.282 GiB, 1.73% memmgmt time)
0.436915 seconds (51.15 k CPU allocations: 4.150 MiB, 91.40% gc time) (633 GPU allocations: 4.282 GiB, 62.98% memmgmt time)
0.200881 seconds (50.77 k CPU allocations: 4.134 MiB, 79.05% gc time) (633 GPU allocations: 4.282 GiB, 82.68% memmgmt time)
0.229609 seconds (50.76 k CPU allocations: 4.131 MiB, 82.70% gc time) (633 GPU allocations: 4.282 GiB, 1.25% memmgmt time)
0.179896 seconds (50.77 k CPU allocations: 4.130 MiB, 77.77% gc time) (633 GPU allocations: 4.282 GiB, 79.87% memmgmt time)
0.172918 seconds (50.75 k CPU allocations: 4.125 MiB, 77.78% gc time) (633 GPU allocations: 4.282 GiB, 1.52% memmgmt time)
```
Here’s the output of nvtop at the start of training and after 10 epochs on ImageNet with ViT-B:

[nvtop screenshots: start of training vs. after 10 epochs]
As you can see, GPU memory is the same, but host memory has grown from 3579 MB to 10254 MB. The results are much worse with convolutional models, which force me to restart the process every couple of epochs because of the memory growth and training slowdown.
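For what it's worth, this is roughly how the same host-side number could be logged from inside the training loop instead of watching nvtop. Again just a sketch: `train_one_epoch!`, `model`, and `loader` are placeholders, and Sys.maxrss() reports the process's peak resident set size rather than the current one, which is still enough to confirm the steady growth.

```julia
using CUDA

# Hypothetical logging helpers: Sys.maxrss() returns the peak resident set
# size of the Julia process in bytes; CUDA.available_memory() returns the
# free device memory in bytes.
host_peak_mb() = Sys.maxrss() ÷ 2^20
gpu_free_mb()  = CUDA.available_memory() ÷ 2^20

for epoch in 1:10
    train_one_epoch!(model, loader)   # placeholder for the actual training loop
    @info "epoch $epoch" host_peak_MB = host_peak_mb() gpu_free_MB = gpu_free_mb()
end
```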

