I’m attempting to reproduce ResNet (from Metalhead.jl) ImageNet performance through simple demos on a single-GPU setup (an RTX A4000, 16 GB, ~6000 CUDA cores) and stumbled upon two memory-related challenges.
A first concern is that the ability to train on large batches is limited compared to other frameworks. For example, in this Gluon/MXNet tutorial, ResNet-50 is trained with a batch size of 64 per GPU (256 split across 4 GPUs) with 12 GB each. In the following smaller reproducible example, resnet-base-curand.jl, the batch size had to be limited to 20, which is over 3X smaller despite having 16 GB of memory rather than 12 GB. Although I’m aware that Julia isn’t currently super frugal with memory management, are such limitations also expected in GPU code, where the bulk of the work should be cuDNN wrappers around convolution operators?
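For reference, this is roughly how I probe the ceiling; a hedged sketch (the helper name and structure are mine, not from the linked script, and it assumes Metalhead ≥ 0.7’s `ResNet(50)` constructor):

```julia
using Flux, Metalhead, CUDA

# Illustrative helper (not from resnet-base-curand.jl): run one
# forward/backward pass of ResNet-50 on random data at batch size n,
# then report device memory.
function probe_batchsize(n)
    model = ResNet(50) |> gpu
    x = CUDA.randn(Float32, 224, 224, 3, n)
    y = gpu(Float32.(Flux.onehotbatch(rand(1:1000, n), 1:1000)))
    Flux.gradient(m -> Flux.logitcrossentropy(m(x), y), model)
    CUDA.memory_status()  # prints used/free device memory after the pass
end

probe_batchsize(20)  # about the ceiling on this 16 GB card before OOM
```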
Another issue is the build-up of CPU RAM during training. It happens only when the program performs both the image read/transform and the Flux gradient pass. For example, there’s no RAM issue when performing only the image-loading step, as in test-loader.jl, nor in the above Flux-only steps. I could thus only reproduce it with resnet-base.jl, which requires having ImageNet available on the machine.
CPU RAM grows steadily through the batches, starting from around 12 GB up to the full 64 GB available after roughly 1000 batches.
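For what it’s worth, the growth is easy to chart by logging resident memory as batches go by; a one-line sketch (the helper is mine, `Sys.maxrss()` is from Base):

```julia
# Log the process RSS high-water mark every `every` batches;
# Sys.maxrss() returns bytes and only ever grows, which is enough
# to spot a steady climb like the one described above.
log_rss(batch; every = 100) =
    batch % every == 0 &&
        @info "host memory" batch rss_GB = round(Sys.maxrss() / 1024^3; digits = 1)
```

Dropping a `log_rss(batch)` call into the training loop is how a climb from ~12 GB toward 64 GB shows up.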
I’m unclear whether such a memory leak more likely concerns DataLoaders or Zygote: on one hand, the model’s gradient pass should not involve much CPU; on the other, the test running only the DataLoaders without any Flux model works fine. Also, CPU memory usage doesn’t seem to build up when launching with a single thread, as opposed to the 6-8 threads needed for decent training speed. The caveat can be avoided by adding
batch % 200 == 0 && GC.gc(true)
within the loop for (batch, (x, y)) in enumerate(CuIterator(dtrain)), though having to add such a step seems anomalous. Could this issue be related to some of the GC topics discussed at JuliaCon recently?
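To show where the workaround sits, here is a hedged sketch of the loop; the Chain, optimiser, and random batches are toy stand-ins for the real model and ImageNet loader in resnet-base.jl (this toy loop won’t itself reproduce the leak), and it uses Flux’s newer explicit-gradient API:

```julia
using Flux, CUDA

# Toy stand-ins for the real model and data pipeline in resnet-base.jl.
model = Chain(Conv((3, 3), 3 => 8, relu; pad = 1), Flux.flatten,
              Dense(8 * 32 * 32 => 10)) |> gpu
opt_state = Flux.setup(Adam(1e-3), model)
dtrain = [(randn(Float32, 32, 32, 3, 16),
           Float32.(Flux.onehotbatch(rand(1:10, 16), 1:10))) for _ in 1:400]

for (batch, (x, y)) in enumerate(CuIterator(dtrain))
    gs = Flux.gradient(m -> Flux.logitcrossentropy(m(x), y), model)
    Flux.update!(opt_state, model, gs[1])
    batch % 200 == 0 && GC.gc(true)  # the workaround: full GC every 200 batches
end
```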
Finally, I’ve yet to match published results on ResNet-34 and ResNet-50, although I got fairly close with 66%-68% top-1 accuracy. I’d be curious to know whether anyone has succeeded on a single-GPU setup and has a reproducible script. Fitting such a model feels like a 101 exercise for a DL framework, so I’d be interested to see a reproducible Flux recipe for it.
For info, one epoch of ResNet-34 takes 3600 s, which I consider good, though it climbs to 7200 s with ResNet-50, likely due to the limited batch size.