An implementation of ResNet-18 uses a lot of GPU memory

In my case, your Flux implementation takes around 7 min per epoch with a batch size of 64, but my GPU might not be as fast as yours. It's fully busy, at 100% utilization.
Does TensorFlow take 6 min per epoch, or 6 min in total?

Edit: are you using FP16 on the RTX 2070?