There’s interesting work being done to scale NNs down, not just up (as with GPT-3), both for NLP and computer vision. Still, GPT-3 is huge (ALBERT, which I mentioned, is much smaller), so multi-GPU seems needed for sure (at least for good NLP right now).
I’m curious: if the network itself doesn’t need to be that big (say it fits in a single GPU’s memory), but the problem is the dataset/training time, what happens if you split the data 2 or N ways and train independently? Can you, in general (or say for images only), combine two such trained networks? Isn’t that what people call minibatching? I could see it maybe not working for NLP.
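To make the question concrete, here’s a rough, hypothetical sketch (PyTorch) of what “combining” two independently trained copies of the same network could mean, namely naively averaging their parameters. This is different from synchronous data parallelism, where gradients are averaged every step and the effect is more like one larger minibatch. The helper name and the no-arg constructor are my assumptions, not anything from a library:

    import torch
    import torch.nn as nn

    def average_models(model_a: nn.Module, model_b: nn.Module) -> nn.Module:
        # Hypothetical helper: assumes both models share an architecture
        # with a no-argument constructor.
        merged = type(model_a)()
        sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
        # Element-wise average of every parameter/buffer.
        merged.load_state_dict({k: (sd_a[k] + sd_b[k]) / 2 for k in sd_a})
        return merged

    # e.g. each model trained on its own half of the dataset:
    # merged = average_models(trained_on_shard_0, trained_on_shard_1)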
And can you simply use:
"Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet."
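For reference, here’s a minimal sketch of how Horovod is typically used with PyTorch, based on its documented examples. The toy model and synthetic data are just placeholders I made up, and it assumes one GPU per process, launched with something like `horovodrun -np 2 python train.py`:

    import torch
    import horovod.torch as hvd

    hvd.init()                                   # one process per GPU/worker
    torch.cuda.set_device(hvd.local_rank())

    # Toy model and synthetic data, only to keep the sketch self-contained.
    model = torch.nn.Linear(10, 2).cuda()
    dataset = torch.utils.data.TensorDataset(torch.randn(1024, 10),
                                             torch.randint(0, 2, (1024,)))

    # Each worker sees a different shard of the data...
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset, num_replicas=hvd.size(), rank=hvd.rank())
    loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
    # ...and Horovod averages gradients across workers every step.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)

    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.cuda(), y.cuda()
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()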
E.g. Julia has by now (and for a long time, even if it’s not the most popular framework, for Julia or otherwise) official support for MXNet, and as I posted, there’s a PyTorch wrapper, while the TensorFlow one is a bit outdated.