A recent algorithm developed by Rice University and Intel claims a ~3x speedup running on CPU alone compared to an Nvidia V100, and a 10x speedup over TensorFlow on CPU.
A lot of the details are over my head, so I won’t even attempt to describe it. (Not a very useful starting point. Apologies for that.)
Has anybody already looked into it, and into whether a Julia reimplementation would be useful, given that it seems it would depart from the current focus on autodiff?
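For what it's worth, my rough reading (caveat: this is my own toy interpretation, not the authors' code, and all names below are made up) is that the core trick is locality-sensitive hashing over each neuron's weight vector, so a given input only computes activations for the handful of neurons it hash-collides with, instead of the full dense layer. A minimal SimHash-style sketch of that idea:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_neurons, n_bits = 16, 1000, 8          # toy sizes, chosen arbitrarily

W = rng.standard_normal((n_neurons, d))      # one weight vector per neuron
planes = rng.standard_normal((n_bits, d))    # random hyperplanes for SimHash

def simhash(v):
    # Sign pattern of v against the random hyperplanes -> integer bucket id.
    bits = (planes @ v) > 0
    return int(bits.dot(1 << np.arange(n_bits)))

# Index every neuron into a hash table: bucket id -> list of neuron ids.
table = {}
for i, w in enumerate(W):
    table.setdefault(simhash(w), []).append(i)

def sparse_forward(x):
    # Only compute dot products for neurons in the same bucket as x --
    # these are the neurons whose weight vectors point in a similar
    # direction, i.e. the ones likely to have large activations.
    active = table.get(simhash(x), [])
    return active, W[active] @ x

x = rng.standard_normal(d)
active, acts = sparse_forward(x)
```

With 8 hash bits there are 256 buckets, so on average only ~4 of the 1000 neurons get touched per input, which is (as I understand it) where the CPU speedup comes from. The real system presumably uses multiple tables and rehashing during training, so take this only as a sketch of the idea.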