A recent algorithm developed by Rice University and Intel claims a ~3x speedup running on CPU alone compared to an Nvidia V100, and a 10x speedup over TensorFlow on CPU.
A lot of the details are over my head, so I won’t even attempt to describe it. (Not a very useful starting point. Apologies for that.)
Has anybody already looked into it, and into whether a Julia reimplementation would be useful, given that it seems it would depart from the current focus on autodiff?
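For what it's worth, my rough reading (caveat: this is my own toy interpretation, not the authors' code, and all names below are made up) is that the core trick is locality-sensitive hashing over each neuron's weight vector, so a given input only computes activations for the handful of neurons it hash-collides with, instead of the full dense layer. A minimal SimHash-style sketch of that idea:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_neurons, n_bits = 16, 1000, 8          # toy sizes, chosen arbitrarily

W = rng.standard_normal((n_neurons, d))      # one weight vector per neuron
planes = rng.standard_normal((n_bits, d))    # random hyperplanes for SimHash

def simhash(v):
    # Sign pattern of v against the random hyperplanes -> integer bucket id.
    bits = (planes @ v) > 0
    return int(bits.dot(1 << np.arange(n_bits)))

# Index every neuron into a hash table: bucket id -> list of neuron ids.
table = {}
for i, w in enumerate(W):
    table.setdefault(simhash(w), []).append(i)

def sparse_forward(x):
    # Only compute dot products for neurons in the same bucket as x --
    # these are the neurons whose weight vectors point in a similar
    # direction, i.e. the ones likely to have large activations.
    active = table.get(simhash(x), [])
    return active, W[active] @ x

x = rng.standard_normal(d)
active, acts = sparse_forward(x)
```

With 8 hash bits there are 256 buckets, so on average only ~4 of the 1000 neurons get touched per input, which is (as I understand it) where the CPU speedup comes from. The real system presumably uses multiple tables and rehashing during training, so take this only as a sketch of the idea.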