Knet vs Flux etc

Sounds good. I think there are three sources of potential speed-up:

  • AD: AutoGrad vs Flux vs Zygote vs Capstan etc.
  • Alloc: KnetArray vs CuArray vs CPU etc.
  • Kernels: Knet kernels vs CUDAnative/Flux vs CPU etc.

My GPU experiments vary all three components at once, which makes it difficult to pinpoint causes. My CPU experiments vary only the AD, so they can give us some clues right away. I think I can easily run Knet with the CuArray allocator and kernels, which should give another AD comparison. Your suggestion of using the CuArray allocator with Knet kernels should highlight allocator differences. This is a bit more difficult to implement (the kernels dispatch on the KnetArray type) but doable. We can probably figure out other combinations of the above three components that will inform the optimization work.
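
To make "other combinations" a bit more concrete, here is a rough sketch (in Python, purely as a planning aid) that enumerates the three components and lists the pairs of configurations that differ in exactly one of them, since those are the comparisons that isolate a single cause. The option labels are just strings taken from the list above, the helper names are made up for illustration, and nothing here runs a benchmark. It also assumes every combination is feasible, which is probably not true in practice (e.g. CPU allocation with GPU kernels), so the output would need pruning by hand.

```python
from itertools import combinations, product

# Labels copied from the three bullet points; they are plain strings,
# not references to any real package API.
COMPONENTS = {
    "ad": ["AutoGrad", "Flux", "Zygote"],
    "alloc": ["KnetArray", "CuArray", "CPU"],
    "kernels": ["Knet", "CUDAnative/Flux", "CPU"],
}


def all_configs(components):
    """Return every combination of options as a dict (full factorial grid)."""
    names = list(components)
    return [dict(zip(names, combo)) for combo in product(*components.values())]


def differing(a, b):
    """Return the names of the components on which two configs disagree."""
    return [name for name in a if a[name] != b[name]]


def controlled_pairs(configs):
    """Return (component, config_a, config_b) for pairs differing in one component."""
    pairs = []
    for a, b in combinations(configs, 2):
        diff = differing(a, b)
        if len(diff) == 1:
            pairs.append((diff[0], a, b))
    return pairs


configs = all_configs(COMPONENTS)
pairs = controlled_pairs(configs)

# Example: show the comparisons that isolate the allocator.
for component, a, b in pairs:
    if component == "alloc":
        print(a, "vs", b)
```

With three options per component this gives 27 configurations, and the filter on `"alloc"` picks out the pairs where AD and kernels are held fixed, which is the kind of comparison the CuArray-allocator-with-Knet-kernels experiment is after.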
