What is the fastest dimension ordering for convolutions?

I am wondering what the fastest dimension ordering for convolution operations on the GPU using cuDNN currently is. Flux uses WHCN (column major), PyTorch uses NCHW (row major), which is exactly the same memory layout. That is, the color values of a pixel are stored “scattered” in memory with a certain stride. This post gives an illustration and suggests that a CWHN (column major) memory layout offers performance gains. In CWHN layout all colors of a pixel are stored together, followed by the next pixel. Are Flux or Knet flexible in this respect?
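To make the equivalence concrete, here is a small sketch (with hypothetical sizes) of the flat memory offsets implied by each ordering. It shows that row-major NCHW and column-major WHCN address the same bytes, and that only in CWHN are the color values of a pixel adjacent:

```python
# Hypothetical small sizes; N = batch, C = channels, H = height, W = width.
N, C, H, W = 2, 3, 4, 5

def offset_nchw(n, c, h, w):
    # Row-major (PyTorch): last index varies fastest.
    return ((n * C + c) * H + h) * W + w

def offset_whcn(w, h, c, n):
    # Column-major (Julia/Flux): first index varies fastest.
    return w + W * (h + H * (c + C * n))

def offset_cwhn(c, w, h, n):
    # Column-major CWHN: channels innermost (same bytes as row-major NHWC).
    return c + C * (w + W * (h + H * n))

# NCHW (row-major) and WHCN (column-major) address the same byte:
assert offset_nchw(1, 2, 3, 4) == offset_whcn(4, 3, 2, 1)

# In WHCN the color values of one pixel are W*H elements apart...
assert offset_whcn(0, 0, 1, 0) - offset_whcn(0, 0, 0, 0) == W * H
# ...while in CWHN they sit next to each other:
assert offset_cwhn(1, 0, 0, 0) - offset_cwhn(0, 0, 0, 0) == 1
```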

The background is that I’m trying to get rid of unnecessary array permutations when feeding a CNN. To be honest I didn’t benchmark to find out how much of a problem this is, but I like the concept of having the same memory layout for augmentation and training (and maybe even on disk). Most image processing packages (or at least Augmentor) work on abstract matrices, i.e. 2D arrays of e.g. RGB pixels. Such an array can be reinterpreted as a CWH array, since the color values are stored adjacent in memory. I assume this layout exists for a reason and that e.g. array interpolations are fastest with it. So: can the two be brought in line?
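The reinterpretation is only a change of strides, not a copy. A numpy analogue of what `reinterpret` does in Julia (with a hypothetical 4×5 image) looks like this:

```python
import numpy as np

# Hypothetical 4x5 RGB image; in C order the three color values of each
# pixel sit next to each other in memory (the "pixels first" layout).
img = np.arange(4 * 5 * 3, dtype=np.float32).reshape(4, 5, 3)

# Viewing it channels-first is just a stride change, no data is copied:
chw = img.transpose(2, 0, 1)
assert np.shares_memory(img, chw)

# The underlying bytes are untouched; only the indexing convention changed:
assert chw[1, 2, 3] == img[2, 3, 1]
```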

The answer to this question is partly given in the pytorch forum here.

In short: the memory access pattern of WHCN is better for some convolution implementations, but worse for batch norm. Since models mix different kinds of layers, the overall effect of collated data access depends on the layer mix.

The story changes somewhat with tensor cores, as they are faster with the CWHN (i.e. NHWC) layout; see this PDF, slide 11 and the last slide.
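On the PyTorch side this layout is exposed as the `channels_last` memory format: the tensor keeps its logical NCHW shape, but the strides are rearranged so the channel dimension is innermost. A minimal sketch:

```python
import torch

# Logical NCHW tensor (hypothetical sizes).
x = torch.randn(2, 3, 4, 5)

# Same logical shape, but bytes reordered so channels are contiguous:
y = x.to(memory_format=torch.channels_last)
assert y.shape == x.shape
assert y.stride()[1] == 1  # channel dimension now has stride 1
```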

Edit: Related GitHub issues for PyTorch here and here.