What is the fastest dimension ordering for convolutions?

I am wondering what the fastest dimension ordering currently is for convolution operations on the GPU using cuDNN. Flux uses WHCN (column major) and PyTorch uses NCHW (row major), which is exactly the same memory layout. In both, the color channels of a single pixel are stored "scattered" in memory, separated by a fixed stride. This post gives an illustration and suggests that the CWHN (column major) memory layout offers performance gains. In the CWHN layout, all color channels of a pixel are stored together, followed by the next pixel. Are Flux or Knet flexible in this respect?
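To make the layout claims concrete, here is a minimal sketch in plain Python (no GPU library involved). The helper names `strides` and `offset` and the small sizes are my own, chosen only for illustration. It checks that WHCN in column-major order and NCHW in row-major order address the same memory, and compares the channel stride of WHCN with that of CWHN.

```python
def strides(shape, order):
    """Element strides of a dense array.

    order "F" = column major (first index varies fastest),
    order "C" = row major (last index varies fastest).
    """
    n = len(shape)
    dims = range(n) if order == "F" else reversed(range(n))
    out = [0] * n
    step = 1
    for d in dims:
        out[d] = step
        step *= shape[d]
    return out


def offset(index, shape, order):
    """Linear memory offset of a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, strides(shape, order)))


# Hypothetical sizes, small enough to check every element.
W, H, C, N = 4, 3, 3, 2

# Element (w, h, c, n) of a column-major WHCN array and element
# (n, c, h, w) of a row-major NCHW array sit at the same offset.
same = all(
    offset((w, h, c, n), (W, H, C, N), "F")
    == offset((n, c, h, w), (N, C, H, W), "C")
    for w in range(W)
    for h in range(H)
    for c in range(C)
    for n in range(N)
)

# Stride (in elements) of each dimension in the two column-major layouts.
whcn_strides = dict(zip("WHCN", strides((W, H, C, N), "F")))
cwhn_strides = dict(zip("CWHN", strides((C, W, H, N), "F")))

print(same)          # True
print(whcn_strides)  # {'W': 1, 'H': 4, 'C': 12, 'N': 36}
print(cwhn_strides)  # {'C': 1, 'W': 3, 'H': 12, 'N': 36}
```

In WHCN the channels of one pixel are `W * H` elements apart, while in CWHN they are adjacent. This only describes where elements live in memory; it says nothing about which layout cuDNN actually runs faster on.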

The background is that I'm trying to get rid of unnecessary array permutations when feeding a CNN. To be honest, I didn't benchmark how much of a problem this is, but I like the idea of having the same memory layout for augmentation and training (and maybe even on disk). Most image processing packages (or at least Augmentor) work on abstract matrices, i.e. 2D arrays of, for example, RGB pixels. Such a matrix can be reinterpreted as a CWH array, because the color channels of each pixel are stored adjacent in memory. I assume this is for a reason, e.g. that array interpolations are fastest with this data layout. So: can this be brought in line?
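The permutation I would like to avoid can be sketched in plain Python as follows. This is only an illustration of the memory reordering, not how Augmentor or Flux implement it; the names `pixels`, `interleaved`, `cwh_at` and `to_whc` and the pixel values are made up.

```python
# Hypothetical tiny image: W x H pixels with C color channels each.
W, H, C = 3, 2, 3

# Pixel (w, h) holds the channel values (r, g, b); the values encode
# their own position so that mix-ups are easy to spot.
pixels = {
    (w, h): tuple(100 * c + 10 * h + w for c in range(C))
    for w in range(W)
    for h in range(H)
}

# Memory of a column-major W x H matrix of RGB pixels: w varies fastest,
# and the C channel values of each pixel are stored next to each other.
interleaved = [v for h in range(H) for w in range(W) for v in pixels[(w, h)]]


def cwh_at(buf, c, w, h):
    """Read the same buffer as a column-major C x W x H array (no copy)."""
    return buf[c + C * (w + W * h)]


def to_whc(buf):
    """Copy into column-major W x H x C order (the leading part of WHCN)."""
    return [
        cwh_at(buf, c, w, h)
        for c in range(C)
        for h in range(H)
        for w in range(W)
    ]


planar = to_whc(interleaved)

# The reinterpretation as CWH is free: it indexes the original buffer.
assert all(
    cwh_at(interleaved, c, w, h) == pixels[(w, h)][c]
    for c in range(C)
    for w in range(W)
    for h in range(H)
)

# The WHC version holds the same values in a different order, so every
# element had to be moved to build it.
assert planar != interleaved
```

The CWH view costs nothing, whereas producing WHC touches every element once per image. Whether that copy matters in practice compared to the convolution itself is exactly what I have not benchmarked.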