What is the fastest dimension ordering for convolutions?

I am wondering what the fastest dimension ordering for convolution operations on the GPU using cuDNN currently is. Flux uses WHCN (column major), PyTorch uses NCHW (row major), which is exactly the same memory layout. That is, the color values of a pixel are stored “scattered” in memory with a certain stride. This post gives an illustration and suggests that a CWHN (column major) memory layout offers performance gains. In CWHN layout all colors of a pixel are stored together, followed by the next pixel. Are Flux or Knet flexible in this respect?
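To make the equivalence concrete, here is a small sketch (with hypothetical sizes) of the flat memory offsets implied by each ordering. It shows that row-major NCHW and column-major WHCN address the same bytes, and that only in CWHN are the color values of a pixel adjacent:

```python
# Hypothetical small sizes; N = batch, C = channels, H = height, W = width.
N, C, H, W = 2, 3, 4, 5

def offset_nchw(n, c, h, w):
    # Row-major (PyTorch): last index varies fastest.
    return ((n * C + c) * H + h) * W + w

def offset_whcn(w, h, c, n):
    # Column-major (Julia/Flux): first index varies fastest.
    return w + W * (h + H * (c + C * n))

def offset_cwhn(c, w, h, n):
    # Column-major CWHN: channels innermost (same bytes as row-major NHWC).
    return c + C * (w + W * (h + H * n))

# NCHW (row-major) and WHCN (column-major) address the same byte:
assert offset_nchw(1, 2, 3, 4) == offset_whcn(4, 3, 2, 1)

# In WHCN the color values of one pixel are W*H elements apart...
assert offset_whcn(0, 0, 1, 0) - offset_whcn(0, 0, 0, 0) == W * H
# ...while in CWHN they sit next to each other:
assert offset_cwhn(1, 0, 0, 0) - offset_cwhn(0, 0, 0, 0) == 1
```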

The background is that I’m trying to get rid of unnecessary array permutations when feeding a CNN. To be honest I didn’t benchmark to find out how much of a problem this is, but I like the concept of having the same memory layout for augmentation and training (and maybe even on disk). Most image processing packages (or at least Augmentor) work on abstract matrices, i.e. 2D arrays of e.g. RGB pixels. Such an array can be reinterpreted as a CWH array, since the color values are stored adjacent in memory. I assume this layout exists for a reason and that e.g. array interpolations are fastest with it. So: can the two be brought in line?
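The reinterpretation is only a change of strides, not a copy. A numpy analogue of what `reinterpret` does in Julia (with a hypothetical 4×5 image) looks like this:

```python
import numpy as np

# Hypothetical 4x5 RGB image; in C order the three color values of each
# pixel sit next to each other in memory (the "pixels first" layout).
img = np.arange(4 * 5 * 3, dtype=np.float32).reshape(4, 5, 3)

# Viewing it channels-first is just a stride change, no data is copied:
chw = img.transpose(2, 0, 1)
assert np.shares_memory(img, chw)

# The underlying bytes are untouched; only the indexing convention changed:
assert chw[1, 2, 3] == img[2, 3, 1]
```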

The answer to this question is partly given in the pytorch forum here.

In short: the memory access pattern of WHCN is better for some convolution implementations, but worse for batch norm. Since models mix different kinds of layers, the overall effect of collated data access depends on the layer mix.

The story changes somewhat with tensor cores, as they are faster with the CWHN (i.e. NHWC) layout; see this PDF, slide 11 and the last slide.
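On the PyTorch side this layout is exposed as the `channels_last` memory format: the tensor keeps its logical NCHW shape, but the strides are rearranged so the channel dimension is innermost. A minimal sketch:

```python
import torch

# Logical NCHW tensor (hypothetical sizes).
x = torch.randn(2, 3, 4, 5)

# Same logical shape, but bytes reordered so channels are contiguous:
y = x.to(memory_format=torch.channels_last)
assert y.shape == x.shape
assert y.stride()[1] == 1  # channel dimension now has stride 1
```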

Edit: Related GitHub issues for PyTorch here and here.