# Using real NCHW order with cuDNN.jl

I am trying to use cuDNN.jl for GPU accelerated convolution.

I used the function `cuDNN.cudnnConvolutionForward`. It has a keyword argument `format` that specifies the order of dimensions (`format=cuDNN.CUDNN_TENSOR_NHWC` or `format=cuDNN.CUDNN_TENSOR_NCHW`). However, since Julia arrays are column-major, the dimensions are interpreted in the opposite order, while my data is in real NCHW order (not reversed).
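To illustrate the mismatch (sizes chosen to match my inputs below): an array that PyTorch or the cuDNN C API would describe as NCHW `(32, 16, 64, 64)` shows up in Julia with the dimensions reversed.

```julia
# NCHW shape as the cuDNN C API sees it
nchw = (32, 16, 64, 64)            # (N, C, H, W)

# The same memory, described with Julia's column-major convention,
# has the dimensions in reverse (WHCN) order.
x = rand(Float32, reverse(nchw)...)
println(size(x))                   # (64, 64, 16, 32) = (W, H, C, N)
```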

I used `permutedims` as a workaround:

```julia
using CUDA, cuDNN

function conv_cudnn(x, w, b; stride=(1, 1), padding=(0, 0), dilation=(1, 1), groups=1)
    # Move the input to the GPU and reverse NCHW -> WHCN (copies the data)
    x = CuArray(x)
    x = permutedims(x, (4, 3, 2, 1))

    # Same for the weights
    w = CuArray(w)
    w = permutedims(w, (4, 3, 2, 1))

    # Bias must be broadcastable over the channel dimension
    b = CuArray(b)
    b = reshape(b, (1, 1, length(b), 1))

    y = CUDA.@time cuDNN.cudnnConvolutionForward(w, x; bias=b, padding=padding,
        stride=stride, dilation=dilation, group=groups,
        reorderType=cuDNN.CUDNN_DEFAULT_REORDER, mode=cuDNN.CUDNN_CROSS_CORRELATION)

    # Reverse WHCN -> NCHW again (another copy)
    return permutedims(y, (4, 3, 2, 1))
end
```

My inputs look like this:

```julia
# define inputs (real NCHW order)
x = rand(32, 16, 64, 64)   # (N, C, H, W)
w = rand(32, 8, 5, 5)      # (C_out, C_in per group, kH, kW)
b = rand(32)
```

Is there a better or faster way to use cuDNN with “real” NCHW order (e.g. without using `permutedims`)?

Best regards and thank you in advance!

Note that NCHW order for a row-major API like cuDNN corresponds to the exact same memory layout as WHCN order for a column-major language like Julia, so I’m not exactly sure what constitutes “real” in this context. Perhaps you could provide some more background on why you need NCHW-ordered data in Julia; this may be an XY problem.
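A small sketch of that equivalence: flattening a column-major WHCN array produces exactly the same element sequence as flattening its NCHW permutation in row-major order.

```julia
# Small dimensions for illustration
W, H, C, N = 2, 3, 4, 5

# Column-major WHCN array (Julia's native layout)
a = reshape(collect(1:W*H*C*N), W, H, C, N)

# The same data with NCHW indexing (permutedims makes a copy)
b = permutedims(a, (4, 3, 2, 1))             # size (N, C, H, W)

# Row-major flattening of the NCHW array: last index (w) varies fastest
rowmajor = [b[n, c, h, w] for n in 1:N for c in 1:C for h in 1:H for w in 1:W]

# Identical to Julia's column-major flattening of the WHCN array
@assert rowmajor == vec(a)
```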

I’ve written some of the well-known deep learning algorithms in Julia (convolution, (adaptive) pooling, etc.), just because I wanted to see how deep learning works at a low level. Because I came to Julia from PyTorch, I kept the NCHW order; when I started with Julia, I didn’t know about the difference between row-major and column-major ordering. For testing purposes, I checked my implementations against PyTorch using `PyCall`, and I wanted to avoid reordering the arrays every time data crosses between Julia and PyTorch (e.g. with `permutedims`). Now I want to accelerate my implementations using CUDA.jl and cuDNN.jl, but if at all possible I’d like to avoid switching my whole system to WHCN.

Thanks for the context. You should be able to avoid switching your whole system over by using wrappers and interop utilities. The general idea is as follows:
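A minimal sketch of the wrapper idea (shown here on CPU arrays; I’d expect the same pattern to carry over to `CuArray`s, but haven’t benchmarked it): store the data in Julia’s natural WHCN layout, and use `PermutedDimsArray` to present a zero-copy NCHW-indexed view to the rest of your code. The variable names are illustrative, not from cuDNN.jl.

```julia
# Store the data in Julia's natural WHCN layout; memory-wise this is
# exactly what cuDNN's NCHW format expects.
x_whcn = rand(Float32, 64, 64, 16, 32)       # (W, H, C, N)

# Zero-copy NCHW view for the PyTorch-style parts of the code:
x_nchw = PermutedDimsArray(x_whcn, (4, 3, 2, 1))
@assert size(x_nchw) == (32, 16, 64, 64)     # (N, C, H, W)

# No data was copied: writes through the view hit the original array.
x_nchw[1, 1, 1, 1] = 42f0
@assert x_whcn[1, 1, 1, 1] == 42f0

# When calling cuDNN, pass the contiguous parent array:
@assert parent(x_nchw) === x_whcn
```

This way the cuDNN call operates on contiguous memory and only the indexing convention changes, not the data.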

Thank you for all the information. Using `PermutedDimsArray` will probably make things easier; until now I didn’t know of an elegant or built-in way to use a permuted array without copying the data.