Inconsistent gradients of BatchNorm on GPU in test mode

I encountered an issue with the gradient of BatchNorm on the GPU in test mode.
If the input array has 2, 4, or 5 dimensions, the gradient differs from the result for an array of any other dimensionality. The results on the CPU, on the other hand, stay consistent across all dimensionalities.

Here is a minimal example of this behavior:

using Flux, CUDA, Zygote

function gradient_varying_shape(m, x, n_dims, device)
    m = m |> device
    Flux.testmode!(m)   # use the stored running statistics instead of batch statistics

    # reshape the single-element input to an n_dims-dimensional array (all sizes 1)
    x = reshape(x, ntuple(i -> 1, n_dims)) |> device
    return gradient(input -> sum(m(input).^2), x)[1] |> cpu
end

model = BatchNorm(1)
x = [1f0]

for i=2:7
    cpu_gradient = gradient_varying_shape(model, x, i, cpu) 
    gpu_gradient = gradient_varying_shape(model, x, i, gpu) 
    println("n_dim=$i, cpu: $(cpu_gradient[1]), gpu: $(gpu_gradient[1])")
end

The output I get from running this code is:

n_dim=2, cpu: 1.99998, gpu: 0.0
n_dim=3, cpu: 1.99998, gpu: 1.99998
n_dim=4, cpu: 1.99998, gpu: 0.0
n_dim=5, cpu: 1.99998, gpu: 0.0
n_dim=6, cpu: 1.99998, gpu: 1.99998
n_dim=7, cpu: 1.99998, gpu: 1.99998
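
For what it's worth, the CPU value is what I would expect analytically. Here is a minimal sketch of that calculation, assuming Flux's defaults for a freshly constructed BatchNorm(1) (running mean 0, running variance 1, eps = 1f-5, and affine parameters γ = 1, β = 0):

# In test mode y = γ * (x - μ) / sqrt(σ² + ϵ) with μ = 0, σ² = 1, γ = 1, β = 0,
# so the gradient of sum(y.^2) with respect to x is 2 * γ^2 * (x - μ) / (σ² + ϵ).
x, ϵ = 1f0, 1f-5
expected = 2 * x / (1 + ϵ)   # ≈ 1.99998, matching the CPU results above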

I suspect that this issue is connected to the cuDNN library. Looking into the batch-normalization implementation there, I found this comment, which states that BatchNorm operations are only supported for 4D and 5D tensors and that 2D tensors are reshaped to 4D, i.e. exactly the dimensionalities that give a zero gradient above. Looking at the implementation itself, however, I can't find anything that looks wrong.
Is there perhaps a reverse rule missing, or implemented incorrectly, for Zygote when the cuDNN implementation is used in test mode?
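
One way to narrow this down might be to compare Zygote's GPU gradient against a simple finite-difference estimate for one of the failing dimensionalities, to check whether only the backward pass (and not the forward evaluation) is affected. A rough sketch of such a check, using the same setup as above (I haven't run this beyond the example already shown):

using Flux, CUDA, Zygote

m = Flux.testmode!(BatchNorm(1) |> gpu)
loss(x) = sum(m(x) .^ 2)

x = reshape([1f0], 1, 1, 1, 1) |> gpu      # 4D input, one of the failing cases on GPU
δ = 1f-3
fd = (loss(x .+ δ) - loss(x .- δ)) / (2δ)  # central finite-difference estimate
ad = gradient(loss, x)[1] |> cpu           # Zygote gradient through the cuDNN path
println("finite difference ≈ $fd, Zygote: $(ad[1])")

If the finite difference comes out near 1.99998 while Zygote reports 0.0, that would point at the reverse rule rather than the forward cuDNN call.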

I’m using Julia 1.9.3 with this environment:

  [052768ef] CUDA v5.0.0
  [587475ba] Flux v0.14.6
  [e88e6eb3] Zygote v0.6.66
  [02a925ec] cuDNN v1.2.0