Inconsistent gradients of BatchNorm on GPU in test mode

I encountered an issue with the gradient of BatchNorm on the GPU in test mode.
If the input array has 2, 4, or 5 dimensions, the gradient differs from the result for an array of any other dimensionality. The results on the CPU, on the other hand, stay consistent across all dimensionalities.

Here is a minimal example of this behavior:

using Flux, CUDA, Zygote

function gradient_varying_shape(m, x, n_dims, device)
    m = m |> device
    Flux.testmode!(m)   # use the stored running statistics instead of batch statistics

    # reshape the single-element input to an n_dims-dimensional array (all sizes 1)
    x = reshape(x, ntuple(i -> 1, n_dims)) |> device
    return gradient(input -> sum(m(input).^2), x)[1] |> cpu
end

model = BatchNorm(1)
x = [1f0]

for i=2:7
    cpu_gradient = gradient_varying_shape(model, x, i, cpu) 
    gpu_gradient = gradient_varying_shape(model, x, i, gpu) 
    println("n_dim=$i, cpu: $(cpu_gradient[1]), gpu: $(gpu_gradient[1])")
end

The output I get from running this code is:

n_dim=2, cpu: 1.99998, gpu: 0.0
n_dim=3, cpu: 1.99998, gpu: 1.99998
n_dim=4, cpu: 1.99998, gpu: 0.0
n_dim=5, cpu: 1.99998, gpu: 0.0
n_dim=6, cpu: 1.99998, gpu: 1.99998
n_dim=7, cpu: 1.99998, gpu: 1.99998
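
For what it's worth, the CPU value is what I would expect analytically. Here is a minimal sketch of that calculation, assuming Flux's defaults for a freshly constructed BatchNorm(1) (running mean 0, running variance 1, eps = 1f-5, and affine parameters γ = 1, β = 0):

# In test mode y = γ * (x - μ) / sqrt(σ² + ϵ) with μ = 0, σ² = 1, γ = 1, β = 0,
# so the gradient of sum(y.^2) with respect to x is 2 * γ^2 * (x - μ) / (σ² + ϵ).
x, ϵ = 1f0, 1f-5
expected = 2 * x / (1 + ϵ)   # ≈ 1.99998, matching the CPU results above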

I suspect that this issue is connected to the cuDNN library. Looking into the batch-normalization implementation there, I found this comment, which states that BatchNorm operations are only supported for 4D and 5D tensors and that 2D tensors are reshaped to 4D, i.e. exactly the dimensionalities that give a zero gradient above. Looking at the implementation itself, however, I can't find anything that looks wrong.
Is there perhaps a reverse rule missing, or implemented incorrectly, for Zygote when the cuDNN implementation is used in test mode?
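
One way to narrow this down might be to compare Zygote's GPU gradient against a simple finite-difference estimate for one of the failing dimensionalities, to check whether only the backward pass (and not the forward evaluation) is affected. A rough sketch of such a check, using the same setup as above (I haven't run this beyond the example already shown):

using Flux, CUDA, Zygote

m = Flux.testmode!(BatchNorm(1) |> gpu)
loss(x) = sum(m(x) .^ 2)

x = reshape([1f0], 1, 1, 1, 1) |> gpu      # 4D input, one of the failing cases on GPU
δ = 1f-3
fd = (loss(x .+ δ) - loss(x .- δ)) / (2δ)  # central finite-difference estimate
ad = gradient(loss, x)[1] |> cpu           # Zygote gradient through the cuDNN path
println("finite difference ≈ $fd, Zygote: $(ad[1])")

If the finite difference comes out near 1.99998 while Zygote reports 0.0, that would point at the reverse rule rather than the forward cuDNN call.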

I’m using Julia 1.9.3 with this environment:

  [052768ef] CUDA v5.0.0
  [587475ba] Flux v0.14.6
  [e88e6eb3] Zygote v0.6.66
  [02a925ec] cuDNN v1.2.0