(Flux) certain arbitrary model sizes and random seeds make gradients exactly zero

I’m building a small neural network to predict a one-dimensional output from 5 input features. Simplified reproducing example:

using Flux, Random

function build_model(mid_layer_size, seed)
    Random.seed!(seed)
    return Chain(
        Dense(5 => 8, relu),
        Dense(8 => mid_layer_size, relu),
        Dense(mid_layer_size => 1, relu),
    )
end

Usually this works just fine: gradients propagate, the model trains, and the final loss is small-ish :cake:. But for certain combinations of model size and random seed, all the gradients come out exactly zero:

julia> let model = build_model(60, 123)
           Flux.gradient(params(model)) do
               Flux.Losses.mse(model(X), Y')
           end |> collect
       end
6-element Vector{Any}:
 Float32[-0.20297855 -0.06348246 … -0.06014689 -0.13047417; 0.03799633 0.04431212 … 0.00032865477 0.03499769; … ; 0.06786294 0.016918216 … 0.0023034615 0.05827938; 0.03681226 -0.01044857 … 0.08957804 0.01665652]
 Float32[0.25475615, -0.017182503, -0.0042427997, 0.30430275, 0.05694324, -0.006684456, 0.007670869, -0.01128939]
 Float32[-0.0033954177 0.022214556 … 0.05395927 0.0024106728; 0.0430058 -0.0035434365 … -0.04696391 0.04170979; … ; -0.028552402 0.002109989 … 0.03214287 -0.031267297; 0.0050330604 -6.393007f-6 … 0.0014786139 0.0]
 Float32[0.02659651, 0.055861957, 0.022643797, 0.13066903, 0.086278394, 0.029695854, 0.048429545, 0.010760561, 0.14349356, 0.0024125827  …  0.009284487, -0.04251157, 0.17795762, -0.013358652, 0.18663743, -0.016749624, 0.0062961443, 0.029387519, -0.03841583, 0.010253376]
 Float32[-0.086628556 0.1706067 … 0.25115177 0.0004147935]
 Float32[0.7515923]

julia> let model = build_model(60, 1234)
           Flux.gradient(params(model)) do
               Flux.Losses.mse(model(X), Y')
           end |> collect
       end
6-element Vector{Any}:
 Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]
 Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
 Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]
 Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
 Float32[0.0 0.0 … 0.0 0.0]
 Float32[0.0]

It looks like ~10% of random seeds produce all-zero gradients with a mid_layer_size of 60, though whether this happens depends on both the model size and the seed. E.g. seed=1234 produces all zeros with a mid_layer_size of 60, but not with 59 or 61.
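(For reference, here is roughly the kind of sweep I used to arrive at that ~10% figure. It is just a sketch, reusing X, Y, and build_model from above; the exact fraction will of course depend on the data.)

# Count seeds whose gradients come out exactly zero for a given layer size.
# Reuses X, Y, and build_model from above.
function all_zero_gradients(mid_layer_size, seed)
    model = build_model(mid_layer_size, seed)
    grads = Flux.gradient(Flux.params(model)) do
        Flux.Losses.mse(model(X), Y')
    end |> collect
    return all(g -> g === nothing || all(iszero, g), grads)
end

count(seed -> all_zero_gradients(60, seed), 1:100)  # roughly 10 out of 100 for my data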

Any tips on avoiding these pathological gradients, or what causes them to arise? Thanks!

The data is unexciting, low-dimensional, real-world (and slightly noisy, so unlikely to be particularly pathological). I’ve standardized the features to zero mean and unit standard deviation.

julia> X
5×6768 Matrix{Float64}:
 -1.11775     0.158752   -1.09075    -0.782232  …  -0.546986   -1.19681    -0.65304
  0.0850559   0.0850559   0.0850559   3.1748        0.0850559   0.0850559   0.0850559
 -0.510957   -0.0790965  -0.49218    -0.210532     -0.216791   -0.404556   -0.0165081
 -0.277386    2.21909    -0.277386   -0.277386     -0.277386   -0.277386   -0.277386
 -0.697098   -0.433558   -0.694995   -0.448978     -0.12446    -0.688687   -0.32632

julia> Y'
1×6768 adjoint(::Vector{Float64}) with eltype Float64:
 -0.834126  0.229278  -0.51893  -0.699388  …  -0.734943  0.246353  -1.00697  -1.08206

It looks like that last relu will make all the gradients zero if the last Dense outputs a negative value, which should happen about 50% of the time. Try skipping it or setting the activation to identity.
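For example, something along these lines: the same chain, but with the output layer left linear (Dense defaults to identity when no activation is given):

function build_model(mid_layer_size, seed)
    Random.seed!(seed)
    return Chain(
        Dense(5 => 8, relu),
        Dense(8 => mid_layer_size, relu),
        Dense(mid_layer_size => 1),  # no relu: identity activation by default
    )
end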


This seems like it, thank you! Real facepalm as well.
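(Indeed, a quick sanity check for the bad seed: given the all-zero gradients above, the output unit's pre-activation should sit in the flat region of the relu for every sample. A sketch of that check, reusing build_model and X from above:)

# Sanity-check the diagnosis for the bad seed: look at the output unit's
# pre-activation, i.e. the last Dense applied without its relu.
let model = build_model(60, 1234)
    hidden = model[1:2](X)                  # output of the second relu layer
    out    = model[3]                       # the final Dense(mid_layer_size => 1, relu)
    pre    = out.weight * hidden .+ out.bias
    all(pre .<= 0)                          # true ⇒ relu is flat for every sample ⇒ all-zero gradients
end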

Do you know why this zero-gradient phenomenon only happens for ~10% of random seeds rather than ~50%, and why it depends on the network size? Something to do with how Flux initializes the random weights? I’d naively think that with a single relu output, 50% of the time the random weights would put the initial solution in the flat (< 0) region.

(Initially I thought this was also happening with sigmoid activations, but not quite: with sigmoids the gradients can be very small, depending on the seed, but never exactly zero.)


Happens to everyone. I suppose this is why rubber ducks have become so popular among programmers (or at least the meme is popular; I’ve never seen anyone actually use one for real).

I don’t have an explanation for the 10% vs. 50%. How many seeds did you try? It could be an interesting exercise to work out the expected fraction for a given network architecture and initialization. Maybe it has already been done in some “dead neuron” paper.
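A rough sketch of such an experiment, under a couple of assumptions I’m making up here (standard-normal dummy inputs instead of your real X, and the helper name is just for illustration): the gradients vanish essentially when the output unit’s pre-activation is ≤ 0 for every sample, so one can estimate how often that happens at initialization for different layer sizes.

using Flux, Random

# Estimate how often the output unit is "dead" at initialization, i.e. its
# pre-activation is <= 0 for every sample, which is what makes all the
# gradients exactly zero through the final relu.
function dead_output_fraction(mid_layer_size; nseeds = 500, nsamples = 1000)
    Xdummy = randn(Float32, 5, nsamples)      # stand-in for the real (standardized) data
    count(1:nseeds) do seed
        Random.seed!(seed)
        pre = Chain(                          # same architecture, minus the final
            Dense(5 => 8, relu),              # relu, so pre(Xdummy) is the raw
            Dense(8 => mid_layer_size, relu), # pre-activation of the output unit
            Dense(mid_layer_size => 1),
        )
        all(pre(Xdummy) .<= 0)
    end / nseeds
end

dead_output_fraction.([10, 30, 60, 120])      # how does the fraction scale with layer size?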