I’m building a small neural network to predict a one-dimensional output from 5 input features. Simplified reproducing example:
using Flux, Random

function build_model(mid_layer_size, seed)
    Random.seed!(seed)
    return Chain(
        Dense(5 => 8, relu),
        Dense(8 => mid_layer_size, relu),
        Dense(mid_layer_size => 1, relu),
    )
end
Usually, this works just fine: gradients propagate, the model trains, and the final loss is small-ish. But if I pick a bad model size or random seed, all the gradients are zero:
julia> let model = build_model(60, 123)
           Flux.gradient(params(model)) do
               Flux.Losses.mse(model(X), Y')
           end |> collect
       end
6-element Vector{Any}:
Float32[-0.20297855 -0.06348246 … -0.06014689 -0.13047417; 0.03799633 0.04431212 … 0.00032865477 0.03499769; … ; 0.06786294 0.016918216 … 0.0023034615 0.05827938; 0.03681226 -0.01044857 … 0.08957804 0.01665652]
Float32[0.25475615, -0.017182503, -0.0042427997, 0.30430275, 0.05694324, -0.006684456, 0.007670869, -0.01128939]
Float32[-0.0033954177 0.022214556 … 0.05395927 0.0024106728; 0.0430058 -0.0035434365 … -0.04696391 0.04170979; … ; -0.028552402 0.002109989 … 0.03214287 -0.031267297; 0.0050330604 -6.393007f-6 … 0.0014786139 0.0]
Float32[0.02659651, 0.055861957, 0.022643797, 0.13066903, 0.086278394, 0.029695854, 0.048429545, 0.010760561, 0.14349356, 0.0024125827 … 0.009284487, -0.04251157, 0.17795762, -0.013358652, 0.18663743, -0.016749624, 0.0062961443, 0.029387519, -0.03841583, 0.010253376]
Float32[-0.086628556 0.1706067 … 0.25115177 0.0004147935]
Float32[0.7515923]
julia> let model = build_model(60, 1234)
           Flux.gradient(params(model)) do
               Flux.Losses.mse(model(X), Y')
           end |> collect
       end
6-element Vector{Any}:
Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]
Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]
Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Float32[0.0 0.0 … 0.0 0.0]
Float32[0.0]
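My guess at a mechanism (not verified): since the last layer is Dense(mid_layer_size => 1, relu), every gradient would vanish if that final relu's pre-activation were negative on all samples, pinning the model output at zero. A quick sketch to test for that:

# Sketch: does the "bad" model output exactly zero for every sample?
# If so, the final relu never activates and all mse gradients vanish.
model = build_model(60, 1234)
all(iszero, model(X))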
It looks like ~10% of random seeds produce all-zero gradients with a mid_layer_size of 60, though whether the gradients vanish depends on both model size and seed: e.g. seed=1234 produces all zeros with a mid_layer_size of 60, but not with 59 or 61.
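A scan along these lines is the kind of thing that produces that estimate (a rough sketch; zero_gradient_fraction is just an illustrative name, and the seed range is arbitrary):

function zero_gradient_fraction(mid_layer_size; nseeds = 100)
    nzero = 0
    for seed in 1:nseeds
        model = build_model(mid_layer_size, seed)
        gs = Flux.gradient(params(model)) do
            Flux.Losses.mse(model(X), Y')
        end
        # Count this seed if every parameter's gradient is exactly zero.
        nzero += all(g -> all(iszero, g), collect(gs))
    end
    return nzero / nseeds
end

zero_gradient_fraction(60)  # the ~10% figure above came from a scan like this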
Any tips on avoiding these pathological gradients, or what causes them to arise? Thanks!
The data is unexciting, low-dimensional, and real-world (and slightly noisy, so unlikely to be particularly pathological). I’ve standardized the features to have zero mean and unit standard deviation.
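For concreteness, the standardization is per-feature, along these lines (a sketch; Xraw is a stand-in name for the raw feature matrix):

using Statistics

# Per-feature standardization: each of the 5 rows of X ends up with
# zero mean and unit standard deviation across all samples.
X = (Xraw .- mean(Xraw; dims = 2)) ./ std(Xraw; dims = 2)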
julia> X
5×6768 Matrix{Float64}:
 -1.11775    0.158752   -1.09075   -0.782232  …  -0.546986  -1.19681   -0.65304
  0.0850559  0.0850559   0.0850559  3.1748    …   0.0850559  0.0850559  0.0850559
 -0.510957  -0.0790965  -0.49218   -0.210532  …  -0.216791  -0.404556  -0.0165081
 -0.277386   2.21909    -0.277386  -0.277386  …  -0.277386  -0.277386  -0.277386
 -0.697098  -0.433558   -0.694995  -0.448978  …  -0.12446   -0.688687  -0.32632
julia> Y'
1×6768 adjoint(::Vector{Float64}) with eltype Float64:
-0.834126 0.229278 -0.51893 -0.699388 … -0.734943 0.246353 -1.00697 -1.08206