Flux normalization and regularisation


I am using a Flux model for regression with L1 regularization like below:

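using Flux

model = Chain(Dense(size(X_train, 1), 254, relu),
              Dense(254, 254, relu),
              Dense(254, 1, identity))
losses = []
l1norm(x) = sum(abs, x)
# Squared-error fit term plus an L1 penalty over all parameters
loss() = sum(abs2, model(X_train)' - Y_train) + sum(l1norm, Flux.params(model))
cb = function ()
    push!(losses, loss())  # callback body assumed: record the current loss
end
Flux.train!(loss, Flux.params(model), Iterators.repeated((), 50), ADAM(0.001), cb = cb)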

(1) How do I modify the code to add regularisation at each layer?
(2) What's the best way to fit the model to standardised data? I understand that there is a normalise() function in Flux, but I am not sure how to use it in this code. Does it normalise just the input, or does it normalise the input to each layer?
(3) How do I make predictions if I fit the model to normalised input?



The sum(l1norm, Flux.params(model)) part of the loss already looks like a regularisation over every layer to me. Flux.params collects all parameter arrays of the model (weights and biases) and l1norm is applied to each of them, so the result is the sum of the absolute values of all parameters in the model.

If you want to access a specific layer and add a special regularization for that you could do something like

function loss(...)
    alpha = ... # Regularization factor
    fit_loss = ...
    layer2params = Flux.params(model[2]) # All params for layer 2 in model
    layer2weights = layer2params[1] # Only take weights, not biases
    regul_loss = some_reduction_function(layer2weights)
    return fit_loss + alpha * regul_loss
end
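
As a concrete version of the sketch above, here is a minimal, Flux-free illustration in which plain matrices stand in for the layer weights. The names layer_weights, alphas and layerwise_l1 are made up for this example, and the numbers are arbitrary. In a real model each matrix would come from something like Flux.params(model[i])[1].

```julia
# Hypothetical stand-ins for the weight matrices of a three-layer model.
layer_weights = [
    [1.0 -2.0; 0.5 0.0],    # "layer 1" weights
    [-1.0 1.0; 2.0 -0.5],   # "layer 2" weights
    [0.25 -0.75],           # "layer 3" weights
]

# One regularisation factor per layer (assumed values).
alphas = [0.1, 0.01, 0.0]

l1norm(x) = sum(abs, x)

# Weighted sum of the per-layer L1 penalties.
layerwise_l1(ws, αs) = sum(α * l1norm(w) for (α, w) in zip(αs, ws))

penalty = layerwise_l1(layer_weights, alphas)
```

With all factors set to the same value this reduces to a single global L1 penalty on the weights, and a factor of zero switches the penalty off for that layer.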


The Flux normalise function looks like this:

@inline function normalise(x::AbstractArray; dims=ndims(x), ϵ=ofeltype(x, 1e-5))
  μ = mean(x, dims=dims)
  σ = std(x, dims=dims, mean=μ, corrected=false)
  return @. (x - μ) / (σ + ϵ)
end

By default it computes the mean and standard deviation over the last dimension. It does not do anything between the layers; it is a preprocessing step you apply before you feed the data to your model.
If you want normalization inside the model, have a look at the normalization layers in the Flux documentation.
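
To see what the default dims does, here is a small sketch using a local copy of the function. my_normalise is a hypothetical name, and ϵ is hard-coded rather than using Flux's internal ofeltype helper. It assumes the usual features × samples layout.

```julia
using Statistics

# Local re-implementation mirroring the definition quoted above.
function my_normalise(x::AbstractArray; dims=ndims(x), ϵ=1e-5)
    μ = mean(x, dims=dims)
    σ = std(x, dims=dims, mean=μ, corrected=false)
    return @. (x - μ) / (σ + ϵ)
end

# 2 features × 4 samples
X = [1.0 2.0 3.0 4.0;
     10.0 20.0 30.0 40.0]

# For a matrix the default is dims = 2, so each feature (row) is
# standardised across the samples (columns).
Xn = my_normalise(X)
row_means = mean(Xn, dims=2)
row_stds  = std(Xn, dims=2, corrected=false)
```

Each row of Xn ends up with mean close to 0 and standard deviation close to 1. If your data is stored as samples × features, you would pass dims=1 instead.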


You would also need to normalize the data you want to predict on, and I would normalize it with the same transformation as for the training data. So save the mean and standard deviation used to normalise the training data, and use those to transform the prediction data as well by feeding it through @. (x - μ) / (σ + ϵ).
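
A minimal sketch of that workflow, assuming a features × samples layout. fit_standardiser and standardise are hypothetical helper names, not Flux functions.

```julia
using Statistics

# Compute per-feature statistics on the training data only.
function fit_standardiser(X; ϵ=1e-5)
    μ = mean(X, dims=2)
    σ = std(X, dims=2, mean=μ, corrected=false)
    return (μ=μ, σ=σ, ϵ=ϵ)
end

# Apply stored statistics to any data set with the same features.
standardise(X, s) = @. (X - s.μ) / (s.σ + s.ϵ)

X_train = [1.0 2.0 3.0 4.0;
           10.0 20.0 30.0 40.0]
X_new   = [2.5 5.0;
           25.0 0.0]

stats = fit_standardiser(X_train)
X_train_std = standardise(X_train, stats)
X_new_std   = standardise(X_new, stats)   # reuses the training μ and σ
```

The model is then trained on X_train_std and predictions are made with model(X_new_std). If the target was standardised as well, the predictions have to be mapped back with the target's own stored mean and standard deviation.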


Thanks @albheim. Another question, maybe a technical one: what would be the advantage(s) of normalising each layer rather than the data? I come from a statistical background where we usually normalise/standardise the data and fit a model to that data, so I am curious to know the benefits of normalising at each layer.

Which is the best practice in deep learning: normalising the data, normalising the model at each layer, or both?

My feeling is that normalising the data is something that should pretty much always be done, while normalising layers is something I have only encountered a few times in some specific models, so it is probably not as impactful in general. I guess it is supposed to have a similar effect, but it feels like doing it at the start might have a larger impact in most cases.

Thanks @albheim. Another question: is there a way to constrain the weights to be non-negative in the above model? I know the Lasso package GLMNet.jl (JuliaStats/GLMNet.jl, a Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet) can do this, but I was wondering if deep learning models can do that as well.

Not sure if it exists, but you can always create something to enforce that yourself. Here I keep the params as an unconstrained real vector ps and apply exp.(ps) before setting them as the model parameters, which makes sure all of them (weights and biases) are positive.

using Flux

x, y = randn(2, 100), randn(1, 100)

model_shape = Chain(Dense(2, 10, tanh), Dense(10, 1))

ps, g = Flux.destructure(model_shape)
get_model(ps) = g(exp.(ps)) # Here we make sure they are positive

loss(x, y, model) = sum(abs2, model(x) .- y)

gs = Flux.gradient(ps -> loss(x, y, get_model(ps)), ps)

opt = ADAM()
Flux.update!(opt, ps, gs[1])
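
To make the mechanics of the exp trick visible without Flux, here is a toy sketch on a single linear layer with a hand-derived gradient. The data, the learning rate and the names weights, grad and fit are assumptions made for this illustration only.

```julia
# Toy data: y = 2*x1 + 0.5*x2, so both true weights are positive.
X = [1.0 2.0 3.0 4.0;
     4.0 3.0 2.0 1.0]
y = [2.0 0.5] * X                     # 1 × 4 targets

θ = zeros(1, 2)                       # unconstrained parameters
weights(θ) = exp.(θ)                  # always strictly positive

loss(θ) = sum(abs2, weights(θ) * X .- y)

# Chain rule: dL/dθ = (dL/dw) .* exp.(θ)
function grad(θ)
    w = weights(θ)
    dLdw = 2 .* (w * X .- y) * X'     # 1 × 2
    return dLdw .* w
end

# Plain gradient descent on the unconstrained parameters.
function fit(θ; lr=0.002, steps=5000)
    for _ in 1:steps
        θ = θ .- lr .* grad(θ)
    end
    return θ
end

θ_fit = fit(θ)
w_fit = weights(θ_fit)
```

The optimiser only ever sees θ, which can take any real value, while the weights actually used are exp.(θ) and therefore stay positive throughout training. One consequence is that a weight can approach zero but never reach it exactly.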

Although I am not sure about other areas of deep learning, in image processing, normalization layers are used very often (basically in all recent architectures). One way to view the normalization layers is as regularization: they force the layer outputs to have certain statistical properties (like specific mean and standard deviation). Of course, there are other aspects too.

Generally, normalization layers can be beneficial and are worth investing a few minutes to study.
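
As a rough illustration of what "forcing the layer outputs to have certain statistical properties" means, here is a simplified training-mode batch normalisation step in plain Julia. batchnorm_forward is a hypothetical name and this is not Flux's BatchNorm implementation (which, among other things, also tracks running statistics for use at test time).

```julia
using Statistics

# Normalise activations (features × batch) per feature across the batch,
# then apply a learnable scale γ and shift β. ϵ avoids division by zero.
function batchnorm_forward(h, γ, β; ϵ=1e-5)
    μ = mean(h, dims=2)
    v = var(h, dims=2, mean=μ, corrected=false)
    return @. γ * (h - μ) / sqrt(v + ϵ) + β
end

# Activations of two hidden units over a batch of 4, on very different scales.
h = [0.0 2.0 4.0 6.0;
     100.0 100.5 101.0 101.5]
γ = [1.0, 2.0]             # per-feature scale (assumed values)
β = [0.0, -1.0]            # per-feature shift (assumed values)

out = batchnorm_forward(h, γ, β)
out_mean = vec(mean(out, dims=2))
out_std  = vec(std(out, dims=2, corrected=false))
```

Whatever the scale of the incoming activations, the output of each unit has mean close to β and standard deviation close to γ over the batch, so the next layer always sees inputs in a controlled range.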


A good overview of common normalization strategies I found recently is "Normalization is dead, long live normalization!" on the ICLR Blog Track.
