Flux normalization and regularisation

Hi

I am using a Flux model for regression with L1 regularization like below:

model = Chain(Dense(size(X_train, 1), 254, relu),
              Dense(254, 254, relu),
              Dense(254, 1, identity))
losses = []
l1norm(x) = sum(abs, x)
loss() = sum(abs2, model(X_train)' - Y_train) + sum(l1norm, Flux.params(model))
cb = function ()
    push!(losses, loss())
    print(".")
end
Flux.train!(loss, Flux.params(model), Iterators.repeated((), 50), ADAM(0.001), cb = cb)

Questions:
(1) How do I modify the code to add regularisation at each layer?
(2) What's the best way to fit the model to standardised data? I understand there is a normalise() function in Flux, but I am not sure how to use it in the code. Does it normalise just the input, or does it normalise the input at each layer?
(3) How do I do prediction if I fit the model to normalised input?

Thanks

1

This part of the loss, sum(l1norm, Flux.params(model)), already looks like regularisation over every layer to me: Flux.params extracts all the parameter arrays, l1norm is applied to each of them, and the results are summed, so you end up with the sum of the absolute values of all parameters in the model.

If you want to access a specific layer and add a special regularisation term for that layer, you could do something like

function loss(...)
    alpha = ... # Regularization factor
    fit_loss = ...
    layer2params = Flux.params(model[2]) # All params for layer 2 in model
    layer2weights = layer2params[1] # Only take weights, not biases
    regul_loss = some_reduction_function(layer2weights)
    return fit_loss + alpha * regul_loss
end
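
To make that concrete, here is a hedged sketch using the model and training data from the original post, with a made-up alpha, that keeps the squared-error fit loss and penalises only the weight matrix of the second layer (it keeps the zero-argument loss signature so it still works with the Flux.train! call above):

alpha = 0.01                                   # hypothetical regularisation factor

function loss()
    fit_loss = sum(abs2, model(X_train)' - Y_train)
    layer2weights = Flux.params(model[2])[1]   # weight matrix of the second Dense layer, biases excluded
    return fit_loss + alpha * sum(abs, layer2weights)
end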

2

The Flux normalise function looks like this:

@inline function normalise(x::AbstractArray; dims=ndims(x), ϵ=ofeltype(x, 1e-5))
  μ = mean(x, dims=dims)
  σ = std(x, dims=dims, mean=μ, corrected=false)
  return @. (x - μ) / (σ + ϵ)
end

It computes the mean and standard deviation over the last dimension by default; it does not do anything between the layers, but is rather something you apply in the pipeline before feeding the data to your model.
If you want normalisation inside the model, you might want to have a look at the normalization layers here.
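
For example, a minimal sketch (the data shape is made up; with observations as columns, dims=2 standardises each feature over the observations, and BatchNorm is one of the normalisation layers you can put inside the model):

using Flux

X = rand(Float32, 10, 100)          # hypothetical data: 10 features, 100 observations (columns)

# Normalise the data before training: per-feature mean/std over the observations
Xn = Flux.normalise(X, dims=2)

# Normalisation inside the model instead: e.g. a BatchNorm layer between Dense layers
model = Chain(Dense(10, 32, relu), BatchNorm(32), Dense(32, 1))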

3

You would need to normalise the data you want to predict on as well, and I would normalise it with the same transformation as the training data. So save the mean and standard deviation used to normalise the training data, and use those to transform the prediction data too, by feeding it through @. (x - μ) / (σ + ϵ).
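
A minimal sketch of what that could look like, assuming features are rows and observations are columns (as in the original post), and with X_test standing in for whatever data you want to predict on:

using Statistics

μ = mean(X_train, dims=2)
σ = std(X_train, dims=2)

X_train_norm = @. (X_train - μ) / (σ + 1e-5)   # fit the model on this
X_test_norm  = @. (X_test  - μ) / (σ + 1e-5)   # same transformation at prediction time

predictions = model(X_test_norm)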


Thanks @albheim. Another question, maybe a technical one. What would be the advantage(s) of normalising each layer, rather than the data? I come from a statistical background where we usually normalise/standardise the data and fit a model to it, so I am curious to know the benefits of normalising at each layer.

Which is best practice in deep learning: to normalise the data, to normalise the model at each layer, or both?
Thanks

My feeling is that normalising the data is something that should pretty much always be done, whereas normalising layers is more of a thing I have encountered a few times in some specific models, so probably not as impactful in general. I guess it is supposed to have a similar effect, but it feels like doing it at the start might have a larger impact in most cases.

Thanks @albheim. Another question. Is there a way to constrain the weights to be non-negative in the above model? I know the Lasso in GLMNet.jl (GitHub - JuliaStats/GLMNet.jl: Julia wrapper for fitting Lasso/ElasticNet GLM models using glmnet) can do this, but I was wondering if deep learning models can do that as well.
Thanks

Not sure if it exists, but you can always create something to enforce that yourself. Here I just keep the parameters as a real vector and apply exp.(ps) before rebuilding the model, which makes all of them (weights and biases alike) positive.

x, y = randn(2, 100), randn(1, 100)

model_shape = Chain(Dense(2, 10, tanh), Dense(10, 1))

ps, g = Flux.destructure(model_shape)
get_model(ps) = g(exp.(ps)) # Here we make sure they are positive

loss(x, y, model) = sum(abs2, model(x) .- y)

gs = Flux.gradient(ps -> loss(x, y, get_model(ps)), ps)

opt = ADAM()
Flux.update!(opt, ps, gs[1])
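
For completeness, a sketch of how this could be iterated (reusing ps, opt, get_model and loss from above); this is not a built-in Flux feature, just the reparameterisation applied in a loop:

for epoch in 1:200
    gs = Flux.gradient(p -> loss(x, y, get_model(p)), ps)
    Flux.update!(opt, ps, gs[1])
end

trained_model = get_model(ps)   # every parameter is exp of a real number, hence strictly positive
trained_model(x)                # predictions from the constrained model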

Although I am not sure about other areas of deep learning, in image processing, normalization layers are used very often (basically in all recent architectures). One way to view the normalization layers is as regularization: they force the layer outputs to have certain statistical properties (like specific mean and standard deviation). Of course, there are other aspects too.

Generally, normalization layers can be beneficial and are worth investing a few minutes to study.
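
As a small illustration of that statistical property (a sketch, not from the thread): the output of a BatchNorm layer in training mode has roughly zero mean and unit standard deviation per feature over the batch:

using Flux, Statistics

bn = BatchNorm(4)                  # 4 features/channels, affine params start at γ = 1, β = 0
Flux.trainmode!(bn)                # force use of batch statistics

x = randn(Float32, 4, 64)          # 4 features, 64 observations
y = bn(x)

mean(y, dims=2), std(y, dims=2)    # ≈ 0 and ≈ 1 for each feature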

A good overview of common normalization strategies I found recently is Normalization is dead, long live normalization! · The ICLR Blog Track.
