Regularization with Flux

I’m currently training an ANN with 54 inputs and 12 outputs and I have already achieved good results by using the following model:

model = Chain(Dense(54,54,sigmoid), Dense(54,54,sigmoid), Dense(54,12,leakyrelu))

However, I’m trying to apply regularization in order to improve my results further. I’m currently using the mse loss function. I tried to implement regularization by doing:

opt = Optimiser(WeightDecay(lambda), ADAGrad())

I set lambda=1, but I’m not getting better results. Any ideas on how I could implement regularization?

Here’s my code for the ANN training:

using Flux

# lambda is taken as a Real so that fractional weight-decay values (e.g. 1e-4) can be tried
function flux_training(x_train::Array{Float64,2}, y_train::Array{Float64,2}, n_epochs::Int, lambda::Real)
    model = Chain(Dense(54, 54, sigmoid), Dense(54, 54, sigmoid), Dense(54, 12, leakyrelu))
    loss(x, y) = Flux.mse(model(x), y)
    ps = params(model)
    # Samples are stored as rows, so transpose to the (features, samples) layout Flux expects
    dataset = Flux.Data.DataLoader(x_train', y_train', batchsize = 32, shuffle = true)
    # WeightDecay adds an L2 penalty (scaled by lambda) to the gradient before the ADAGrad update
    opt = Optimiser(WeightDecay(lambda), ADAGrad())
    evalcb() = @show(loss(x_train', y_train'))
    for epoch in 1:n_epochs
        println("Epoch $epoch")
        time = @elapsed Flux.train!(loss, ps, dataset, opt, cb = throttle(evalcb, 3))
    end

    y_hat = model(x_train')'

    return y_hat, model
end

There is no guarantee that a regularizer like weight decay will improve your results. First of all, it probably needs some tuning of lambda. Second, your network is already quite small, so it may not suffer from overfitting but perhaps from underfitting, in which case the regularizer might make things worse.
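If you do want to keep weight decay, lambda usually needs to be much smaller than 1 and tuned on held-out data. Here is a minimal, hypothetical sketch of such a sweep, assuming the flux_training function above (with lambda accepted as a Real), data in x and y with samples as rows, and an arbitrary 80/20 split, epoch count and lambda grid:

# Hypothetical lambda sweep; `x`, `y`, the split, the epoch count and the
# lambda grid are all placeholder choices for illustration.
using Flux
using Random

Random.seed!(1)
n = size(x, 1)
idx = shuffle(1:n)
ntrain = round(Int, 0.8n)
tr, va = idx[1:ntrain], idx[ntrain+1:end]

for lambda in (0.0, 1e-4, 1e-3, 1e-2)
    _, m = flux_training(x[tr, :], y[tr, :], 100, lambda)
    val_mse = Flux.mse(m(x[va, :]'), y[va, :]')
    println("lambda = $lambda  validation MSE = $val_mse")
end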

How much data do you have? Have you tried dropout? It’s usually a better bet than L2 weight decay in my experience. Does performance go up or down if you increase the capacity/size of the model?
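For reference, dropout in Flux is just an extra layer in the Chain. Here is a minimal sketch of your architecture with dropout added (the 0.2 rate is an arbitrary starting point, not a recommendation):

using Flux

# Same architecture as before, with Dropout between the hidden layers.
# Dropout is active during training and should be disabled for evaluation,
# e.g. with testmode!(model) (recent Flux versions handle this automatically
# outside of gradient computation).
model = Chain(
    Dense(54, 54, sigmoid),
    Dropout(0.2),        # randomly zeroes 20% of activations each call
    Dense(54, 54, sigmoid),
    Dropout(0.2),
    Dense(54, 12, leakyrelu),
)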


Thanks for the answer! My training set has around 57,000 samples. Indeed, my network is small, so I guess I won’t need regularization after all (I tried different values for lambda but couldn’t improve my results). I calculated some metrics on the out-of-sample results and got, in the worst case, a MAPE of 1.4%, so I think the network is already a good fit. I didn’t know that regularization could make the results worse, so thanks for the explanation!

Weight decay adds a term to the cost function saying that the weights should be small in the L2 sense, and this can be at odds with having weights that fit the data well. In some situations there is reason to believe that small model parameters are better than large ones, but you can easily imagine that if you let lambda go to infinity, your weights will go to zero, and zero weights do not give you a good model.
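Written out explicitly, the penalised loss is just the data-fit term plus lambda times the squared L2 norm of the parameters. A sketch of that penalty form in Flux (roughly what WeightDecay does, although an adaptive optimiser like ADAGrad will not treat the two forms identically):

using Flux

model = Chain(Dense(54, 54, sigmoid), Dense(54, 54, sigmoid), Dense(54, 12, leakyrelu))
lambda = 1e-4   # illustrative value; needs tuning

# Squared L2 norm of every trainable parameter.
penalty() = sum(p -> sum(abs2, p), params(model))

# As lambda grows, the penalty dominates and drives the weights towards zero.
loss(x, y) = Flux.mse(model(x), y) + lambda * penalty()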

57k training samples sounds like a good amount. I would try to make the model larger until the validation error goes up; then you might have found a sweet spot without getting too complicated.
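A rough, hypothetical sketch of that search, reusing the x, y, tr and va placeholders from the earlier lambda sweep (the widths and epoch count are arbitrary):

using Flux

for width in (54, 108, 216, 432)
    model = Chain(Dense(54, width, sigmoid), Dense(width, width, sigmoid), Dense(width, 12, leakyrelu))
    loss(xb, yb) = Flux.mse(model(xb), yb)
    data = Flux.Data.DataLoader(x[tr, :]', y[tr, :]', batchsize = 32, shuffle = true)
    opt = ADAGrad()
    for epoch in 1:100
        Flux.train!(loss, params(model), data, opt)
    end
    val_mse = Flux.mse(model(x[va, :]'), y[va, :]')
    println("width = $width  validation MSE = $val_mse")
end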