I am interested in training a Flux neural network under a non-convex loss function. Specifically, the loss function L_P(x, y) = (|x - y|)^P for some 0 < P < 1, which is implemented in LossFunctions.jl as LPDistLoss.
Is it possible to use this non-convex loss function in Flux? I have attempted to implement L_P(\cdot, \cdot) by adapting the source code for Flux.mae as follows:
```julia
# ofeltype() and _check_sizes() are internal functions from Flux
function LP(ŷ, y; agg = mean, P = ofeltype(ŷ, 0.9))
    _check_sizes(ŷ, y)
    agg(abs.(ŷ .- y) .^ P)
end
```
However, even with P = 0.9, which is only slightly different from the well-behaved absolute-error loss (P = 1), some parameters of the neural network become NaN during training.
Is this a problem inherent to this loss function, or is there something wrong with my implementation? Thanks in advance for any comments or suggestions.
@ToucheSir Is there an existing working example of training a model with this loss that you can compare against?
I am not aware of any such examples. I am interested in this loss function because it tends towards the 0-1 loss as P \to 0. There are many well-behaved surrogates for the 0-1 loss that work in a classification setting (i.e., with discrete outputs), but I haven’t found any examples of people doing this for a continuous output, like in my application. (Do you know of any?)
@mcabbott Note that the gradient behaves quite differently near zero: it is proportional to x^{-0.1}, which diverges as x \to 0. Could that be what goes wrong?
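To make the divergence concrete, here is a minimal sketch (plain Julia, no Flux needed) of the analytic derivative of d^P for d = |ŷ - y| > 0, evaluated at progressively smaller residuals. At d = 0 exactly, the derivative is infinite, and an AD system can easily turn that into NaN in the parameter updates:

```julia
P = 0.9

# Analytic derivative of d^P with respect to d, valid for d > 0:
# d/dd (d^P) = P * d^(P - 1), and P - 1 = -0.1 < 0, so it blows up as d → 0.
grad(d) = P * d^(P - 1)

grad(1e-2)   # ≈ 1.43
grad(1e-8)   # ≈ 5.68, still growing as the residual shrinks
grad(0.0)    # Inf — a single exact fit in a batch can poison the update
```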
Thanks for your insights, I think you’ve hit the nail on the head. I tried your suggestion of adding a small positive quantity to |x - y|, and this fixed the NaN problem. Thanks!
You may be interested to know that adding a small positive quantity to |x - y| actually has a large effect on the loss function as P \to 0. In particular, it increases its minimum value (which should be zero).
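A small sketch of this effect, assuming the fix is implemented by adding a constant ε inside the power (the name `LP_eps` and the value of ε are illustrative, not from the thread). At a perfect fit ŷ == y the loss equals ε^P rather than zero, and ε^P \to 1 as P \to 0, so the shift becomes substantial for small P:

```julia
using Statistics

# Hypothetical ε-shifted version of the loss; ε is an assumed small constant.
function LP_eps(ŷ, y; agg = mean, P = 0.9, ε = 1e-6)
    agg((abs.(ŷ .- y) .+ ε) .^ P)
end

LP_eps([1.0], [1.0]; P = 0.9)   # ≈ 3.98e-6, close to zero
LP_eps([1.0], [1.0]; P = 0.1)   # ≈ 0.251 — far from zero as P → 0
```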
Note that, algorithmically speaking, the minimum value of the loss has only a minor effect on both the solution found and the behavior of optimization algorithms.
You might want to focus attention instead on a locally uniform upper bound on the gradient (which affects both the usable step sizes and the effect of a single loss term on a local minimum, under an assumption of local strong convexity). So consider using something like SCAD (presented in the literature as a penalizer, but something like it can be used as a loss), with its L1-like behavior near 0. You can always adjust the constants to get closer to 0-1 behavior, while retaining a uniformly bounded gradient norm.
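For reference, a minimal sketch of the SCAD penalty (Fan &amp; Li's piecewise form) applied to the residual d = |ŷ - y|; the constants λ and a (a > 2, with a = 3.7 the usual default) are the tuning knobs mentioned above, and `scad_loss` is an illustrative name, not an existing library function:

```julia
using Statistics

# SCAD-style loss on the residual d = |ŷ - y|.
# Gradient is λ near zero (L1-like), decays linearly, then is exactly 0
# for d > aλ — so the gradient norm is uniformly bounded by λ.
function scad(d; λ = 1.0, a = 3.7)
    if d <= λ
        λ * d                                     # L1-like region
    elseif d <= a * λ
        -(d^2 - 2a*λ*d + λ^2) / (2 * (a - 1))     # quadratic transition
    else
        (a + 1) * λ^2 / 2                         # flat (0-1-like) region
    end
end

scad_loss(ŷ, y; kw...) = mean(scad.(abs.(ŷ .- y); kw...))
```

Shrinking λ pushes the flat region toward the origin, i.e. closer to the 0-1 loss, without the gradient ever exceeding λ.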