I am interested in training a Flux neural network under a non-convex loss function. Specifically, I want to use the loss function L_P(x, y) = |x - y|^P for some 0 < P < 1, which is implemented in LossFunctions.jl as LPDistLoss.
Is it possible to use this non-convex loss function in Flux? I have attempted to implement L_P(\cdot, \cdot) by adapting the source code for Flux.mae as follows:
# ofeltype() and _check_sizes() are internal functions from Flux
function LP(ŷ, y; agg = mean, P = ofeltype(ŷ, 0.9))
    _check_sizes(ŷ, y)
    agg(abs.(ŷ .- y).^P)
end
However, even with P = 0.9, which is only slightly different to the well-behaved absolute-error loss, some parameters of the neural network become NaN during training.
Is this a problem inherent to this loss function, or is there something wrong with my implementation? Thanks in advance for any comments or suggestions.
@ToucheSir Is there an existing working example of training a model with this loss that you can compare against?
I am not aware of any such examples. I am interested in this loss function because it tends towards the 0-1 loss as P \to 0. There are many well-behaved surrogates for the 0-1 loss that work in a classification setting (i.e., with discrete outputs), but I haven’t found any examples of people doing this for a continuous output, like in my application. (Do you know of any?)
@mcabbott Note that the gradient behaves quite differently near zero: x^-0.1 diverges as x \to 0. Could that be what goes wrong?
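To make the divergence concrete, here is a minimal sketch that writes out the chain-rule derivative of |d|^P with respect to the residual d = ŷ - y by hand. The name `lp_grad` is hypothetical, and whether an automatic-differentiation backend evaluates the derivative in exactly this form is an assumption, but it shows one plausible route to NaN: at d = 0 the factor |d|^(P - 1) is Inf while sign(d) is 0, and their product is NaN.

```julia
# Hand-written derivative of |d|^P with respect to d (hypothetical helper,
# not part of Flux): P * |d|^(P - 1) * sign(d).
lp_grad(d; P = 0.9) = P * abs(d)^(P - 1) * sign(d)

lp_grad(1e-2)    # finite
lp_grad(1e-12)   # finite, but larger: the magnitude grows as d shrinks
lp_grad(0.0)     # NaN, because 0.9 * Inf * 0.0 is NaN
```

A single exactly-zero residual in a batch would then be enough to poison the aggregated gradient.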
Thanks for your insights, I think you’ve hit the nail on the head. I tried your suggestion of adding a small positive quantity to |x - y|, and this fixed the NaN problem. Thanks!
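For anyone finding this later, the fix can be sketched roughly as follows. The name `LP_shifted`, the keyword `ϵ`, and its default of 1e-6 are my own choices for illustration, not anything from Flux, and I have dropped the internal `_check_sizes` and `ofeltype` calls so that the snippet stands alone.

```julia
using Statistics: mean

# Hypothetical variant of the LP loss above. ϵ is an assumed small offset
# that keeps the base of the power strictly positive, so the derivative
# P * (|d| + ϵ)^(P - 1) * sign(d) stays finite even when d == 0.
function LP_shifted(ŷ, y; agg = mean, P = 0.9, ϵ = 1e-6)
    agg((abs.(ŷ .- y) .+ ϵ) .^ P)
end

# Hand-written derivative with respect to a single residual d.
LP_shifted_grad(d; P = 0.9, ϵ = 1e-6) = P * (abs(d) + ϵ)^(P - 1) * sign(d)

LP_shifted([1.0, 2.0], [1.0, 2.0])   # small but nonzero, equal to ϵ^P
LP_shifted_grad(0.0)                 # 0.0 rather than NaN
```

Note that the Float64 defaults for `P` and `ϵ` will promote Float32 inputs; converting them to the element type of `ŷ` would avoid that.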
You may be interested to know that adding a small positive quantity to |x - y| actually has a large effect on the loss function as P \to 0. In particular, it increases its minimum value (it should be zero):
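A short worked calculation, writing the added quantity as \epsilon (my notation) and taking \epsilon = 10^{-6} purely as an illustrative value:

```latex
\min_{x} \bigl(|x - y| + \epsilon\bigr)^P = \epsilon^P \quad \text{(attained at } x = y\text{)},
\qquad
\lim_{P \to 0} \epsilon^P = 1 .

\text{For } \epsilon = 10^{-6}: \quad
P = 0.9 \;\Rightarrow\; \epsilon^P = 10^{-5.4} \approx 4.0 \times 10^{-6},
\qquad
P = 0.1 \;\Rightarrow\; \epsilon^P = 10^{-0.6} \approx 0.25 .
```

So for P close to 1 the shift is negligible, but for small P the floor of the loss moves well away from zero.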
Note that, algorithmically speaking, the minimum value of the loss has only a minor effect on both the solution found and the behavior of the algorithms.
You might want to focus instead on a locally uniform upper bound on the gradient, which affects both the usable step sizes and the influence of a single loss term on a local minimum (under an assumption of local strong convexity). So consider using something like SCAD [1] (presented as a penalty, but something like it can be used as a loss), with its L1-like behavior near 0. You can always adjust the constants to get closer to the 0-1 loss while retaining a uniformly bounded gradient norm.
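As a rough sketch of what that could look like, here is the usual piecewise SCAD form applied to the residual instead of to a parameter. The names `scad` and `scad_loss` are hypothetical, `λ` and `a` are the standard SCAD constants (with `a > 2`; 3.7 is a commonly quoted default), and using the penalty as a loss is the adaptation suggested above, not something Flux provides.

```julia
using Statistics: mean

# SCAD applied to a scalar residual r (hypothetical helper, not part of Flux):
#   linear, λ|r|, near zero (L1-like);
#   quadratic blend for λ < |r| <= aλ;
#   constant λ²(a + 1)/2 beyond aλ.
# The slope never exceeds λ in magnitude, so the gradient is uniformly bounded.
function scad(r; λ = 1.0, a = 3.7)
    t = abs(r)
    if t <= λ
        return λ * t
    elseif t <= a * λ
        return (2a * λ * t - t^2 - λ^2) / (2 * (a - 1))
    else
        return λ^2 * (a + 1) / 2
    end
end

# Loss in the same style as Flux's built-in losses: aggregate over residuals.
scad_loss(ŷ, y; agg = mean, λ = 1.0, a = 3.7) = agg(scad.(ŷ .- y; λ = λ, a = a))

scad_loss([0.5, 10.0], [0.0, 0.0])   # mean of the linear and plateau values
```

Dividing by the plateau value λ²(a + 1)/2 and shrinking λ pushes the shape toward a 0-1 loss, at the cost of a larger (but still finite) gradient bound after rescaling.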