Request to upgrade to LossFunctions.jl

Any suggestions on how to handle deriv and deriv2 in this approach, @aplavin? Currently we can perform these aggregations with the first and second derivatives as well.

The design space here is a bit difficult to navigate, but I think we are converging on something that combines idiomatic Julia syntax, good performance, and a clean implementation.

I think you’re losing sight of the actual use case for these loss functions. Within the context of MLJ, loss functions (called “measures” in MLJ) are used for evaluating machine learning models, with a typical use case looking like this:

evaluate!(
    mach,
    resampling = CV(nfolds=5),
    measure = [rms, mae]
)

With the approach you’re advocating, I would have to write that like this:

evaluate!(
    mach,
    resampling = CV(nfolds=5),
    measure = [
        (ŷ, y) -> sqrt(mean(L2DistLoss(), ŷ, y)),
        (ŷ, y) -> mean(L1DistLoss(), ŷ, y)
    ]
)

If we take your approach to its logical conclusion, no one would ever be allowed to write a function that takes an AbstractArray argument (unless that AbstractArray represents a mathematical vector, matrix, or tensor). But that’s a bit pedantic—factoring out commonly used expressions into separate functions is a normal part of programming.
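For example, the wrapper only needs to be written once. Here is a sketch using the mean(loss, ŷ, y) syntax from the snippet above; the names my_rms and my_mae are made up for illustration:

using Statistics: mean
using LossFunctions

# Named measures that wrap the loss-plus-aggregation expression once.
my_rms(ŷ, y) = sqrt(mean(L2DistLoss(), ŷ, y))
my_mae(ŷ, y) = mean(L1DistLoss(), ŷ, y)

# The MLJ call then stays readable:
# evaluate!(mach, resampling = CV(nfolds=5), measure = [my_rms, my_mae])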

All that being said, my feeling is that these metric functions are so trivial that there’s not much to be gained by attempting to get Flux and MLJ to use the same loss function implementation. (Is that the implicit goal of LossFunctions.jl?)

You are quoting me on implicit broadcasting methods, but your example is on the aggregation. These two are effectively independent. Maybe the latter should stay, idk, but the former should really go away.

I agree with you @CameronBieganek that there is still value in considering the vector-based methods. We are really just brainstorming here to find the best compromise; nothing is decided yet.

It would be extremely helpful for the community to maintain a common set of loss functions in a single repository shared among different ML frameworks. It is an ideal scenario, of course, but I think we should put energy in that direction.

Regarding aggregation: I only compared this to value(L1DistLoss(), ŷ, y, AggMode.Mean()). Sure, you can pack this into MAE(), either way.

Let me summarize the status quo so that we can organize the discussion moving forward…

Currently, the loss functions in LossFunctions.jl are represented as structs, which is useful for holding state and hyperparameters when they exist. Given a loss object, we can do three things with it:

  • value: the actual value loss(yhat, y)
  • deriv: the first derivative
  • deriv2: the second derivative

These derivatives are written by hand, and sometimes they are not defined everywhere (so autodiff alone doesn’t help). LossFunctions.jl thus has the great feature of collecting non-trivial derivatives alongside the value of each function.
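For a single observation this looks roughly like the following (a minimal sketch; the (ŷ, y) argument order follows the loss(yhat, y) convention used above, and the exact sign conventions are documented in LossFunctions.jl):

using LossFunctions

loss = L2DistLoss()      # squared distance loss

value(loss, 3.0, 2.5)    # ≈ 0.25, the loss value itself
deriv(loss, 3.0, 2.5)    # first derivative, written by hand
deriv2(loss, 3.0, 2.5)   # second derivative, written by hand

# For a piecewise loss such as L1DistLoss, deriv returns a hand-picked
# subgradient at the kink instead of relying on autodiff:
deriv(L1DistLoss(), 2.5, 2.5)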

Now for the design questions, regarding iterables of observations yhat and y, weights w, and FUN in [value, deriv, deriv2]:

  1. Should we support FUN(loss, yhat, y, [w]) and default to mean aggregation? Is there any way to specify other types of aggregation in this syntax without dispatching on AggMode types?
  2. Should we adopt the alternative syntax mean(loss, yhat, y, [w]) instead? What about the deriv and deriv2 cases?

It is not clear to me yet which path is the most productive.
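To make the two options concrete, the call sites could look roughly like this (a sketch of the proposals, not of any released API; the deriv spelling at the end is purely hypothetical):

using Statistics: mean
using LossFunctions

ŷ, y, w = [1.0, 2.0, 3.0], [1.5, 2.0, 2.0], [0.2, 0.3, 0.5]

# Option 1: FUN(loss, ŷ, y, [w]) defaults to mean aggregation; other
# aggregations would still need something like the AggMode types:
value(L1DistLoss(), ŷ, y)                  # mean by default (proposal)
value(L1DistLoss(), ŷ, y, AggMode.Sum())   # explicit aggregation mode

# Option 2: reuse the Statistics verbs with the loss as first argument:
mean(L1DistLoss(), ŷ, y)
sum(L1DistLoss(), ŷ, y, w)                 # weighted sum
mean(deriv, L1DistLoss(), ŷ, y)            # hypothetical spelling for deriv/deriv2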

Regarding the case where loss(yhat, y, [w]) returns a vector of values, I agree with @aplavin that we should use broadcast instead. Whichever syntax we choose, we should avoid implicit broadcasting and let Julia handle the “vectorization” for us.
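Concretely, the per-observation vector can be obtained with ordinary broadcasting (a sketch; it assumes the scalar value method and the callable-struct interface mentioned further down):

using LossFunctions

loss = L1DistLoss()
ŷ, y = [1.0, 2.0, 3.0], [1.5, 2.0, 2.0]

value.(Ref(loss), ŷ, y)   # explicit broadcast of the scalar method: [0.5, 0.0, 1.0]
loss.(ŷ, y)               # same, via the functor interface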

The PR is shaping up nicely. I fixed the functor interface for scalar arguments. Now we just need to brainstorm the aggregation API a bit further. It seems to me that the necessity to model deriv and deriv2 will end up forcing us to go with AggMode types, but I may be wrong.

I’m not sure I understand the utility of deriv. In the context of Flux, wouldn’t that be handled automatically by Zygote (or other AD)? In the context of MLJ.evaluate!, I don’t think we need deriv.

Some loss functions are defined piecewise and are not differentiable everywhere. LossFunctions.jl writes the derivatives manually, but people are free to use autodiff on top of value(loss, yhat, y) when that is possible.

AFAIK, the ChainRules package is the go-to way to define custom derivatives. It’s supported by lots of packages, and utilized by multiple AD systems.
Why not define LossFunctions derivatives through ChainRules? Then there’s no need for separate deriv – neither for developers, nor for users to learn.
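As a rough illustration (a hypothetical standalone hinge-style loss, not LossFunctions.jl’s actual implementation), the hand-written derivative would live in an rrule that any ChainRules-aware AD can pick up:

using ChainRulesCore

# Hypothetical piecewise loss: max(0, 1 - y*ŷ), not differentiable at the kink.
myhinge(ŷ, y) = max(zero(ŷ), one(ŷ) - y * ŷ)

function ChainRulesCore.rrule(::typeof(myhinge), ŷ, y)
    val = myhinge(ŷ, y)
    function myhinge_pullback(Δ)
        # Hand-picked subgradient: nonzero only where the margin is violated.
        dŷ = val > 0 ? -y * Δ : zero(Δ)
        dy = val > 0 ? -ŷ * Δ : zero(Δ)
        return NoTangent(), dŷ, dy
    end
    return val, myhinge_pullback
end

# A ChainRules-aware AD (e.g. Zygote) then uses this rule instead of
# differentiating through the max.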

I will take a look at ChainRules.jl. Maybe the best plan forward is to settle the API for the value function with sum and mean, as in StatsBase.jl, and leave the refactoring of deriv and deriv2 for a second PR with ChainRules.jl.

I’ve updated the PR to adopt the simple sum and mean functions as opposed to the AggMode types. All tests are passing. We just need to update the docs and add tests for iterables that are not AbstractVector.
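For anyone following along, the change at the call site is roughly this (a sketch of the adopted syntax):

using Statistics: mean
using LossFunctions

ŷ, y = [1.0, 2.0, 3.0], [1.5, 2.0, 2.0]

value(L1DistLoss(), ŷ, y, AggMode.Mean())   # before: AggMode-based aggregation
mean(L1DistLoss(), ŷ, y)                    # after this PR: plain Statistics verbs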

The PR has been merged now with the AggMode module gone :pray:

Next, I would like to remove the value function in favor of always using the functor interface.

Done. If anyone wants to take a look at the master branch of LossFunctions.jl: I think we are ready for another breaking release, with AggMode and the value function removed. Anything else we could break right away before the next release?
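In other words (a sketch of the breaking change for anyone updating code), a scalar evaluation now goes through the loss object directly:

using LossFunctions

L2DistLoss()(3.0, 2.5)   # callable-struct (functor) syntax, replacing value(L2DistLoss(), 3.0, 2.5)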

I will leave the work on deriv and deriv2 for a future opportunity. I am not familiar with the ChainRules.jl ecosystem, so someone else could help there.

@aplavin do you have an example of how we could replace deriv and deriv2 with the ChainRules.jl approach? I could then copy/paste the changes throughout the package.

I’ve released LossFunctions.jl v0.10 with the latest breaking changes.
