Request to upgrade to LossFunctions.jl

Dear ML community, we released LossFunctions.jl v0.9 with a few important breaking changes:

  • Reversed the order of arguments to match other ecosystems; loss(yhat, y) is now the convention.
  • Removed the ObsDim business to support a more general interface with iterables of observations.
  • Removed OrdinalMarginLoss to support a more general interface with CategoricalArrays.jl.
  • Dropped unnecessary dependencies, the only dependency is CategoricalArrays.jl now.
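For anyone upgrading, the new call convention can be sketched in plain Julia; `l2` below is an illustrative stand-in, not the package's actual `L2DistLoss` implementation:

```julia
# Illustrative loss written with the new argument order: prediction
# first, target second (the pre-0.9 API put the target first).
# This is a stand-in, not the package's actual L2DistLoss code.
l2(yhat, y) = abs2(yhat - y)

yhat, y = 0.5, 1.0
l2(yhat, y)  # 0.25
```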

These changes will allow us to sync the widely tested loss functions in LossFunctions.jl with the loss functions defined in other ecosystems such as Flux.jl. I would like to invite every ML contributor in Julia to move their own loss function implementations to LossFunctions.jl and join efforts maintaining the package.

Our next goal is to formalize the support for general datasets with known observation dimension (e.g. Tables.jl). Previously datasets were assumed to be n-dimensional arrays, which is quite limiting and low-level.


Wonder what’s the motivation for implicit broadcasting? Julia doesn’t really do that elsewhere…


Can you elaborate?

From the docs:

julia> value(L2DistLoss(), 1.0, 0.5)
0.25

julia> value(L2DistLoss(), true_targets, pred_outputs)
3-element Array{Float64,1}:

Following Julian approach, the latter should be value.(...), with a dot.


In this case we want to preserve the vectorized version in order to optimize aggregation methods. Check the AggMode examples where the result is usually summed up.

Optimizations can dispatch on different types of dataset to aggregate the scalar version more efficiently.

Aggregation is a different story, I was only talking about the value method that takes and returns arrays. Julia specifically avoids defining “implicitly broadcasted” methods, as there is no reason to do this.
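To make the distinction concrete with a plain scalar function (the loss here is illustrative, not the package's code): a loss defined only for scalars already works on arrays through Julia's dot syntax, so no separate array-valued method needs to exist.

```julia
# A loss defined only for scalars; Julia's dot syntax broadcasts it over
# arrays, so no separate array-valued method is required.
l1(yhat, y) = abs(yhat - y)

l1(1.0, 2.0)                 # scalar call: 1.0
l1.([1, 2, 3], [2, 5, -2])   # broadcasted elementwise: [1, 3, 5]
```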

Aggregation is completely orthogonal to that. Although, instead of

value(L1DistLoss(), [1,2,3], [2,5,-2], AggMode.Sum())

one of the following would be cleaner and not require learning new symbols/objects (AggMode, Sum):

# minimal change from yours:
value(L1DistLoss(), [1,2,3], [2,5,-2], sum)
# like Julia Base map(f, [...], [...]), but aggregate:
sum(L1DistLoss(), [1,2,3], [2,5,-2])
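For concreteness, here is a standalone sketch of how the sum form could reduce without allocating; the names are made up and this is not the package's API:

```julia
# Sketch of the proposed sum-style aggregation, written against a plain
# function `loss`; a generator feeds the reduction one term at a time,
# so no intermediate vector of losses is allocated.
agg_sum(loss, yhat, y) = sum(loss(ŷi, yi) for (ŷi, yi) in zip(yhat, y))

l1(yhat, y) = abs(yhat - y)
agg_sum(l1, [1, 2, 3], [2, 5, -2])  # 1 + 3 + 5 = 9
```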

I like the proposed alternatives and we could consider them in the next breaking release. The only thing I don’t see yet is how weights would be incorporated in these summations. It is easy when the losses are summed up without weights, as we can simply rely on sum and mean from Base.

You can follow StatsBase and do

mean(L1DistLoss(), [1,2,3], [2,5,-2], weights([1, 2, 1]))

Also: see the already existing dims argument of sum/mean instead of introducing ObsDim.


I think you missed the point of this release. We already removed ObsDim, and are continuously updating the code to look more Julian:

We will consider the StatsBase.jl approach for weights in our next brainstorming phase.

Nice, so that must be a docs issue? I’ve been looking at the “latest” docs page at Efficient Sum and Mean · LossFunctions.jl; since the release already happened, the change should be there, right?


Yes, probably a docs issue. We updated the docs but the build scripts did not deploy it apparently :confused:

Most machine learning metrics do both a mapping and a reduction in order to calculate a scalar metric. I think it’s reasonable to encapsulate that in one function with a signature like mymetric(ŷ, y), where ŷ and y are iterables. And in fact that is what both MLJ and Flux do:

Personally, I don’t really see a need for a value function. It seems like regular old explicit functions like mae(ŷ, y) are good enough and a lot easier to read.


We are refactoring the package as a whole, it is likely that the next release will get rid of the AggMode submodule and will use a more Julian approach.


A big advantage of the AggMode currently implemented in LossFunctions.jl is that it doesn’t allocate intermediate arrays. We can dispatch on specific aggregation methods and reduce the terms of the aggregation without broadcasting a big vector of losses to be summed up later. Am I missing something?
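To make the allocation point concrete in plain Julia (the loss here is illustrative):

```julia
l1(yhat, y) = abs(yhat - y)
yhat, y = rand(1000), rand(1000)

# Broadcasting materializes a full vector of losses before reducing it:
total_alloc = sum(l1.(yhat, y))

# A generator feeds the reduction one term at a time, so no intermediate
# vector of losses is ever allocated:
total_lazy = sum(l1(ŷi, yi) for (ŷi, yi) in zip(yhat, y))

total_alloc ≈ total_lazy  # same value, up to floating-point summation order
```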

Started cleaning up AggMode.None in this PR:

Thanks for pointing these out @CameronBieganek. The main difference I see between MLJ and Flux is that MLJ decided to place weight vectors as the third argument, loss(ŷ, y, w), whereas Flux decided to place the weights in the aggregation function, loss(ŷ, y, agg=x->mean(w .* x)). Moreover, MLJ defaults to no aggregation whereas Flux defaults to mean.

In terms of performance, the MLJ approach gives more opportunity to avoid memory allocations. For example, if the aggregation is sum we can simply

sum(wi * loss(ŷi, yi) for (ŷi, yi, wi) in zip(ŷ, y, w))

without ever allocating intermediate arrays.

I am tempted to implement the MLJ approach, but am open to more input before proceeding.


I am starting to consider that maybe the AggMode.Sum, AggMode.Mean and AggMode.WeightedSum should be preserved as types in order to dispatch more efficient implementations for the different types of aggregations. Would be happy to be convinced otherwise.
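The kind of dispatch I have in mind can be sketched without the package; the type and function names below are made up, not the actual AggMode API:

```julia
# Sketch of dispatching on aggregation-mode types, mimicking the idea of
# AggMode.Sum / AggMode.Mean without depending on LossFunctions.jl itself.
abstract type AggregationMode end
struct SumMode <: AggregationMode end
struct MeanMode <: AggregationMode end

aggregate(loss, yhat, y, ::SumMode) =
    sum(loss(ŷi, yi) for (ŷi, yi) in zip(yhat, y))
aggregate(loss, yhat, y, ::MeanMode) =
    aggregate(loss, yhat, y, SumMode()) / length(y)

l1(yhat, y) = abs(yhat - y)
aggregate(l1, [1, 2, 3], [2, 5, -2], SumMode())   # 9
aggregate(l1, [1, 2, 3], [2, 5, -2], MeanMode())  # 3.0
```

Each mode gets its own specialized method, so an efficient implementation can be chosen per aggregation without branching at runtime.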

Well, many MLJ metrics use aggregation, but some do not. They have a trait reports_each_observation that specifies whether or not aggregation is used.


Thanks for clarifying. We will certainly not follow this pattern. We want the same default behavior everywhere, with extra options to change it when necessary.

I think the “implicit broadcasting” methods (like L1DistLoss()([1,2,3], [2,5,-2])) should just be removed, and users directed to regular Julia broadcasting.
This change doesn’t affect aggregations at all!

Regarding aggregations:

Big advantage with respect to what?
Surely you can dispatch stuff like

sum(L1DistLoss(), [1,2,3], [2,5,-2])
mean(L1DistLoss(), [1,2,3], [2,5,-2], weights([1, 2, 1]))

to exactly the same kind of code as the current value(L1DistLoss(), [1,2,3], [2,5,-2], AggMode.Sum()) does, with exactly the same performance.


I think we can assume that these are the only two useful aggregation functions, i.e. sum and mean with optional weights, and then get rid of all aggregation types in LossFunctions.jl. Thanks for the suggestion.
