Any suggestion on how to handle `deriv` and `deriv2` in this approach @aplavin? Currently we can perform these aggregations with first and second derivatives as well.
The design space here is a bit difficult to navigate, but I think we are converging on something that has all the benefits of nice Julian syntax, nice performance, and a clean implementation.
I think you're losing sight of the actual use case for these loss functions. Within the context of MLJ, loss functions (called "measures" in MLJ) are used for evaluating machine learning models, with a typical use case looking like this:
```julia
evaluate!(
    mach,
    resampling = CV(nfolds=5),
    measure = [rms, mae]
)
```
With the approach you're advocating, I would have to write that like this:
```julia
evaluate!(
    mach,
    resampling = CV(nfolds=5),
    measure = [
        (ŷ, y) -> sqrt(mean(L2DistLoss(), ŷ, y)),
        (ŷ, y) -> mean(L1DistLoss(), ŷ, y)
    ]
)
```
If we take your approach to its logical conclusion, no one would ever be allowed to write a function that takes an `AbstractArray` argument (unless that `AbstractArray` represents a mathematical vector, matrix, or tensor). But that's a bit pedantic; factoring out commonly used expressions into separate functions is a normal part of programming.
All that being said, my feeling is that these metric functions are so trivial that there's not much to be gained by attempting to get Flux and MLJ to use the same loss function implementation. (Is that the implicit goal of LossFunctions.jl?)
You are quoting me on implicit broadcasting methods, but your example is about aggregation. These two are effectively independent. Maybe the latter should stay, idk, but the former should really go away.
Agree with you @CameronBieganek that there is still value in considering the vector-based methods. We are really brainstorming here to find the best compromise; nothing is decided yet.
It would be extremely helpful for the community to maintain a common set of loss functions in a single repository that is shared among different ML frameworks. It is an ideal scenario, of course, but I think we should put energy in that direction.
Regarding aggregation: I only compared this to `value(L1DistLoss(), ŷ, y, AggMode.Mean())`. Sure, you can pack this into `MAE()`, either way.
Let me summarize the status quo so that we can organize the discussion moving forward...
Currently we have loss functions in LossFunctions.jl that are represented with structs, and this is useful for holding the state and hyperparameters that sometimes exist. Given a `loss` object we can do 3 things with it:
- `value`: the actual value `loss(yhat, y)`
- `deriv`: the first derivative
- `deriv2`: the second derivative
These derivatives are written by hand, and sometimes they are not defined everywhere (autodiff doesn't help). So LossFunctions.jl has this great feature of collecting non-trivial derivatives besides the value of the function.
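For concreteness, a minimal sketch of the three operations on a single observation, assuming scalar `value`/`deriv`/`deriv2` methods with the argument order following the `loss(yhat, y)` convention used above:

```julia
using LossFunctions

loss = L2DistLoss()           # squared distance loss, (ŷ - y)^2

v  = value(loss, 0.5, 1.0)    # the loss value itself
d  = deriv(loss, 0.5, 1.0)    # hand-written first derivative
d2 = deriv2(loss, 0.5, 1.0)   # hand-written second derivative
```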
Now the design questions regarding iterables of observations `yhat` and `y`, weights `w`, and `FUN in [value, deriv, deriv2]` (both options are sketched in code after the list):
- Should we support `FUN(loss, yhat, y, [w])` and default to `mean` aggregation? Is there any way to inform other types of aggregation in this syntax without dispatching on AggMode types?
- Should we adopt the alternative syntax `mean(loss, yhat, y, [w])` instead? What about the `deriv` and `deriv2` cases?
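A rough sketch of what the two options could look like at the call site; these signatures are hypothetical and only illustrate the proposals above, not the current API:

```julia
using LossFunctions, Statistics

loss = L1DistLoss()
yhat = [0.1, 0.9, 0.3]
y    = [0.0, 1.0, 1.0]
w    = [1.0, 2.0, 1.0]

# Option 1 (hypothetical): FUN(loss, yhat, y, [w]), defaulting to mean aggregation
value(loss, yhat, y)        # mean of element-wise losses
deriv(loss, yhat, y, w)     # weighted mean of element-wise first derivatives

# Option 2 (hypothetical): reuse the Statistics verbs for aggregation
mean(loss, yhat, y)         # same aggregation, spelled with `mean`
sum(loss, yhat, y, w)       # weighted sum; unclear how `deriv`/`deriv2` fit here
```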
It is not clear to me yet which path is the most productive.
Regarding the case where `loss(yhat, y, [w])` returns a vector of values, I agree with @aplavin that we should use broadcast instead. Whichever syntax we choose, we should avoid implicit broadcasting and let Julia handle the "vectorization" for us.
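For illustration, a minimal sketch of the explicit-broadcast style, assuming the loss object is callable on scalars (the functor interface mentioned below):

```julia
using LossFunctions, Statistics

loss = L1DistLoss()
ŷ = [0.1, 0.9, 0.3]
y = [0.0, 1.0, 1.0]

per_obs = loss.(ŷ, y)        # explicit broadcast: vector of element-wise losses
agg     = mean(loss.(ŷ, y))  # aggregate only when the caller asks for it
```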
The PR is shaping up nicely. I fixed the functor interface for scalar arguments. Now we just need to brainstorm the aggregation API a bit further. It seems to me that the necessity to model `deriv` and `deriv2` will end up forcing us to go with AggMode types, but I may be wrong.
I'm not sure I understand the utility of `deriv`. In the context of Flux, wouldn't that be handled automatically by Zygote (or another AD)? In the context of `MLJ.evaluate!`, I don't think we need `deriv`.
Some loss functions are defined piecewise and are not differentiable everywhere. LossFunctions.jl writes the derivatives manually, but people are free to use autodiff on top of `value(loss, yhat, y)` when that is possible.
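To illustrate the point with plain Julia (not the LossFunctions.jl API): the hinge loss is defined piecewise and has a kink, so a hand-written derivative has to pick a subgradient there:

```julia
# Hinge loss on the "agreement" t = y * ŷ: max(0, 1 - t)
hinge(t) = max(0, 1 - t)

# Not differentiable at t = 1; a hand-written derivative picks a subgradient
hinge_deriv(t) = t < 1 ? -1.0 : 0.0
```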
AFAIK, the ChainRules package is the go-to way to define custom derivatives. It's supported by lots of packages and utilized by multiple AD systems.
Why not define the LossFunctions derivatives through ChainRules? Then there's no need for a separate `deriv`, neither for developers nor for users to learn.
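A hedged sketch of what that could look like, assuming the functor call `loss(ŷ, y)` for scalars; the rule below is an illustration I wrote for this discussion, not anything the package currently defines:

```julia
using ChainRulesCore, LossFunctions

# Hypothetical: expose the hand-written derivative of L1DistLoss to AD systems
# via an rrule, instead of a user-facing `deriv` function.
function ChainRulesCore.rrule(loss::L1DistLoss, ŷ::Real, y::Real)
    v = loss(ŷ, y)                  # |ŷ - y|
    function l1_pullback(Δ)
        ∂ŷ = Δ * sign(ŷ - y)        # hand-written (sub)derivative w.r.t. ŷ
        return (NoTangent(), ∂ŷ, -∂ŷ)
    end
    return v, l1_pullback
end
```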
I will take a look at ChainRules.jl. Maybe the best plan forward is to consider the API for the value function with `sum` and `mean`, like in StatsBase.jl, and leave the refactoring of `deriv` and `deriv2` for a second PR with ChainRules.jl.
I've updated the PR to adopt the simple `sum` and `mean` functions as opposed to the AggMode types. All tests are passing. We just need to update the docs and add tests for iterables that are not `AbstractVector`.
The PR has been merged now, with the AggMode module gone.
Next I would like to remove the `value` function in favor of always using the functor interface.
Done. If anyone wants to take a look at the master branch of LossFunctions.jl, I think we are ready for another breaking release with AggMode and the `value` function removed. Anything else we could break right away before the next release?
I will leave the work on `deriv` and `deriv2` for a future opportunity. I am not familiar with the ChainRules.jl ecosystem, and perhaps someone can help there.
@aplavin do you have an example of how we could replace `deriv` and `deriv2` with the ChainRules.jl approach? I could then copy/paste the changes throughout the package.
I've released LossFunctions.jl v0.10 with the latest breaking changes.