I am working on implementing the MoMo method for adaptive learning rates in momentum optimizers, and it requires an evaluation of the objective function within the update rule.
Until now I have been implementing my own optimization rules by creating a struct for the specific optimizer I’m working on, then defining a method of “Optimisers.apply!” together with a method of “Optimisers.init” for my custom optimizer, as described in the “Optimisers.jl” docs. “Optimisers.apply!” takes the optimizer, state, parameters, and gradient as arguments, and “Optimisers.init” takes the optimizer and the parameters. Until now this worked perfectly, because that was all I needed.
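For concreteness, the pattern I have been following looks roughly like this (the rule name `SignDescent` and its update formula are made up for illustration; only the `apply!`/`init` interface is what matters):

```julia
using Optimisers

# An illustrative custom rule: step in the direction of sign(gradient),
# with a step size that decays over iterations.
struct SignDescent <: Optimisers.AbstractRule
    eta::Float64
end

# The state for this rule is just a step counter.
Optimisers.init(o::SignDescent, x::AbstractArray) = 1

# apply! receives (rule, state, parameters, gradient) and returns
# (new_state, change); the caller subtracts `change` from the parameters.
function Optimisers.apply!(o::SignDescent, state, x, dx)
    step = state
    change = (o.eta / step) .* sign.(dx)
    return step + 1, change
end
```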
But the MoMo algorithm does require the evaluation of the objective function to complete the update. Is it possible to do that within the “Optimisers.apply!” and “Optimisers.init” functions, and, if not, is there another way to implement this algorithm using Flux?
I don’t know about the MoMo algorithm, but I can confirm that by default the update step sees the state and the gradient, but not the loss.
One quick way to pass the loss would be to have the rule contain a Ref(0f0), which you write into before calling update!. I haven’t tried, but something like this:
using Flux
import Optimisers

struct DecayDescent <: Optimisers.AbstractRule
    eta::Float64
    loss::Base.RefValue{Float64}
end

rule = DecayDescent(0.1, Ref(NaN))
opt_state = Flux.setup(rule, model)  # every Leaf should see the same Ref

obj, grads = Flux.withgradient(m -> lossfun(m(x), y), model)
rule.loss[] = obj  # should change what's seen by each `apply!` in here:
Flux.update!(opt_state, model, grads[1])
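For completeness, here is a self-contained sketch of what the matching `apply!` could look like. The rule name `LossAwareDescent` and the way the loss enters the update are invented purely to show the `Ref` being read inside `apply!`; MoMo’s actual update rule would use the value differently:

```julia
using Optimisers

struct LossAwareDescent <: Optimisers.AbstractRule
    eta::Float64
    loss::Base.RefValue{Float64}
end

# This rule keeps no per-parameter state.
Optimisers.init(o::LossAwareDescent, x::AbstractArray) = nothing

function Optimisers.apply!(o::LossAwareDescent, state, x, dx)
    f = o.loss[]                # objective value written in before update!
    change = (o.eta * f) .* dx  # illustrative use of the loss, not MoMo
    return state, change
end
```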
No, the Optimisers.apply! function does not itself evaluate the objective function; you must evaluate the objective and compute its gradients before calling Optimisers.apply!. The purpose of apply! is to update the model’s parameters from the gradients under the chosen optimization rule, not to calculate the objective’s value or gradient. The usual pattern is to call Flux.withgradient, which returns the objective value and its gradients together, and then pass the gradients on to the update step.