Learning rate scheduler with the new interface of Flux

Hi,

I haven’t been actively using Flux.jl for a while, and I’ve found that its interface has changed a bit.

Here are the Flux docs on “Scheduling Optimisers”.
It seems that using ParameterSchedulers.jl is now recommended.
However, I cannot find a way to set a lower bound on the gradient norm in ParameterSchedulers, while the previous version of Flux provided such functionality by default.

What would be the best practice for this?

I’m not quite sure I understand the question. Are you asking for a way to schedule the δ parameter of Optimisers.ClipGrad (which is equivalent to the legacy Flux.Optimise.ClipValue)? If so, use adjust! as shown here.
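
For reference, a minimal sketch of that pattern (the toy model and the δ values below are made up purely for illustration):

using Flux, Optimisers

model = Dense(2 => 1)   # toy model, just for illustration
opt_state = Flux.setup(OptimiserChain(ClipGrad(10.0), Adam(1e-3)), model)

# Later, on whatever schedule you like, tighten the clipping threshold δ:
Optimisers.adjust!(opt_state, delta = 1.0)   # mutates the `delta` field of ClipGrad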

@ToucheSir Sorry for the confusion.

I have switched from the implicit to the explicit interface (as described in Optimisation Rules · Flux).

So the optimiser state is now set up using setup.
For example,

opt_state = Flux.setup(optimiser, network)

Here, optimiser is Adam(lr) for the given learning rate lr.

Now, I’d like to use a learning rate scheduler with this optimiser.
For example, I could write code as described here:

optimiser = Flux.Optimiser(ExpDecay(lr, 0.90, 1000, 1e-5), Adam())

But setup does not seem to be compatible with optimisers composed this way. The error message is:

ERROR: Flux.setup does not know how to translate this old-style implicit rule to a new-style Optimisers.jl explicit rule

I’d like to replace optimiser = Flux.Optimiser(ExpDecay(lr, 0.90, 1000, 1e-5), Adam()) with something compatible with the explicit interface.

As mentioned at the top of that Optimisation Rules page, you should be looking at the Optimisers.jl docs rather than that page if you’re working with explicit params. The adjust! docs I linked above (https://fluxml.ai/Flux.jl/stable/training/reference/#Optimisers.adjust!) should be enough to show how to do this.
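
For the learning rate itself, the pattern boils down to something like this (toy model and arbitrary rates, purely for illustration):

using Flux

model = Dense(2 => 1)                      # toy model
opt_state = Flux.setup(Adam(1e-3), model)  # explicit-style optimiser state

# Mutate the learning rate `eta` stored inside the existing optimiser state:
Flux.adjust!(opt_state, 1e-4)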

Since this topic has come up again in a different thread, I took the liberty of writing out what this would look like. The following combines the Flux quick start with the example from the ParameterSchedulers.jl docs.

using Flux, Optimisers, ParameterSchedulers


noisy = rand(Float32, 2, 1000)                                    # 2×1000 Matrix{Float32}
truth = [xor(col[1]>0.5, col[2]>0.5) for col in eachcol(noisy)]   # 1000-element Vector{Bool}

model = Chain(
    Dense(2 => 3, tanh),   # activation function inside layer
    BatchNorm(3),
    Dense(3 => 2),
    softmax)

target = Flux.onehotbatch(truth, [true, false])                   # 2×1000 OneHotMatrix
loader = Flux.DataLoader((noisy, target), batchsize=64, shuffle=true);   # mini-batches of 64

const lr = 0.01
optim = Flux.setup(Flux.Adam(lr), model)  # setup optimizer as usual
sched = Stateful(Step(lr, 0.9, 100))      # schedule of your choice; here the rate is multiplied by 0.9 every 100 steps

for epoch in 1:1_000
    for (x, y) in loader
        loss, grads = Flux.withgradient(model) do m
            y_hat = m(x)
            Flux.crossentropy(y_hat, y)
        end
        Flux.update!(optim, model, grads[1])

        # NEW
        nextlr = ParameterSchedulers.next!(sched) # advance schedule
        Optimisers.adjust!(optim, nextlr) # update optimizer state, by default this changes the learning rate `eta`
    end
end
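
To also get the lower bound that the last argument of ExpDecay (the 1e-5 above) used to impose on the learning rate, one option, my own suggestion rather than anything built into ParameterSchedulers.jl, is to clamp the scheduled value before applying it:

nextlr = max(ParameterSchedulers.next!(sched), 1e-5)   # never let the rate decay below 1e-5
Optimisers.adjust!(optim, nextlr)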