Is there a function in Flux to estimate the best learning rate for a good gradient descent before training a neural network?
No.
Or yes, if you are willing to accept the defaults of the optimizers.
(Which TBH I normally am)
In general this is not possible; sorting this out is the job of hyperparameter optimization.
@oxinabox It’s inaccurate to state “in general this is not possible”; of course it is!
There is a fast.ai package (Understanding Learning Rates and How It Improves Performance in Deep Learning | by Hafidz Zulkifli | Towards Data Science), developed by Jeremy Howard, that does exactly this, so I believe it would be in the best interest of Flux users to have such functionality in the Flux library.
I'm hoping the Flux developers are listening!
That is not what I would call estimating the learning rate before training the network.
That is changing the learning rate during training.
i.e., learning rate scheduling (and smarter variants thereof, maybe)
Which is a different thing.
I assumed you were asking about determining the optimal (initial) learning rate.
Anyway, I am pretty sure Flux doesn’t have that yet,
but you can implement it by hand without too much trouble.
Like in this example
Anyway, I agree Flux should have convenience helpers for this.
Particularly for more complicated variants.
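For the "pick an initial learning rate" reading of the question, one hand-rolled option is a learning-rate range test: take one gradient step per learning rate while growing the rate geometrically, record the loss, and stop once the loss blows up. The following is only a sketch in plain Julia on a toy quadratic; `lr_range_test`, `toy_loss`, and `toy_grad` are hypothetical names made up for illustration, not Flux functions, and the stopping threshold of 4x the best loss is an arbitrary assumption.

```julia
# Hypothetical helper (not part of Flux): a learning-rate range test.
# `loss` and `grad` are plain functions of the parameter vector `w`.
function lr_range_test(loss, grad, w0; lr_min=1e-5, lr_max=10.0, steps=50)
    w = copy(w0)
    # Multiplicative factor so the rate grows from lr_min to lr_max in `steps` steps.
    factor = (lr_max / lr_min)^(1 / (steps - 1))
    lrs, losses = Float64[], Float64[]
    lr = lr_min
    for _ in 1:steps
        w .-= lr .* grad(w)          # one plain gradient-descent step
        l = loss(w)
        push!(lrs, lr)
        push!(losses, l)
        # Assumed stopping rule: quit once the loss is non-finite or has
        # clearly diverged from the best value seen so far.
        if !isfinite(l) || l > 4 * minimum(losses)
            break
        end
        lr *= factor
    end
    return lrs, losses
end

# Toy problem: loss(w) = ||w||^2 / 2, whose gradient is w itself.
toy_loss(w) = sum(abs2, w) / 2
toy_grad(w) = w

lrs, losses = lr_range_test(toy_loss, toy_grad, [1.0, -2.0])
```

Plotting `losses` against `lrs` on a log axis and choosing a rate somewhat below the point where the loss starts to rise is a common heuristic. With a real network you would use a fresh mini-batch for each step and restore the initial weights afterwards; the toy version above skips both.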
@oxinabox I humbly stand corrected!
Thank you for the example!
Doesn’t ADAM automatically work out the learning rate adaptively? So you don’t need to specify the learning rate like in SGD.
@xiaodai You are right about Adam modifying the learning rate adaptively.
Yet, the code at this URL: Function in Flux to estimate learning rate - #5 by samq
uses Adam as the optimizer, but changes the learning rate anyway here:
    # If we haven't seen improvement in 5 epochs, drop our learning rate:
    if epoch_idx - last_improvement >= 5 && opt.eta > 1e-6
        opt.eta /= 10.0
        @warn(" -> Haven't improved in a while, dropping learning rate to $(opt.eta)!")
        # After dropping learning rate, give it a few epochs to improve
        last_improvement = epoch_idx
    end
The reasoning behind doing this can be found at this URL: neural network - Should we do learning rate decay for adam optimizer - Stack Overflow
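The plateau rule in the snippet above can be pulled out of the training loop into a small piece of state, which makes it easy to try on a list of losses without training anything. This is a rough sketch in plain Julia: `PlateauDecay` and `plateau_step!` are hypothetical names, not Flux API, and the defaults (divide by 10, patience of 5 epochs, floor at 1e-6) are simply copied from the quoted code.

```julia
# Hypothetical helper (not part of Flux): drop the learning rate when the
# monitored loss has not improved for `patience` epochs.
mutable struct PlateauDecay
    eta::Float64            # current learning rate
    factor::Float64         # divide eta by this on a plateau
    patience::Int           # epochs to wait without improvement
    min_eta::Float64        # stop decaying once eta is at or below this
    best::Float64           # best loss seen so far
    last_improvement::Int   # epoch of last improvement (or last drop)
end

PlateauDecay(eta; factor=10.0, patience=5, min_eta=1e-6) =
    PlateauDecay(eta, factor, patience, min_eta, Inf, 0)

function plateau_step!(s::PlateauDecay, epoch::Int, loss::Real)
    if loss < s.best
        s.best = loss
        s.last_improvement = epoch
    elseif epoch - s.last_improvement >= s.patience && s.eta > s.min_eta
        s.eta /= s.factor
        # After dropping the rate, give it a few epochs to improve.
        s.last_improvement = epoch
    end
    return s.eta
end

# Made-up validation losses: improvement stops after epoch 2.
sched = PlateauDecay(1e-3)
val_losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8]
etas = [plateau_step!(sched, epoch, l) for (epoch, l) in enumerate(val_losses)]
```

Here the rate stays at 1e-3 through epoch 6 and is divided by 10 at epoch 7, five epochs after the last improvement. In an actual training loop the returned value would be copied into the optimizer each epoch (in the quoted code that is the `opt.eta` field); how to do that depends on the Flux version in use.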
My understanding is that this is less important for ADAM, but some papers have shown that it does still help.