Yes, that’s right, that’s not very well summarized.
The main design ideas I wanted to explore are the following: I think the common thread is “recycling as much existing interface as possible before inventing more of it”.
Models (and gradients) live in an inner product space
This means one should be able to (at least) sum, scale, and take inner products between models of the same type, out of the box. Gradients, when computed, will have exactly the same structure as the model they refer to: and this makes sense, since they are objects living in the same space, which you will want to add and subtract or whatever.
(I have explored using Array-like structures from RecursiveArrayTools and ComponentArrays here, but ended up doing it in a custom way. I don’t remember exactly why, but I definitely wanted to get my hands dirty with custom broadcasting.)
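To make this concrete, here is a minimal sketch of what such a model type could look like (names like `Affine` are hypothetical, for illustration only):

```julia
using LinearAlgebra

# A hypothetical two-parameter model: a callable object with array fields.
struct Affine{A,B}
    W::A
    b::B
end

(m::Affine)(x) = m.W * x .+ m.b

# Vector space operations, applied field by field.
Base.:+(m::Affine, n::Affine) = Affine(m.W + n.W, m.b + n.b)
Base.:-(m::Affine, n::Affine) = Affine(m.W - n.W, m.b - n.b)
Base.:*(a::Number, m::Affine) = Affine(a * m.W, a * m.b)

# Inner product between two models of the same type.
LinearAlgebra.dot(m::Affine, n::Affine) = dot(m.W, n.W) + dot(m.b, n.b)
```

A gradient of an objective with respect to an `Affine` would then itself be an `Affine`, so expressions like `m - stepsize * g` work out of the box.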
In any case, this makes models and associated gradients very similar to regular arrays, and as a consequence one can use
Generic optimization algorithms
Optimization algorithms do not rely on any specific interface for models. You just give them the objective function to be minimized and the initial point: the latter can be an Array, or a more structured, callable object (which we can call a “model”) that the objective function knows how to evaluate. The optimization algorithm is agnostic with respect to the nature of the space it is exploring, and does its job without asking questions.
I haven’t fully explored this: I’d like to verify whether algorithms from e.g. Optim.jl work out of the box here. But that’s the hope, right? Anything that simply requires vector space operations should work with objects offering vector space operations. For example, L-BFGS should work, if implemented without too many restrictions.
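Just to illustrate the idea (this is not Optim.jl code, and I haven’t tested it against any package): a descent loop written purely against vector space operations runs unchanged on an Array or on a structured model like the `Affine` above, given a `grad` oracle that returns an object of the same type as its input.

```julia
# Only uses `-` and scalar `*`, so it is agnostic to the nature of `x`.
function descend(f, grad, x0; stepsize = 0.1, maxiter = 100)
    x = x0
    for _ in 1:maxiter
        x = x - stepsize * grad(x)
    end
    return x, f(x)  # final iterate and objective value
end
```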
Optimization algorithms as iterators
This is nothing new of course, but I like it a lot: once they’re given an objective and a starting point, gradient descent & co. are iterators with clearly defined state and output structures. One can loop through them and make decisions based on the output: whether to break the loop, compute validation metrics, adjust step sizes, or just log information. But one shouldn’t be computing any forward/backward passes or explicitly asking the algorithm to update parameters: the algorithm is already taking care of that as you iterate it.
I’ve always felt the “usual” implementation (in deep learning frameworks) of gradient descent & co. to be a bit counterintuitive. To use them, one needs to do everything: evaluate the objective, make sure the backward pass is done, then call some method so that the algorithm updates the parameters (this is usually one or two lines, depending on how much additional “state” the algorithm needs to track). This seems a bit unnatural to me, given what these optimizers do: they minimize some function, so let’s give them the damn function.
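Here is a minimal sketch of the pattern (hypothetical names, not an existing package API): gradient descent as a plain Julia iterator built from the objective, a gradient oracle, a starting point, and a step size sequence.

```julia
struct GradientDescent{F,G,T,S}
    f::F          # objective to minimize
    grad::G       # gradient oracle: grad(x) has the same structure as x
    x0::T         # starting point (Array, model, ...)
    stepsizes::S  # iterable of step sizes
end

# Possibly infinite (e.g. with a constant step size sequence).
Base.IteratorSize(::Type{<:GradientDescent}) = Base.SizeUnknown()

Base.iterate(gd::GradientDescent) = iterate(gd, (gd.x0, iterate(gd.stepsizes)))

function Base.iterate(gd::GradientDescent, (x, step_next))
    step_next === nothing && return nothing  # step size sequence exhausted
    step, step_state = step_next
    x = x - step * gd.grad(x)                # the algorithm owns the update
    return (x = x, f = gd.f(x)), (x, iterate(gd.stepsizes, step_state))
end
```

The consumer’s loop then only inspects outputs and decides when to stop:

```julia
for (k, out) in enumerate(GradientDescent(f, grad, x0, Iterators.repeated(0.1)))
    k % 10 == 0 && @info "iteration $k" out.f
    k >= 100 && break
end
```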
Even more iterators
Step size: it can be a number, but it’s more generally a sequence (so, an iterator). Nesterov momentum parameter: for convex functions, that’s usually a rather particular sequence (encoded as an iterator).
What if the step size needs to be dynamically adjusted? (“Learning rate scheduling.”) Well, one can use a Settable iterator, whose value can be changed at any time.
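A Settable could be as simple as the following sketch (again hypothetical, not an existing package type): a mutable wrapper that yields its current value forever.

```julia
# Yields its current value indefinitely; the value can be changed at any
# time from outside the loop consuming it.
mutable struct Settable{T}
    value::T
end

set!(s::Settable, x) = (s.value = x; s)

Base.iterate(s::Settable, state = nothing) = (s.value, nothing)
Base.IteratorSize(::Type{<:Settable}) = Base.IsInfinite()
Base.eltype(::Type{Settable{T}}) where {T} = T
```

Passed as the `stepsizes` of the `GradientDescent` iterator above, the loop consuming the outputs can call `set!` whenever, say, a validation metric stalls.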
---
(I’ll update this post in case anything else comes to mind)