Yes, that’s right: that’s not very well summarized.
The main design ideas I wanted to explore are the following: I think the common thread is “recycling as much existing interface as possible before inventing more of it”.
Models (and gradients) live in an inner product space
This means one should be able to (at least) sum, scale, and take inner products between models of the same type, out of the box. When computing gradients, they will have exactly the same structure as the model they refer to. This kind of makes sense, since they are objects living in the same space, which you will want to add and subtract or whatever.
(I have explored using Array-like structures from RecursiveArrayTools and ComponentArrays here, but ended up doing it in a custom way—I don’t remember exactly why, but for sure I wanted to get my hands dirty with custom broadcasting.)
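To make this concrete, here is a minimal sketch of the kind of thing I mean (the `Affine` type and its fields are made up purely for illustration, not taken from the actual code):

```julia
using LinearAlgebra

# A toy "model": an affine map x -> W*x + b. A gradient with respect to its
# parameters naturally carries the same (W, b) structure, so it is the same
# kind of object as the model itself.
struct Affine{TW,Tb}
    W::TW
    b::Tb
end

# Vector space operations, out of the box:
Base.:+(m1::Affine, m2::Affine) = Affine(m1.W + m2.W, m1.b + m2.b)
Base.:-(m1::Affine, m2::Affine) = Affine(m1.W - m2.W, m1.b - m2.b)
Base.:*(a::Number, m::Affine) = Affine(a * m.W, a * m.b)

# Inner product between models of the same type (and hence norms, etc.):
LinearAlgebra.dot(m1::Affine, m2::Affine) = dot(m1.W, m2.W) + dot(m1.b, m2.b)

# The model is callable; this is all the objective function needs to know.
(m::Affine)(x) = m.W * x .+ m.b
```

With definitions like these, expressions such as `0.5 * (m1 + m2)` or `m - 0.1 * g` (a gradient step, with `g` an `Affine` gradient) work just as they would on plain arrays.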
In any case, this makes models and associated gradients very similar to regular arrays, and as a consequence one can use
Generic optimization algorithms
Optimization algorithms do not rely on any specific interface for models. You just give them the objective function to be minimized and the initial point: the latter can be an Array, or a more structured, callable object (which we can call a “model”) that the objective function knows how to evaluate. The optimization algorithm is agnostic with respect to the nature of the space it is exploring, and does its job without asking questions.
I haven’t fully explored this: I’d like to verify whether algorithms from e.g. Optim.jl work out of the box here. But that’s the hope, right? Anything that simply requires vector space operations should work with objects offering vector space operations. For example, L-BFGS should work, if implemented without too many restrictions.
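I haven’t checked what Optim.jl actually requires, but the kind of generic code I have in mind is roughly this (continuing the made-up `Affine` sketch above; for brevity the objective is represented only by its gradient here):

```julia
# A deliberately naive gradient method: it only needs `-` and scalar `*`
# on the iterates, so it neither knows nor cares whether `w0` is an Array
# or something like an Affine model.
function gradient_descent(grad, w0; stepsize = 0.1, maxit = 100)
    w = w0
    for _ in 1:maxit
        w = w - stepsize * grad(w)
    end
    return w
end

# On a plain vector (objective sum(w.^2), whose gradient is 2w):
gradient_descent(w -> 2w, [1.0, -2.0, 3.0])

# With the Affine definitions above it works just the same, as long as
# `grad` returns the gradient packaged as another Affine.
```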
Optimization algorithms as iterators
This is nothing new of course, but I like it a lot: once they’re given an objective and a starting point, gradient descent & co. are iterators with clearly defined state and output structures. One can loop through them and make decisions based on the output: whether to break the loop, compute validation metrics, adjust step sizes, or just log information. But one shouldn’t have to compute any forward/backward passes or explicitly ask the algorithm to update parameters: the algorithm already takes care of that as you iterate it.
I’ve always felt the “usual” implementation (in deep learning frameworks) of gradient descent & co. to be a bit counterintuitive. To use them, one needs to do everything: evaluate the objective, make sure the backward pass is done, then call some method so that the algorithm updates the parameters (this is usually one or two lines, depending on how much additional “state” the algorithm needs to track). This seems a bit unnatural to me, given what these optimizers do: they minimize some function, so let’s give them the damn function.
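For what it’s worth, here is a bare-bones sketch of the interface I’m describing (not the actual implementation, and the names are made up):

```julia
# Gradient descent as an iterator: it is constructed from the problem data
# (here just the gradient and a starting point) and then simply iterated.
struct GradientDescentIterable{F,T,S}
    grad::F
    w0::T
    stepsize::S
end

Base.IteratorSize(::Type{<:GradientDescentIterable}) = Base.IsInfinite()

function Base.iterate(gd::GradientDescentIterable, w = gd.w0)
    w_new = w - gd.stepsize * gd.grad(w)
    return w_new, w_new   # output and state coincide in this simple case
end

# The "training loop" only inspects outputs and takes decisions; it never
# triggers forward/backward passes or parameter updates itself.
gd = GradientDescentIterable(w -> 2w, [1.0, -2.0, 3.0], 0.1)
for (k, w) in enumerate(gd)
    k % 10 == 0 && println("iteration $k, squared norm = ", sum(abs2, w))
    k >= 50 && break   # the loop decides when to stop
end
```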
Even more iterators
Step size: can be a number, but it’s more generally a sequence (so, an iterator). Nesterov momentum parameter: for convex functions, that’s usually a rather particular sequence (encoded as an iterator).
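Concretely, with nothing but Base iterators (the numbers are arbitrary):

```julia
# A constant step size is just a value repeated forever...
constant_step = Iterators.repeated(0.1)

# ...and a diminishing one (say 0.1 / k) is any other infinite iterator.
diminishing_step = (0.1 / k for k in Iterators.countfrom(1))

collect(Iterators.take(diminishing_step, 4))   # 0.1, 0.05, 0.0333, 0.025 (approximately)

# The algorithm then consumes step sizes alongside its own iterates (e.g. by
# zipping the two sequences); the Nesterov momentum sequence can be encoded
# in exactly the same way.
```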
What if the step size needs to be dynamically adjusted (“learning rate scheduling”)? Well, one can use a Settable iterator, so that its value can be changed at any time.
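Something in this spirit (just a sketch; the actual Settable type may well look different):

```julia
# A "settable" iterator: it keeps yielding its current value, which can be
# changed from the outside at any point, e.g. by a learning rate schedule.
mutable struct Settable{T}
    value::T
end

Base.IteratorSize(::Type{<:Settable}) = Base.IsInfinite()
Base.iterate(s::Settable, state = nothing) = (s.value, nothing)

set!(s::Settable, x) = (s.value = x; s)

stepsize = Settable(0.1)
# ...later, from the training loop, when e.g. the validation loss plateaus:
set!(stepsize, stepsize.value / 2)
```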
—————————
(I’ll update this post in case anything else comes to mind)