Yes, that's right, that's not very well summarized.

The main design ideas I wanted to explore are the following: I think the common thread is "recycling as much existing interface as possible before inventing more of it".

## Models (and gradients) live in an inner product space

This means one should be able to (at least) sum, scale, and take inner products between models of the same type, out of the box. Gradients, once computed, have exactly the same structure as the model they refer to, and this makes sense: they are objects living in the same space, which you will want to add, subtract, and so on.
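A minimal sketch of what this could look like, with hypothetical type and field names (not taken from any actual package): a small "model" type whose fields support vector-space operations, so the model itself does too.

```julia
using LinearAlgebra

# A toy affine model: the type and its fields are illustrative only.
struct Affine
    W::Matrix{Float64}
    b::Vector{Float64}
end

# Vector-space operations, defined field by field.
Base.:+(m1::Affine, m2::Affine) = Affine(m1.W + m2.W, m1.b + m2.b)
Base.:-(m1::Affine, m2::Affine) = Affine(m1.W - m2.W, m1.b - m2.b)
Base.:*(a::Real, m::Affine) = Affine(a * m.W, a * m.b)

# Inner product: sum of the fields' inner products.
LinearAlgebra.dot(m1::Affine, m2::Affine) = dot(m1.W, m2.W) + dot(m1.b, m2.b)

# The model is callable: it knows how to evaluate itself on an input.
(m::Affine)(x) = m.W * x .+ m.b
```

Under this convention, the gradient of an `Affine` is itself an `Affine` with the same field shapes, so adding or subtracting a (scaled) gradient from a model needs no extra machinery.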

(I have explored using Array-like structures from RecursiveArrayTools and ComponentArrays here, but ended up doing it in a custom way; I don't remember exactly why, but for sure I wanted to get my hands dirty with custom broadcasting.)

In any case, this makes models and associated gradients very similar to regular arrays, and as a consequence one can use

## Generic optimization algorithms

Optimization algorithms do not rely on any specific interface for models. You just give them the objective function to be minimized and the initial point: the latter can be an `Array`, or a more structured, callable object (which we can call a "model") that the objective function knows how to evaluate. The optimization algorithm is agnostic with respect to the nature of the space it is exploring, and does its job without asking questions.
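A sketch of what "agnostic" means here, with hypothetical names (this is not any particular package's implementation): the routine below only ever uses subtraction and scalar multiplication on its iterate, so it never needs to know whether it is exploring arrays or structured models.

```julia
# Gradient descent that assumes nothing about x0 beyond vector-space
# operations: only `-` and scalar `*` are used on the iterate.
function gradient_descent(grad, x0; stepsize = 0.1, iterations = 100)
    x = x0
    for _ in 1:iterations
        x = x - stepsize * grad(x)
    end
    return x
end

# A plain Array works; any custom model type defining `-` and scalar `*`
# (like the inner-product-space models above) would work unchanged.
xmin = gradient_descent(x -> 2 .* (x .- 1.0), [0.0, 0.0])
```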

I haven't fully explored this: I'd like to verify whether algorithms from e.g. Optim.jl work out of the box here. But that's the hope, right? Anything that only requires vector space operations should work with objects offering vector space operations. For example, L-BFGS should work, provided the implementation doesn't impose unnecessary restrictions on the types it accepts.

## Optimization algorithms as iterators

This is nothing new of course, but I like it a lot: once they're given an objective and a starting point, gradient descent & co. are iterators with clearly defined state and output structures. One can loop through them and make decisions based on the output: whether to break the loop, compute validation metrics, adjust step sizes, or just log information. But one shouldn't be computing any forward/backward passes or explicitly asking the algorithm to update parameters: the algorithm already takes care of that as you iterate it.
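A sketch of this idea using Julia's iteration interface (all names are hypothetical): the objective and gradient are handed over once, each call to `iterate` performs one full update internally, and the caller's loop only looks at the output and decides when to stop.

```julia
# Gradient descent packaged as an iterator: the iterate is the state,
# the objective value is the output at each step.
struct GradientDescent{F,G,T}
    f::F          # objective to minimize
    grad::G       # its gradient
    x0::T         # starting point
    stepsize::Float64
end

Base.IteratorSize(::Type{<:GradientDescent}) = Base.IsInfinite()

function Base.iterate(o::GradientDescent, x = o.x0)
    x_new = x - o.stepsize * o.grad(x)   # the update happens in here
    return o.f(x_new), x_new             # (output, next state)
end

# The caller only loops and decides: no forward/backward calls, no
# explicit parameter updates.
function run_until(o; tol = 1e-8, maxiter = 1000)
    fx_last = Inf
    for (k, fx) in enumerate(o)
        fx_last = fx
        (fx < tol || k >= maxiter) && break   # stopping logic lives here
    end
    return fx_last
end

o = GradientDescent(x -> sum(abs2, x .- 1.0), x -> 2 .* (x .- 1.0), [0.0, 0.0], 0.1)
```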

I've always felt the "usual" implementation (in deep learning frameworks) of gradient descent & co. to be a bit counterintuitive. To use them, one needs to do everything: evaluate the objective, make sure the backward pass is done, then call some method so that the algorithm updates the parameters (this is usually one or two lines, depending on how much additional "state" the algorithm needs to track). This seems a bit unnatural to me, given what these optimizers do: they minimize some function, so let's give them the damn function.

## Even more iterators

Step size: it can be a number, but it's more generally a sequence (so, an iterator). Nesterov momentum parameter: for convex functions, that's usually a rather particular sequence (encoded as an iterator).
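For instance, the classical Nesterov/FISTA momentum sequence for convex problems, with t(1) = 1, t(k+1) = (1 + sqrt(1 + 4 t(k)²)) / 2 and momentum coefficient (t(k) - 1) / t(k+1), can be written as an iterator (the type name here is made up for illustration):

```julia
# The standard Nesterov momentum coefficients as an infinite iterator.
struct NesterovSequence end

Base.IteratorSize(::Type{NesterovSequence}) = Base.IsInfinite()

function Base.iterate(::NesterovSequence, t = 1.0)
    t_next = (1 + sqrt(1 + 4 * t^2)) / 2
    return (t - 1) / t_next, t_next   # (momentum coefficient, next t)
end

# First few coefficients, starting from 0 and increasing toward 1:
coeffs = collect(Iterators.take(NesterovSequence(), 5))
```

An optimizer can then consume this sequence step by step, exactly like it consumes a step-size sequence.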

What if the step size needs to be dynamically adjusted ("learning rate scheduling")? Well, one can use a Settable iterator, whose value can be changed at any time.
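A minimal sketch of such a settable iterator (the name and fields are illustrative, not a reference to an existing type): it yields its current value forever, and mutating the value from outside changes what subsequent iterations see.

```julia
# An infinite iterator whose yielded value can be mutated at any time,
# e.g. by scheduling logic in the surrounding training loop.
mutable struct Settable{T}
    value::T
end

Base.IteratorSize(::Type{<:Settable}) = Base.IsInfinite()
Base.iterate(s::Settable, state = nothing) = (s.value, nothing)

stepsizes = Settable(0.1)
first(stepsizes)        # → 0.1
stepsizes.value = 0.01  # adjust the step size on the fly
first(stepsizes)        # → 0.01
```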

---

(I'll update this post in case anything else comes to mind.)