[ANN] ProtoGrad.jl, the (n+1)th ML/optimization package

Hey there :wave: I just published ProtoGrad, a package for gradient-based ML and optimization (yes, another one).

It is not registered. I developed it in my spare time (such a scarce resource!) mainly as a playground for design ideas, and I’m announcing it here in case someone finds any of those useful or interesting. As such, it is far from feature complete (whatever that means), and things may change.

It’s based on Zygote and NNLib, so don’t be surprised if it bears any resemblance to Flux or other analogous, great packages.

The README tells a bit about the main ideas (I won’t copy-paste it here) and a couple examples are there to make it more concrete: that’s the only documentation-like material at the moment, but hey! That’s at least something!

Let me know what you think!

15 Likes

The readme is great and easy to follow along, one thing I wish was somewhere either in the readme or here is how you perceive the core idea you explored in this package to differ from that of (some of the) other similar packages? is it because you have gradients themselves be models or rather this iterators approach for GD?

2 Likes

Yes, that’s right, that’s not very well summarized.

The main design ideas I wanted to explore are the following: I think the common thread is “recycling as much existing interface as possible before inventing more of it”.

Models (and gradients) live in an inner product space

This means one should be able to (at least) sum, scale, take inner products between models of the same type, out of the box. When computing gradients, they will have exactly the same structure as the model they refer to: and this kind of makes sense, since they are objects living in the same space, that you will want to add and subtract or whatever.

(I have explored using Array-like structures from RecursiveArrayTools and ComponentArrays here, but ended up doing it in a custom way—I don’t remember exactly why, but for sure I wanted to get my hands dirty with custom broadcasting.)

In any case, this makes models and associated gradients very similar to regular arrays, and as a consequence one can use

Generic optimization algorithms

Optimization algorithms do not rely on any specific interface for models. You just give them the objective function to be minimized and the initial point: the latter can be an Array, or can be a more structured, callable object (that we can call “model”), that the objective function knows how to evaluate. The optimization algorithm is agnostic with respect to the nature of the space it is exploring, and does its job without asking questions.

I haven’t fully explored this: I’d like to verify whether algorithms from e.g. Optim.jl work out of the box here. But that’s the hope, right? Anything that simply requires vector space operations should work with objects offering vector space operations. For example, L-BFGS should work, if implemented without too many restrictions.

Optimization algorithms as iterators

This is nothing new of course, but I like it a lot: once they’re given an objective and a starting point, gradient descent & co. are iterators with clearly defined state and output structures. One can loop through them and make decisions based on the output: whether to break the loop, or compute validation metrics, or adjust step sizes, or just log information. But one shouldn’t be computing any forward/backward passes or explicitly ask the algorithm to update parameters: the algorithm is already taking care of that as you iterate it.

I’ve always felt the “usual” implementation (in deep learning frameworks) of gradient descent & co. to be a bit counterintuitive. To use them, one needs to do everything: evaluate the objective, make sure the backward pass is done, then call some method so that the algorithm updates the parameters (this is usually one or two lines, depending on how much additional “state” the algorithm needs to track). This seems a bit unnatural to me, given what these optimizers do: they minimize some function, so let’s give them the damn function.

Even more iterators

Step size: can be a number, but it’s more generally a sequence (so, an iterator). Nesterov momentum parameter: for convex functions, that’s usually a rather particular sequence (encoded as an iterator).

What if the step size needs to be dynamically adjusted? (“Learning rate scheduling”) Well, one can use a Settable iterator to be able to change its value at any time.

—————————

(I’ll update this post in case anything else comes to mind)

5 Likes

Great summary, thanks! :raised_hands:

How does this compare against using explicit parameters in Flux? I noticed that ProtoGrad.jl/gradient.jl at main · lostella/ProtoGrad.jl · GitHub is very similar to GitHub - FluxML/Functors.jl: Parameterise all the things in both form and spirit, for example.

1 Like

I think this is a great way to organize the API. Wish more packages did this.

1 Like

I also really like the “optimizers as iterators” approach. For reference, it was also proposed for Optim here: Add iterator interface by tkf · Pull Request #745 · JuliaNLSolvers/Optim.jl (github.com).

I’m curious: do you also plan to support iterators from Optim.jl? The limitation there IMO is that those iterators require the parameters to be in a flat vector (rather than in a model object). If ProtoGrad handled that, I think it’d be quite nice to be able to use classical (eg quasi-Newton) and batched optimizers with a similar syntax.

EDIT: never mind, reading more carefully, it looks like you want to apply classical optimizers directly in model-space, because it has a inner product. That is a nice idea! You may want to have a look at Jutho/OptimKit.jl: OptimKit: A blissfully ignorant Julia package for gradient optimization (github.com) for that. It has fewer optimizers than Optim. jl, but does not require that the parameters are stored as a flat vector (it instead asks you to explicitly tell it what inner product to use).

4 Likes

It’s not that different in the end, but I wanted to see to what extent we can avoid any mechanism to access parameters other than just “attributes”: essentially, moving any hyperparameter into the struct type itself. I can see how this may turn out to be limiting, since the distinction between parameters and hyperparameters is not set in stone.

But drawing a line between parameters and hyperparameters allows you to have vector operations directly on structures, which ideally enables you to use whatever general purpose gradient-based optimization algorithm implementation to optimize a model. This can also obtained with explicit parameters declaration, I just thought the “recursively descend through all attributes” approach would be conceptually simpler; another advantage might be that whenever you try to sum two structs that are incompatible (because they differ in one hyperparameter) you get a type error.

Definitely, that’s really interesting. I probably only scratched the surface of Flux, I’ll definitely look into Functors because it looks very much related. Thanks for the pointer!

1 Like

I haven’t tried yet, but if you say that they rely on the decision variable to be stored in a Vector then it won’t probably work.

However that’s maybe an artificial limitation, at least for some algorithms: things like L-BFGS should work out of the box if you just have sum, scaling, and inner product.

Whenever I have time I’ll also try to add examples to those points :wink:

1 Like