How to best model different domains in Optimization, especially Manifolds?

Thanks for your explanation. I would carefully claim that the algorithms covered in the nice book by Absil et al. are all available in Manopt.jl (with the exponential map or your favourite retraction); the same holds for the manifolds they mention and Manifolds.jl.
And thanks for the concrete example. Though the ideas behind Grassmann are at first quite abstract, that is a nice example where one does get to a nice array representation.

1 Like

That does look very comprehensive. I haven’t been doing work involving manifolds recently, but I’m sure I’ll play with this anyway. And I probably will return to something involving manifolds again. I published a paper on my previous project, but I still had a lot of questions I couldn’t answer and probably will return to later. I’m sure this package would have saved me a lot of time and probably will in the future. Would it fill in the exponential map as a default retraction for something like M=ProductManifold(Grassmann(n,k1), Grassmann(n,k2))?

(This is a small detour here, but you could even write M = Grassmann(n,k1) × Grassmann(n,k2) (using the × operator); since default_retraction_method(Grassmann(n,k1)) is the PolarRetraction, this would be used on both components by default as a retraction in solvers, on the product manifold even element-wise :slight_smile: )
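As a toy illustration of that element-wise behaviour, here is a Python sketch (not the actual Manifolds.jl API): a product manifold's retraction simply applies each component's retraction to its component. Two unit spheres stand in for the Grassmann factors, since the sphere's projection retraction fits in a few lines of stdlib Python.

```python
import math

def sphere_retract(p, v):
    """Projection retraction on the unit sphere: normalize p + v."""
    q = [pi + vi for pi, vi in zip(p, v)]
    n = math.sqrt(sum(x * x for x in q))
    return [x / n for x in q]

def product_retract(retractions, points, tangents):
    """Apply each component's retraction element-wise on the product."""
    return [r(p, v) for r, p, v in zip(retractions, points, tangents)]

# Two sphere components standing in for Grassmann(n, k1) x Grassmann(n, k2):
M = [sphere_retract, sphere_retract]
p = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
v = [[0.0, 0.1, 0.0], [0.2, 0.0, 0.0]]
q = product_retract(M, p, v)
for comp in q:
    # each component stays on its own sphere (the sums print as 1.0)
    print(round(sum(x * x for x in comp), 12))
```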

1 Like

Let me nuance a little bit the statement “everything should be an array in JuMP” or even “things should be parametrized by numbers”.
For instance, SetProg extends JuMP to set variables (InfiniteOpt is another good example).
When the user creates a set variable, they do not specify how it is parametrized with numbers, and you are also not allowed to perform arbitrary operations (the sum of two ellipsoids is not an ellipsoid, so you cannot sum them either, otherwise you leave the set – the situation is quite similar ;))
The user can add inclusion constraints and membership constraints, and can specify the objective via volume etc.
Then, SetProg finds a good way to parametrize the sets with numbers, rewrites the problem as a conic program, and passes it to a conic solver through MOI.
In MOI, everything is represented by sets that are subsets of R^n.
Once you choose a representation of the points of your manifold, you can always do that.
For an SVD, you can for instance concatenate the vectorizations of the three matrices.
Here are a few examples:

  • For hermitian matrices, we concatenate the real part of the upper triangular part of the matrix with the imaginary part of the strictly upper triangular part of the matrix
  • For implications b => f(x) in S, we concatenate the binary variable b (represented as 0 or 1 so it’s also a real number) with the value of f(x) and the set is Indicator(S).
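The Hermitian case in the first bullet can be sketched in a few lines of plain Python (illustrative only, not the MOI implementation): the real parts of the upper triangular part, followed by the imaginary parts of the strictly upper triangular part, which gives n(n+1)/2 + n(n-1)/2 = n^2 real numbers for an n x n Hermitian matrix.

```python
def vectorize_hermitian(A):
    """Vectorize a Hermitian matrix (list of rows of complex numbers):
    real upper triangle (including diagonal), then strictly upper imag parts."""
    n = len(A)
    real_part = [A[i][j].real for i in range(n) for j in range(i, n)]
    imag_part = [A[i][j].imag for i in range(n) for j in range(i + 1, n)]
    return real_part + imag_part

A = [
    [1 + 0j, 2 + 3j],
    [2 - 3j, 4 + 0j],
]
print(vectorize_hermitian(A))  # [1.0, 2.0, 4.0, 3.0]
```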

Whatever the API is, the user needs to communicate at least two things:

  • The starting values
  • The objective function

The starting values will be a vector of starting values for the parametrization.
The objective function will be a function that, given the vector of values of the parametrization, returns the value of the objective.
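This two-item contract (a starting vector plus an objective callback over the parametrization) is easy to see in a toy solver. The following is a Python sketch, not the MOI interface; the finite-difference descent and all names are made up for illustration.

```python
def minimize(objective, x0, steps=200, lr=0.1, h=1e-6):
    """Toy solver: it only ever sees a starting vector and an objective
    callback over the parametrization, nothing about the original struct."""
    x = list(x0)
    for _ in range(steps):
        # forward-difference gradient of the callback
        g = []
        for i in range(len(x)):
            xp = list(x)
            xp[i] += h
            g.append((objective(xp) - objective(x)) / h)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

# Objective given purely in terms of the parametrization vector:
f = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2
print([round(v, 3) for v in minimize(f, [0.0, 0.0])])  # ≈ [1.0, -2.0]
```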

The parametrization does not need to be an array; it can be any struct, really, at the JuMP level. At the MOI level, you will take all the real numbers parametrizing this struct, decide on an order between them, and pass them as a vector.

I would say the only restriction in the MOI/JuMP design is that decision variables are real numbers.
So if you want the variable to represent some point on a manifold that does not support +, you will not be able to communicate this point as a decision variable in MOI, but you can with such a construct at the JuMP level.
If you want the user to be able to have such nice syntax, you can build a JuMP extension doing that. The JuMP extension will then have to decide, when JuMP.optimize! is called, what the best parametrization is (or let the user decide that via an option) and use it to communicate the problem through MOI. Since this decision is made when you have the whole model, it does not have to be made at the creation of each constraint (that is important for SetProg).

So more concretely, when the user does:

@variable(model, X[1:3, 1:3] in FixedRankMatrices(2))

the parametrization is decided upfront; the advantage is that the user can specify the objective function and starting values in the embedding.
It does not mean that the solver cannot internally do everything with an SVD representation!
But the user will specify the objective with respect to these variables. You can get the objective as an expression tree (where AD still needs to be applied) or as callbacks (with AD already done) and then try to derive the objective function w.r.t. another parametrization if you like.

If the user does:

@variable(model, X in FixedRankMatrices(2))

then you can define your own object and have full control, with multiple dispatch, over the behavior of that object. You will store custom info in the model.ext field and set a model.optimize_hook that will call a function when optimize! is called to parse the objective (now in terms of your custom object X, not in terms of the entries of a matrix representation!). Then you can pick an SVD representation of X, compute gradient callbacks etc. with respect to this representation, and pass to MOI a vectorization of this SVD representation.

At the MOI solver level, you receive the vector of starting values (which you can then reshape into the SVD representation of the starting value) as well as objective value and gradient callbacks that take this vectorization as input. Now, if you don't want to vectorize and reshape, and would rather give the SVD representation to the callbacks directly, you could wrap it in some custom struct that is a subtype of AbstractVector and then recognize and unwrap it at the JuMP level. If you just pass the SVD representation without any wrapping, it will only work with the callbacks you create from your JuMP extension; if someone uses MOI directly and creates a custom callback expecting the vector representation, it might fail.
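The vectorize/reshape round trip for an SVD-style triple can be sketched as follows (pure Python, lists of rows; the flat layout U-then-s-then-V is an arbitrary choice made up for illustration, not anything MOI prescribes):

```python
def flatten_svd(U, s, V):
    """Concatenate U (m x k), s (k), V (n x k) into one flat vector."""
    return [x for row in U for x in row] + list(s) + [x for row in V for x in row]

def unflatten_svd(vec, m, n, k):
    """Reshape the flat vector back into the (U, s, V) triple."""
    U = [vec[i * k:(i + 1) * k] for i in range(m)]
    s = vec[m * k:m * k + k]
    off = m * k + k
    V = [vec[off + i * k:off + (i + 1) * k] for i in range(n)]
    return U, s, V

U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # 3 x 2
s = [5.0, 2.0]
V = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # 3 x 2
vec = flatten_svd(U, s, V)
assert unflatten_svd(vec, 3, 3, 2) == (U, s, V)  # lossless round trip
print(len(vec))  # 14 real numbers for this rank-2 representation of a 3x3 matrix
```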

6 Likes

So, I’m not a card-carrying Manifold Person, but I did code up the implementation in Optim, did a fair bit of optimisation on Stiefel/Grassmann and even wrote a couple of papers developing and analyzing algorithms for this, so let me give my perspective on this. There are two ways of looking at manifold optimization: the proper way and the poor man’s way.

In the proper way (eg Manopt, OptimKit), you distinguish carefully between points in the manifold and vectors on the tangent space, and you keep track of the base point of every tangent vector. You can’t directly do x+v for a manifold point x and a tangent vector v, but you can do retract(x,v) which moves on the manifold in the direction v. You can’t do v1+v2 or dot(v1, v2) if v1 and v2 are tangent vectors at points x1 and x2, but you can use parallel transport to do it. Etc etc.

In the poor man’s way (eg Optim), you don’t care too much about these subtleties, and add things together as you wish, then go back to the manifold when you need to; so you define two functions: retract(x), which takes a point close to the manifold and puts it back on the manifold (compare with retract(x,v) in the proper way, the correspondence being retract(x,v) = retract(x+v)), and project_tangent(x,v), which projects a vector onto the tangent space at x. You make sure to project_tangent any vector that should be in the tangent space (eg gradients, or vectors from tangent spaces at other points), and to retract any time you move on the manifold.
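On the unit sphere in R^3 the poor man's pair is a one-liner each; here is a stdlib-Python sketch (illustrative, not the Optim code) including one "add in the embedding, then retract" step:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retract(x):
    """Put a point near the sphere back onto it: normalize."""
    n = math.sqrt(dot(x, x))
    return [xi / n for xi in x]

def project_tangent(x, v):
    """Remove the normal component of v at the point x on the sphere."""
    c = dot(x, v)
    return [vi - c * xi for vi, xi in zip(v, x)]

x = retract([1.0, 1.0, 0.0])
v = project_tangent(x, [0.3, -0.1, 0.2])
# A poor man's gradient step: add in the embedding, then retract.
y = retract([xi + vi for xi, vi in zip(x, v)])
print(abs(dot(x, v)) < 1e-12, abs(dot(y, y) - 1.0) < 1e-12)  # True True
```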

The poor man’s way corresponds exactly to a special case of the proper way when the manifold is embedded in Euclidean space, and when the retraction and parallel transport are chosen in a particular way. Historically, it precedes the proper way, which came as a rationalization of the poor man’s way (see eg “geometry of algorithms with orthogonality constraints”). The advantage is that it’s much easier to modify existing codes and port existing algorithms. It covers the sphere, Stiefel and Grassmann very well (in my tests, I did not find any significant difference between Optim’s poor-man Stiefel LBFGS and others). I still have not really understood whether there are applications that really do require the full formalism of the proper way; I would welcome a clear explanation of an example of this.

Also some hard-won insights I’ve noticed that are hard to find in books and papers:

  • The choice of the retraction, transport, etc mostly does not matter, all perform about the same. There is nothing magical about exponential maps, they are just one type of retraction (that happens to be more natural mathematically). Pick the one that is most efficient and stable numerically.
  • For the purposes of optimization, forget completely about Grassmann. Rather, pretend you have not noticed that your objective function only depends on the subspace and not the particular basis, use Stiefel, and relax about the fact that the solution will not be isolated. There is mathematically no difference between first-order optimization algorithms on Stiefel and Grassmann (i.e. they will produce the exact same iterates). Second-order optimization algorithms will have a degenerate Hessian, but that’s fine, because the right-hand side will have zero components on the nullspace of the Hessian. Your optimization algorithm should in any case have a way to filter the zero eigenvalues of the Hessian (either explicitly, or implicitly by using an iterative method to solve the Hessian linear system). It’s a bit like having a standard flat optimization problem min F(x,y) where F only depends on x. The optimizer will just leave the y component alone and converge to a solution (which is not isolated, but so what?). This has a bad reputation in optimization because all the standard results talk about isolated minimizers, but it’s really completely fine in most circumstances (typically when you have analytic degeneracies). I think it would be very nice to flesh out the mathematical theory of degenerate optimization problems (eg show that algorithms converge to sets rather than points), so that we as a community can learn to stop worrying and love the degeneracies. If anybody is looking for a fun research project in optimization…
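The flat min F(x,y) picture from the last bullet takes only a few lines to run (Python sketch; the objective and step size are made up): gradient descent converges in x and leaves y wherever it started, so the minimizer set {(3, y)} is a whole line rather than an isolated point.

```python
# F(x, y) = (x - 3)^2 simply ignores y.
def grad_F(x, y):
    return (2.0 * (x - 3.0), 0.0)  # dF/dy is identically zero

x, y = 0.0, 7.5
for _ in range(100):
    gx, gy = grad_F(x, y)
    x, y = x - 0.1 * gx, y - 0.1 * gy

print(round(x, 6), y)  # 3.0 7.5 -- y is untouched
```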
4 Likes

Yes, sorry, that was my impression after the work on the current PR, but you are right; first of all, I am still fighting (with myself) to understand the similarities, differences, properties and limitations of MOI and JuMP.

One problem that I do see with the parametrisation – which is what I informally phrased as “everything must be vectorised” – is for example with Hyperbolic, which has 3 different parametrizations. So how would MOI know whether it works on Hyperboloid-, Poincaré-halfplane- or Poincaré-ball-based vectors?

None of the manifolds (besides Euclidean and its subspaces) support + or *.
Two points (decision variables? I am again struggling with topic-specific language) cannot be combined – you need a point (p) and a tangent vector (X) for that, and even then it is exp(M, p, X) instead of p+X. But sure, the X are from a vector space (attached to, and maybe different for, every p), so a*X works fine.

For me this sentence means we cannot do an MOI interface for Manopt, because if we need to have a +, this would only work for Euclidean – and for that you have much better toolboxes.

But this also means that when passing from Manopt to MOI we have to embed, and when coming “back” we have to SVD? That would in the best of cases be only reasonably performant – most probably not at all.

This sounds like the way to go for me then, though the vectorisation again is what I meant with “everything has to be vectorized” – and hence brings back the question on Hyperbolic.

Just a remark again on the terminology used: your retract(x) would be called a projection (of x from the embedding onto the manifold) in Manopt (not to be confused with a pullback of a metric or such); your project_tangent is indeed also a project in Manopt.jl/Manifolds.jl.

For the rest I agree, the “poor man’s way” – or maybe the one that implicitly assumes a bit more (an embedding, that the metric stems from restricting the embedding’s metric, that you use the retraction-by-projection, …) – works totally fine in those settings :slight_smile:

I do not completely agree with this; retractions are a first-order approximation to the exponential map. That means for small steps they are “good enough”. The same holds for inverse retractions and log, or vector transports and parallel transport. For example, parallel transport ensures that the angle between two vectors that you transport stays the same – general vector transports do not keep that; in other words, parallel transport is the only vector transport that does. So if you need that in your algorithm, it does matter.
Otherwise – sure, take the retraction that is for your case the most efficient one, that is one of the reasons we decided to implement every retraction we can find :wink:
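The angle-preservation point is easy to see numerically on the unit sphere. Below is a stdlib-Python sketch (not Manifolds.jl code) comparing the closed-form parallel transport along a geodesic with the cheaper transport-by-projection: the former keeps the 45° angle between two tangent vectors, the latter distorts it.

```python
import math

def dot(a, b): return sum(x * y for x, y in zip(a, b))
def norm(a): return math.sqrt(dot(a, a))
def angle(a, b): return math.acos(dot(a, b) / (norm(a) * norm(b)))

# Geodesic on the unit sphere from p in unit tangent direction u, arc length t:
p = [1.0, 0.0, 0.0]
u = [0.0, 1.0, 0.0]
t = 0.7
q = [math.cos(t) * pi + math.sin(t) * ui for pi, ui in zip(p, u)]

def parallel_transport(v):
    """Closed-form parallel transport along t -> cos(t) p + sin(t) u."""
    c = dot(u, v)
    return [vi + c * ((math.cos(t) - 1.0) * ui - math.sin(t) * pi)
            for vi, ui, pi in zip(v, u, p)]

def projection_transport(v):
    """Cheaper vector transport: just project v onto the tangent space at q."""
    c = dot(q, v)
    return [vi - c * qi for vi, qi in zip(v, q)]

v1 = u                                              # tangent at p
v2 = [0.0, 1.0 / math.sqrt(2), 1.0 / math.sqrt(2)]  # tangent at p, 45 deg to v1
a0 = angle(v1, v2)
a_par = angle(parallel_transport(v1), parallel_transport(v2))
a_prj = angle(projection_transport(v1), projection_transport(v2))
print(round(math.degrees(a0)), round(math.degrees(a_par)), round(math.degrees(a_prj)))
# 45 45 53 -- parallel transport preserves the angle, projection does not
```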

Again, this might often work well. But if you have a gradient that just “points into” a direction in which the basis X (Stiefel interpretation) would rotate in the space, this gradient has a non-zero “Stiefel norm” while it is zero w.r.t. the Grassmann norm – and hence you are at a critical point there.
The best counterexample here is trying to run Newton’s method (first order, but it sure approximates second-order things) on Rayleigh’s quotient on R^n – this will just run to a solution with infinite norm in many cases.
I think I am mainly caricaturing your wording here – because you are indeed right with your last sentences: there are often degeneracies that are “nice enough” so that “things just work”, and sure, this would definitely need more optimisation research (I would be interested as well). Until then, being too imprecise about these things sounds a bit like “yeah, we divide by zero here, but it still works somehow”.
The area one would have to work a lot on then is probably Quotient manifolds.

The vectorization is only needed to communicate the starting values and solution vector if you want to use the VariablePrimalStart and VariablePrimal MOI attributes. If you want to use these MOI attributes, you will need to vectorize, hence the JuMP extension should vectorize in its optimize_hook. You could also define your custom ManifoldVariablePrimalStart and ManifoldVariablePrimal that set the start and receive the solution in a non-vectorized format if you like.

What I mean is that it cannot be communicated as an MOI.VariableIndex so if you want to use MOI.VariableIndex you need to pass a vectorization through MOI. The other option is to just use custom MOI attributes and not use MOI.VariableIndex as explained above.

If it’s not performant to communicate the starting values and solution in that parametrization, then you can use another one like the SVD one; it does not have to be represented as an array. I have just built in support for array-like types in JuMP and a few LinearAlgebra structs, and it’s easy to add support for any other type of struct.

As mentioned above, we could take the road of not vectorizing and not using MOI.VariableIndex, but I would first try to see if vectorizing in the JuMP extension is enough.
At the end of the day, you will also at some point choose a representation and do the optimization in the solver, so why not make this choice at the level of the JuMP extension?

1 Like

So for me this reads like some (easier?) manifolds can use the MOI variables you speak about, but the question then would be: does it make sense to have both an easy and an advanced way to “speak to” Manopt? I am not sure.

My algorithms internally handle all cases of Hyperbolic completely the same, in a completely abstract way, because in the end only exp(M, p, X) (read as “walk into the direction of X to take the iterative step”) will dispatch on the types of p and X. So my solvers are agnostic of the representation you choose; Manifolds.jl takes care of that (dispatching on the right function and hence returning the right type).

So vectorizing in the JuMP extension will not cover all manifolds, and some only with quite some “implicit knowledge”: a vectorised SVD representation would be something we implicitly know about a vector without it actually being stored that way.
But this is exactly what I wanted to discuss here – how to specify a domain or have points that are not vectors. For a general interface to all manifolds w.r.t. Manopt, this will be necessary.

In my (limited) experience the proper formalism also doesn’t look important for sphere, Stiefel and Grassmann. There is, however, a long history of people re-discovering tricks for Euclidean optimization that are basic ideas in Riemannian optimization such as natural gradient descent or balancing in fixed rank matrix optimization. So, while everything can be handled optimally by a properly adapted Euclidean optimizer, why ignore a source of good ideas?

1 Like

Even if Manifolds.jl is written in an abstract way, you will still decide on one representation when you do the optimization, and you will call it with some concrete type. Even if the solvers are agnostic, the user will specify one choice at some point and it will solve for this choice (correct me if I am wrong).
So this choice could be made at the JuMP extension level.
The advantage of this is that you could model your problem with the Sphere manifold and then use Ipopt. Because it’s vectorized, Ipopt can solve the problem by transforming the Sphere constraint into a ScalarQuadraticFunction-in-EqualTo.
I’m failing to see which case is not covered by this. Do you sometimes change the representation during the solve?
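To make the Sphere-to-quadratic idea concrete: membership in Sphere(2) is exactly the scalar quadratic equality x1^2 + x2^2 + x3^2 = 1, which a structure-agnostic NLP solver can enforce without knowing any manifold theory. Below is a crude pure-Python quadratic-penalty descent standing in for Ipopt; the linear cost c . x and all constants are made up for illustration, and the exact constrained minimizer is -c/|c|.

```python
c = [1.0, 2.0, 2.0]
MU = 50.0  # penalty weight for the quadratic equality residual

def grad(x):
    """Gradient of c . x + MU * (|x|^2 - 1)^2."""
    r = sum(xi * xi for xi in x) - 1.0
    return [ci + 4.0 * MU * r * xi for ci, xi in zip(c, x)]

x = [-0.5, -0.5, -0.5]
for _ in range(20000):
    g = grad(x)
    x = [xi - 1e-3 * gi for xi, gi in zip(x, g)]

# Close to the exact constrained minimizer -c/|c| = [-1/3, -2/3, -2/3]
# (the penalty method is only approximately feasible).
print([round(v, 3) for v in x])
```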

Yes of course, this choice can be made on the JuMP extension level.

Your example is a very good one: you would have to “convert” the manifold into a constraint. Something like this can be done in an embedding for most manifolds.

There are actually two cases where this might fail:

  • In a very abstract sense the manifold might not be an embedded one, that is, it is not described by constraints in some embedding. This is quite mathematical, and for now I could not come up with a good example, since for computations you at some point arrive at something number-like, which you could see as an embedding – but this might also just be something local around the point you are at. In other words: your parameterisation (real numbers / decision variables) might be completely different (even in size and shape, though not in degrees of freedom) depending on where you are on the manifold.
  • If you only accept vector-like representations, how do you distinguish different representations? This could probably be addressed by proper “conversion” of models/variables/… between domains, though for the Hyperbolic case I do not yet see how. For others (also for fixed rank) there is embed(M, p), which actually returns the full fixed-rank matrix (for p the SVD point), and project(M, X), which takes a matrix from the embedding and SVDs it. While this is costly (so it should not be done every iteration), it is nice for “converting” problems for other solvers.

The conversion for other solvers is also something I see as a huge advantage once Manopt.jl is compatible with MOI/JuMP, besides it becoming available to a larger audience.

This is fine; JuMP will just say that Ipopt does not support it. JuMP has a list of bridges and is then able to detect whether it can combine bridges in order to reformulate into something a solver supports. The goal is that the user can describe the problem with as much structure as the problem has, without altering the ability to solve the problem with a solver that is very generic and does not know anything about this structure.
One similar example is the SecondOrderCone which is kept as is for conic solvers, reformulated into ScalarQuadraticFunction-in-LessThan for quadratically constrained solvers (like Gurobi) and further reformulated into value/gradient/hessian callbacks for Ipopt.

3 Likes

I think we discussed this before, but this is only a viewpoint, in which you see the exponential map as “the thing that we want to do” and everything else as an approximation. But really, in optimization you don’t particularly care about exponential maps; you just want to move around on the manifold, so who cares whether it’s the exponential or some other thing.

So if you need that in your algorithm, it does matter.

But why would you need this? As far as I’m aware, there is no difference either on the theoretical side (you get the same convergence guarantees in both cases) or on the practical side.

The best counterexample here is trying to run Newton’s method (first order, but it sure approximates second-order things) on Rayleigh’s quotient on R^n – this will just run to a solution with infinite norm in many cases.

Huh, this is a good point; for some reason I thought this would not happen, but indeed it does as you say. Indeed, second-order optimization blows up with (curved) degeneracies, because the Hessian is only approximately singular (the thing I had in mind is only valid at critical points). So there is a real difference between Stiefel and Grassmann. Very sorry about that, and thanks for the correction!

But if you have a gradient that just “points into” a direction in which the basis X (Stiefel interpretation) would rotate in the space, this gradient has a non-zero “Stiefel norm” while it is zero w.r.t. the Grassmann norm – and hence you are at a critical point there.

That doesn’t happen, though. Suppose you’re doing a Stiefel optimization and you’re at a point X which is a critical point for the Grassmann interpretation. The Stiefel tangent space T splits into an orthogonal sum T1+T2, where T1 is the tangent space of the Grassmann (“horizontal tangent space” in bundle language) and T2 is the “internal rotations” (vertical tangent space). The gradient is zero on T1 because it’s a critical point for the Grassmann interpretation, and it’s zero on T2 because the objective function doesn’t change wrt internal rotations.

So the gradient method is exactly the same for Stiefel and Grassmann, and doesn’t randomly rotate the basis, simply because the gradient doesn’t want it to. For things like LBFGS this is not exactly true (because you’re importing information from previous steps, which does have a component on the internal rotations), but that component is pretty small and gets killed by the retraction anyway. In my tests it never seemed to matter.
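The horizontal/vertical splitting can be checked numerically in a few lines. Below is a pure-Python sketch on the Stiefel manifold St(3, 2) with the assumed rotation-invariant objective f(X) = ||AX||_F^2 (invariant under X -> XQ for orthogonal Q, since trace is cyclic): its Euclidean gradient G = 2 A'A X satisfies skew(X'G) = 0 at every Stiefel point, i.e. the gradient has no vertical ("internal rotation") component anywhere, not only at critical points.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

X = [[0.6, 0.0], [0.8, 0.0], [0.0, 1.0]]      # X' X = I, a Stiefel point
A = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 3.0]]

G = matmul(matmul(transpose(A), A), X)        # Euclidean gradient, up to a factor 2
XtG = matmul(transpose(X), G)                 # = X' A'A X, a Gram matrix, symmetric
skew = [[(XtG[i][j] - XtG[j][i]) / 2.0 for j in range(2)] for i in range(2)]
print(max(abs(s) for row in skew for s in row) < 1e-12)  # True: no vertical part
```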

There is, however, a long history of people re-discovering tricks for Euclidean optimization that are basic ideas in Riemannian optimization such as natural gradient descent or balancing in fixed rank matrix optimization. So, while everything can be handled optimally by a properly adapted Euclidean optimizer, why ignore a source of good ideas?

Oh for sure, the Riemannian viewpoint is super elegant and a very good viewpoint to have, I’m just saying that for code you don’t need the whole formalism, which can be a bit annoying to implement properly, and is a bit of a turnoff for people who don’t particularly care about manifold stuff. So, when trying to implement manifold functionality into more traditional optimization packages, it’s useful to be more pragmatic and do the poor man’s version, which is not actually that much of a loss.

2 Likes

Well, by way of comparison: if you have p+X and quite a few things that mimic + (the smaller the values of X, the closer you get), then you could also say you do not care about +.
Mathematically, exp is the single thing that realises (walking along) shortest paths. What you want numerically is something that is efficient to compute (and exp might not be) and as close as possible to exp, because otherwise you are imprecise, which in the best of cases just costs you a few iterations and in the worst of cases makes your algorithm not work.

As soon as you have Hessians involved you already need second-order retractions (that is, retractions that approximate the exponential map not only to first order but to second order), otherwise your algorithms will not converge…

No worries, we are on a slight detour from defining domains, but I love these discussions, also to get different views on how people work with / think about / … optimization on manifolds. You are correct that numerically one often prefers retractions, for example, and other simplifications, to be faster – one just has to be careful.

Note that the Rayleigh quotient is also constant along rays (in particular along the ray through an eigenvector), and Newton fails there – so the degeneracy might have effects – but you are correct, the gradient would indeed be zero there as well; sorry, I was not careful enough there!

I think there is a tradeoff here. If you are not careful enough – or have one of the more complicated manifolds where the whole formalism is actually necessary – you might run into errors that can be hard to track down. They happen due to “not being exact enough”.
My personal goal is to care about the whole formalism but also allow an “easy entry” (i.e. do all the “annoying” details myself so others do not have to), so that people that do not care, can at least say “that piece of software will do it right”.
And sure, this is my personal opinion – I would like to provide it correctly. The pragmatic approach (of maybe leaving out some nasty details – let’s maybe not call it “poor man’s” too often) might lead people into a wrong “belief” (for lack of a better word in English)?

Maybe as an example: we do have default_retraction_method(M), which returns the (currently available) most reasonable retraction (in the sense of a tradeoff between fast and accurate), so when calling gradient_descent(M, cost, gradient) this one is chosen, and the user does not have to care about retraction / exponential map / … details.
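The design pattern is easy to sketch outside Julia: the solver asks the manifold for its default retraction unless the user overrides it. (Python stand-in for the Julia dispatch; the table entries and names here are illustrative, not the Manopt.jl API.)

```python
# Hypothetical per-manifold defaults, standing in for Julia method dispatch:
DEFAULT_RETRACTION = {
    "Sphere": "projection",
    "Grassmann": "polar",
    "Stiefel": "qr",
}

def default_retraction_method(manifold):
    """Fall back to the exponential map when no cheaper default is known."""
    return DEFAULT_RETRACTION.get(manifold, "exponential")

def gradient_descent(manifold, cost, gradient, retraction=None):
    r = retraction or default_retraction_method(manifold)
    return f"running gradient descent on {manifold} with the {r} retraction"

print(gradient_descent("Grassmann", None, None))        # uses the polar default
print(gradient_descent("Grassmann", None, None, retraction="exponential"))
```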

I think we both have the same goal … making manifolds more accessible – just opinions on the “how” are different. But thanks for all your arguments, it is always good to hear other ideas as well and to rethink ones own :slight_smile:

Why do you say this? You can definitely derive and implement proper Newton algorithms that are quadratically convergent without “second-order retractions” (I’ve done it on the Grassmann manifold with the “orthogonalization retraction”). The Hessian is intrinsically defined as a quadratic form on the tangent space (without reference to a retraction or a connection or any of this) at critical points (this is not true elsewhere), so near a critical point it doesn’t matter which retraction you take?

Oh, maybe you need it for the convergence rate not for convergence itself.
So it does converge, but probably slower than if you take a good one.

Nope, it converges quadratically all the same. If you’re at a distance eps of a critical point, the leading order (first order in eps) of x_{n+1} - x_n is determined by the Newton correction equation, which is intrinsic (does not depend on the retraction), so any difference between retractions appears at second order in eps, which does not affect the convergence rate.

Hm, then I remember that wrongly and have to check the literature. I remember that for either the convergence or the convergence rate of quasi-Newton (or was it trust regions?) the second-order part was necessary.

edit: Yes. Absil, Mahony & Sepulchre, p. 116. You get superlinear convergence of Newton’s method with arbitrary retractions (that is, first-order ones) and quadratic convergence with second-order ones.

That’s very marginally a better bound; you get |x_{n+1} - x*| <= C |x_n - x*|^2 in both cases, it’s just a smaller C with a second-order retraction. So you always converge quadratically; it’s just that the technical proof gets slightly easier (and therefore you get a smaller constant) with a second-order retraction. I doubt the magnitude of C has any relevance other than for the simplicity of the proof (it’s just a bound anyway; it’s not expected to be tight…)