An Idea for Optimization as if Solving an ODE

I still don’t see how they clash; I should clarify more, I think. I am not saying that we should follow plain gradient descent. What we can do here is make the flow follow the Newton flow as closely as possible, or exactly. Then step = 1 is not important, since ODE solvers can take large steps on linear problems. However, there are also nonlinear problems where you traditionally follow the Newton flow locally but usually do not look further ahead or take large steps; it may help in those cases. When I made up the example in the OP, I was not really saying that du has to be the gradient-descent direction, or that the problem has to be linear or quadratic. It was just shown as an example. We can still incorporate ideas from the state-of-the-art existing optimizers here for what du can be.
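For concreteness, here is a minimal sketch of the basic setup I have in mind, assuming the standard DifferentialEquations.jl interface; the test function, solver, and tolerances are just placeholders, not part of the proposal:

```julia
# Gradient flow du/dt = -∇f(u): run an ODE solver toward steady state.
using OrdinaryDiffEq, ForwardDiff

f(u) = (u[1] - 1)^2 + 100 * (u[2] - u[1]^2)^2      # Rosenbrock test function
grad!(du, u, p, t) = (du .= -ForwardDiff.gradient(f, u))

prob = ODEProblem(grad!, [-1.2, 1.0], (0.0, 1e3))  # long horizon ≈ "to convergence"
sol = solve(prob, Tsit5(); reltol = 1e-8, abstol = 1e-8)
@show sol.u[end]                                   # should approach [1, 1]
```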

Please don’t take this the wrong way, but numerical optimization is a very well established field. Generations of researchers have studied these problems in detail, characterizing the convergence properties and trade-offs of various solvers. Improving on the current best algorithms may take more than some heuristic idea.

If you want a nice introduction, I would recommend

Nocedal, Jorge, and Stephen Wright. Numerical Optimization. Springer, 2006.


Well absolutely. You start with your ODE solver which you tell to follow gradient descent. Then you change the descent direction to a more appropriate one. Maybe you use the Hessian to follow a suitably regularized Newton flow, maybe you incorporate some curvature information with a quasi-Newton algorithm. Then you change the stepsize control to adapt the stepsize based on a more sensible criterion than ODE trajectory accuracy - for instance, the decrease of the objective function. Maybe you can even ensure you don’t take too small steps with a clever line search. Congratulations, you’ve now implemented a state-of-the-art optimizer.
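To make that progression concrete, here is a stripped-down sketch of roughly where you end up; the regularization, Armijo constant, and tolerances are illustrative choices, not canonical ones:

```julia
# Regularized Newton direction plus a backtracking line search on the
# objective, i.e. what remains once ODE error control is replaced by an
# optimization criterion.
using LinearAlgebra, ForwardDiff

function damped_newton(f, x; tol = 1e-10, maxiter = 100)
    for _ in 1:maxiter
        g = ForwardDiff.gradient(f, x)
        norm(g) < tol && break
        H = ForwardDiff.hessian(f, x)
        d = -((H + 1e-8I) \ g)                     # regularized Newton direction
        t = 1.0
        while f(x + t * d) > f(x) + 1e-4 * t * dot(g, d) && t > 1e-12
            t /= 2                                 # backtrack until sufficient decrease
        end
        x += t * d
    end
    return x
end

f(u) = (u[1] - 1)^2 + 100 * (u[2] - u[1]^2)^2
@show damped_newton(f, [-1.2, 1.0])                # → [1, 1]
```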

I’m not saying that what you suggest is intrinsically a bad idea; it’s just that this is a thing that has been studied to death over decades, and just plugging an ODE solver into an optimization problem is not by itself a sufficiently new idea to beat standard schemes. However, you never know: sometimes sufficiently different ideas do get you something useful - for instance, FIRE (http://users.jyu.fi/~pekkosk/resources/pdf/FIRE.pdf) is supposedly better than standard algorithms on certain types of optimization problems.


I suppose I am somewhat cynical, but arXiv is full of papers describing optimization algorithms that happen to beat existing methods on a few specific test problems. I have implemented quite a few myself and found that it is worth being a bit conservative, especially if one cares about worst-case scenarios in a broad class of nonlinear problems.


In addition to the good points mentioned already, following the surface in high dimensions is going to lead you to a local optimum. Even things like LBFGS from the Optim package were blown out of the water by BlackBoxOptim’s differential-evolution strategies on a problem I was working on yesterday, because the problem had all sorts of local minima.
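Something like this sketch shows the effect (Rastrigin here is just a stand-in for my actual problem; solver settings are the defaults):

```julia
# A local quasi-Newton method vs. a differential-evolution global search
# on a function with many local minima.
using Optim, BlackBoxOptim

rastrigin(x) = 10 * length(x) + sum(xi^2 - 10 * cos(2π * xi) for xi in x)

local_res = Optim.optimize(rastrigin, fill(3.0, 5), LBFGS())   # stalls in a nearby basin
global_res = bboptimize(rastrigin; SearchRange = (-5.12, 5.12),
                        NumDimensions = 5, Method = :adaptive_de_rand_1_bin)
@show Optim.minimum(local_res)      # far from 0
@show best_fitness(global_res)      # near 0 (global minimum)
```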


Dismissing the idea upfront as if repurposing were a bad idea, and then saying that sufficiently different ideas will get you somewhere, is not constructive. I even got book advice :clap: :clap:. What a great day to start a discussion. I was not expecting such cynicism here. With this type of cynicism, Julia would never have been invented, given the plethora of other languages, Python being the most prominent. I know I should not expect only encouragement. However, I expected at least some improvements on the idea. @dlakelan if the problem is nonconvex you’ll get stuck in local minima with LBFGS too. That’s not the problem here. I am not claiming it is a panacea for all problems.

Well I mean, you asked for feedback, and I’ve given you my feedback respectfully, as have other people here. I won’t claim to be an optimization or diff eq expert, but I did implement a few of these methods, taught them, and even have a few papers in related areas. Unbridled optimism is good for blue-sky fields; it’s less useful for something that has been around as long as numerical optimization. Again, feel free to dismiss what we’re saying completely: you’ll certainly learn a lot more this way, at least. Just don’t post on forums in this case.

I just want to mention that of all the projects you could have picked to illustrate your point, you’ve picked one of the absolute worst. Julia was invented with laser-sharp focus to solve a very specific problem (the two-language problem in numerical computing), and generalized from there.

@Tamas_Papp Yeah I share your skepticism, that’s why I said supposedly :stuck_out_tongue:


In addition to what @antoine-levitt said: if you want people to buy into a proposal to the extent that they invest time on improving it, you may want to consider rigging up at least a demonstration. You can find a collection of problems at e.g.


I should say that turning a problem into a differential equation, and then looking at the properties of the paths that the differential equation takes in order to discover facts about a mathematical object, is a super effective strategy for sampling from many Bayesian posterior distributions. This is known as Hamiltonian Monte Carlo, and for many problems it’s the best thing going. So you might want to look at that field.

A key difference there is that you actually do care about the function at all the points along the path… as opposed to optimization, where ideally you’d just consult an oracle and it’d tell you that the minimum occurs at location x.
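In case it helps, here is a minimal sketch of the mechanism, with a standard normal target just for illustration (real HMC implementations also tune the step size, path length, and mass matrix):

```julia
# Leapfrog integration of Hamiltonian dynamics: the trajectory endpoint
# becomes a Metropolis proposal, so the whole path matters.
using LinearAlgebra

neg_logp(q) = 0.5 * dot(q, q)            # -log density of N(0, I)
grad_neg_logp(q) = q

function leapfrog(q, p, ϵ, L)
    p -= 0.5ϵ * grad_neg_logp(q)         # initial half step in momentum
    for _ in 1:L-1
        q += ϵ * p
        p -= ϵ * grad_neg_logp(q)
    end
    q += ϵ * p
    p -= 0.5ϵ * grad_neg_logp(q)         # final half step in momentum
    return q, p
end

q0, p0 = randn(2), randn(2)
q1, p1 = leapfrog(q0, p0, 0.1, 20)
H(q, p) = neg_logp(q) + 0.5 * dot(p, p)  # Hamiltonian ≈ conserved along the path
@show H(q0, p0) - H(q1, p1)              # small; the accept/reject step uses this
```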


FWIW, that’s just using a method for stiff ODEs on the same gradient flow, since the Jacobian of the gradient gives you the Hessian which is then used to step. This means Rosenbrock-W methods are quite close to doing exactly what you described. But yes, the issue is that their adaptivity is doing the wrong thing, since it assumes you really care about following the trajectory instead of finding the minimum value.
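A sketch of what I mean, on the same kind of gradient-flow setup (the test function is illustrative; the solver differentiates the RHS with ForwardDiff by default, so the Hessian shows up automatically):

```julia
# A Rosenbrock-W method on the gradient flow: the solver's Jacobian of
# -∇f is the negative Hessian, which it uses to take implicit-like steps.
using OrdinaryDiffEq, ForwardDiff

f(u) = (u[1] - 1)^2 + 100 * (u[2] - u[1]^2)^2
grad!(du, u, p, t) = (du .= -ForwardDiff.gradient(f, u))

prob = ODEProblem(grad!, [-1.2, 1.0], (0.0, 1e3))
sol = solve(prob, Rosenbrock23(); reltol = 1e-6)   # order 2/3 Rosenbrock-W method
@show sol.u[end]
```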


There are papers showing that standard adaptive schemes will not reproduce correct long-time dynamics and, in particular, will not converge to an equilibrium. I can’t remember the names :(. If somebody happens to know them I’d be grateful for the references.

I sent this thread to a friend who wrote

Possibly relevant: this work (https://arxiv.org/abs/1805.07199) constructs a generalisation of gradient descent by discretising the gradient-flow ODE using a Runge-Kutta-Chebyshev method. The point is not that this lets you track the flow more accurately, but that this choice of discretisation (designed for solving stiff ODEs with an explicit integrator) lets you use a larger time step and converge more quickly. Some similar ideas are explored in https://arxiv.org/abs/1805.00521.

and I thought I’d pass on the message in case those ideas are of interest.


You can do this by trying ROCK2 or ROCK4. It’s a fun research project.
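For instance (same illustrative gradient-flow setup as above; ROCK’s internal spectral-radius estimate picks the stage number each step):

```julia
# ROCK2: a stabilized explicit (Runge-Kutta-Chebyshev-type) method on the
# gradient flow; ROCK4() is the fourth-order variant.
using OrdinaryDiffEq, ForwardDiff

f(u) = (u[1] - 1)^2 + 100 * (u[2] - u[1]^2)^2
grad!(du, u, p, t) = (du .= -ForwardDiff.gradient(f, u))

prob = ODEProblem(grad!, [-1.2, 1.0], (0.0, 1e3))
sol = solve(prob, ROCK2())
@show sol.u[end]
```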


Thanks a ton. I’ll look into them.


It’s not cynicism, it’s simply people telling you that you are a newcomer to the field and that the basic idea you are proposing was tried and rejected decades ago. This happens a lot when anyone tries to propose new ideas in relatively mature fields.

It’s possible that something was missed in this area, of course, and that those old ideas should be revisited! But as a basic rule of thumb, if you want to resurrect an old idea, you should first go back and understand why people stopped using it (some of the reasons were summarized by @antoine-levitt for you). (Maybe those old reasons for rejecting it are no longer valid due to new developments, or maybe you can invent a variant that circumvents the problem!) If you don’t do that, then you are likely to waste a lot of time rediscovering the same difficulties.

PS. In density functional theory (DFT) for solid-state physics, the steepest-descent ODE approach is called “molecular dynamics” energy minimization, and it was largely abandoned by the 1990s in favor of nonlinear conjugate-gradient methods; see e.g. this review article. There are, however, “damped” dynamics variants with “momentum” corrections (as they are called for stochastic gradient-descent algorithms like Adam) that turn out to be closely related to conjugate gradient, but IIRC they tend to have more specialized uses (e.g. optimizing noisy functions).


Thanks for the review paper. But the early reactions assumed that I don’t know the field or am just barely getting started. I mostly know how it evolved, and I know I should not use steepest descent everywhere. My example is just a naive one, chosen for ease and fast prototyping. I coded quasi-Newton methods myself to some extent while taking an optimization course but, of course, there is plenty of room for me to improve.

As for the abandonment of SD, we can still make du out of a conjugate-gradient method or a damped/momentum dynamics method. I just wanted to explore the idea of incorporating an ODE solver in between, and to get feedback on similar research or pointers to papers about it, not plain objections without concrete references like the first couple of responses.
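For instance, the heavy-ball/momentum version is just a damped second-order ODE, u'' + γu' = -∇f(u), which an ODE solver can integrate as a first-order system (γ and the test function here are illustrative):

```julia
# Heavy-ball dynamics as an ODE: state = [position; velocity].
using OrdinaryDiffEq, ForwardDiff

f(u) = (u[1] - 1)^2 + 100 * (u[2] - u[1]^2)^2
const γ = 3.0                                    # damping coefficient (illustrative)

function heavyball!(dz, z, p, t)
    u, v = z[1:2], z[3:4]
    dz[1:2] .= v
    dz[3:4] .= -ForwardDiff.gradient(f, u) .- γ .* v
end

prob = ODEProblem(heavyball!, [-1.2, 1.0, 0.0, 0.0], (0.0, 1e3))
sol = solve(prob, Tsit5())
@show sol.u[end][1:2]                            # position → [1, 1]
```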

I still think that utilizing ideas from ROCK-type methods (extensions of Runge-Kutta-Chebyshev methods for better stability in the complex plane) and other methods used for stabilizing stiff ODEs could be helpful, since I haven’t really found methods that are very effective for highly ill-conditioned local optimization problems where a nonlinear preconditioner is hard to specify. That said, since you don’t care about “accuracy”, you might as well use a low-order method, which means that implicit Euler or Rosenbrock-Euler methods might be the things to look at. And indeed, if you search the literature, you find that recent research on proximal optimization methods has a lot of links to implicit Euler:
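The link is easy to check numerically. A sketch, with an illustrative step size h: the implicit Euler step is solved by a few Newton iterations, the proximal step by direct minimization, and the two agree up to solver tolerance because they share the same first-order condition:

```julia
# Implicit Euler on du/dt = -∇f(u) solves v + h∇f(v) = u, which is exactly
# the first-order condition of the proximal update
#   v = argmin_w f(w) + ‖w - u‖² / (2h).
using Optim, ForwardDiff, LinearAlgebra

f(u) = (u[1] - 1)^2 + 100 * (u[2] - u[1]^2)^2

function implicit_euler_step(f, u, h; iters = 20)
    v = copy(u)
    for _ in 1:iters                              # Newton on g(v) = v + h∇f(v) - u
        g = v .+ h .* ForwardDiff.gradient(f, v) .- u
        J = I + h .* ForwardDiff.hessian(f, v)
        v = v - J \ g
    end
    return v
end

h, u0 = 0.1, [-1.2, 1.0]
v_ie = implicit_euler_step(f, u0, h)
res = Optim.optimize(w -> f(w) + sum(abs2, w .- u0) / (2h), u0, LBFGS())
@show norm(v_ie - Optim.minimizer(res))           # ≈ 0: same update
```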


Can ROCK really beat optimised momentum descent, heavy ball, and relatives?

Why not mix them? I didn’t say to use the ROCK method directly, I said to use the ideas from it. I haven’t found momentum-based methods stable enough on any of the problems I’ve tried, but a ROCK-type idea would remedy that by linking the condition-number estimate to the number of stages and automatically choosing a stage count that ensures each step is stable. If you do that on an ODE with momentum, you can probably get the stabilization provided by traditional momentum-based methods while being able to handle much higher condition numbers.

FWIW, the instability of momentum-based methods on highly ill-conditioned problems was causing us a ton of problems recently, so we started looking into alternatives. The solution we are using now is Hessian-free Newton-Krylov (which is why all of the differential equation solvers have second-order sensitivity analysis for Hv products), but I am not convinced that momentum methods can’t be tweaked to be more stable; it’s just that ADAM, RMSProp, Nesterov, etc. didn’t seem to do it (BFGS is tricky as well because of the initial Hessian). I haven’t looked into whether there is anything that mixes proximal optimization, Hessian-vector products, and momentum methods specifically to target ill-conditioned problems, but if anyone has anything for this I’d be curious to get it set up with DiffEqFlux.
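For reference, the Hessian-free ingredient is a Hessian-vector product computed as a directional derivative of the gradient, so the Newton system Hd = -g can be handed to a Krylov method (e.g. CG) without ever forming H. A sketch, using forward-over-forward AD for simplicity (for large problems you’d use forward-over-reverse):

```julia
# Hv = d/dε ∇f(u + εv) |_{ε=0}, computed without materializing the Hessian.
using ForwardDiff, LinearAlgebra

f(u) = (u[1] - 1)^2 + 100 * (u[2] - u[1]^2)^2

hvp(f, u, v) = ForwardDiff.derivative(ε -> ForwardDiff.gradient(f, u .+ ε .* v), 0.0)

u, v = [-1.2, 1.0], [1.0, 0.0]
@show hvp(f, u, v) ≈ ForwardDiff.hessian(f, u) * v   # true
```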

For DiffEqFlux, it might resemble adaptive control. Instead of evaluating the cost only at the final time, we could update the parameters as if they were also states of the ODE, alongside the normal states. The parameter derivatives could come from any of the optimizers. It might work, or at least find locally acceptable parameters at each time step.
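A toy sketch of what I mean, with made-up dynamics, cost, and gain (this is the classic online-gradient/adaptive-identification setup, not DiffEqFlux API):

```julia
# Carry a parameter p as an extra ODE state; let dp/dt be the gradient flow
# of an instantaneous squared prediction error.
using OrdinaryDiffEq

input(t) = sin(t)                     # excitation signal
measure(t) = 2.5 * input(t)           # "data" generated by true parameter 2.5
const η = 10.0                        # adaptation gain (illustrative)

function augmented!(dz, z, _, t)
    p = z[1]
    e = measure(t) - p * input(t)     # instantaneous prediction error
    dz[1] = η * e * input(t)          # dp/dt = -η ∂(e²/2)/∂p
end

prob = ODEProblem(augmented!, [0.0], (0.0, 50.0))
sol = solve(prob, Tsit5())
@show sol.u[end]                      # p → 2.5
```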