There are plenty of methods to solve a minimization problem. Many of them iteratively solve for the parameters by decrementing them by a fraction (determined by a learning rate) of the partial derivatives of the objective function. Why not just solve an ODE instead of inventing optimizers? What can go wrong with this? An MWE is below.
```julia
using LinearAlgebra, OrdinaryDiffEq, Random
Random.seed!(1234)

# Solve for U in A*U = B, i.e. argmin (1/2)(A*U - B)'(A*U - B)  {+ (1/2)U'U for minimum norm}
A = rand(50, 5)
U = rand(5, 2) # actual solution
B = A*U
AtA = A'*A
AtB = A'*B
p = (.-AtA, AtB)

# Right-hand side of the gradient flow du/dt = A'B - A'A*u
function st!(du, u, p, t)
    AtA, AtB = p               # note: p holds -A'A in its first slot
    du .= AtB                  # add `.- u` here for the minimum-norm solution
    mul!(du, AtA, u, 1.0, 1.0) # du = (-A'A)*u + du
    return nothing
end

U0 = rand(size(U)...)
tspan = (0.0, 100.0)
prb = ODEProblem(st!, U0, tspan, p)
sol = solve(prb, Tsit5(), dense=false, save_everystep=false, dt=1e-5)
sum(abs2, B .- A*sol.u[end])
```
It gets a crude solution, though this may not be the best example. Advantages: it has adaptive stepping, checks for instability, uses a higher-order approximation, and we can even use it to find a steady-state solution. Disadvantages: Tsit5 costs about 6-7 function evaluations per step, but in exchange gives a 5th-order approximation, while familiar optimizers go at most to 2nd order (Hessians). What are your thoughts? Is there any research on this?
Yeah, there are a bunch of papers doing things like this. It’s just a bad idea: ODE solvers are made to follow a curve; optimizers don’t care about the curve, just the endpoint. The order, adaptivity, etc. are just not relevant. Furthermore, the ODE you’re using is essentially gradient descent, which is a bad optimization method. There are tricks you can do in optimization inspired by the ODE viewpoint (e.g. inertia), so it’s certainly a good perspective to keep in mind, but it’s generally a bad idea to use ODE solvers to minimize things.
Hi @antoine-levitt, thanks for the discussion. I know it is just gradient descent; I just wanted to make up the simplest example. Do you mind sharing such papers? I don’t see how following a curve and finding the endpoint should clash. Also, do you have a concrete argument that it is a bad idea in general, if you want to share?
The heuristic is that repurposing X for doing Y, when people trained in X have been using Z for Y for decades, is a bad idea, unless you have a good reason to believe the contrary (here X is diff eqs, Y is optimization, and Z is other optimization methods). Basically, solving the gradient flow is strictly harder than finding a local minimum, so there’s no reason to believe that solving the gradient flow will be more helpful than methods that target minimization directly. More concretely, the gradient does not point very accurately towards the minimum; looking for a direction that does point towards the minimum (for a quadratic function) leads you to the Newton flow, but at that point you want to take steps of size 1, so the ODE viewpoint is not very useful.
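To make that last point concrete, here is a minimal sketch (with made-up data: `H`, `b`, and the starting point are arbitrary, not from the thread) showing that on a quadratic the Newton direction points exactly at the minimizer, while the gradient generally does not:

```julia
# Quadratic model: f(u) = (1/2) u'H u - b'u, with minimizer u* = H \ b
H = [4.0 1.0; 1.0 3.0]   # a small SPD Hessian (made-up example data)
b = [1.0, 2.0]
ustar = H \ b

u = [10.0, -7.0]         # arbitrary starting point
g = H*u - b              # gradient at u: generally does NOT point at u*

# Newton direction: a single step of size 1 along -H\g lands exactly on u*
u_newton = u - H \ g
@show u_newton ≈ ustar   # true
```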
Regarding papers, I would be surprised if there are not a thousand papers describing this for machine learning, but the only concrete examples I know of are in computational quantum physics, where it’s known under the poetic name of “imaginary time”. As I said, it’s definitely a useful point of view for getting insight (an example is the quantum Monte Carlo method, where you reformulate a minimization problem as a PDE and then exploit the relationship with stochastic differential equations) but not for designing concrete methods for general-purpose minimization.
As pointed out by @antoine-levitt, what you are proposing is a kind of gradient descent, except with very small stepsizes. Compared to conventional (quasi-Newton, trust-region) methods, you are losing efficiency on two fronts:
- gradient descent has significantly worse (local) convergence properties,
- a lot of unnecessary evaluations are done for interim points you do not really care about.
Also, the lackluster performance for a simple linear system should be a warning: it only gets worse for difficult nonlinear problems.
I still don’t see how they clash. I should clarify more, I think. I am not saying that we should follow plain gradient descent. What we can do here is make the gradient flow follow the Newton flow as closely as possible, or exactly. Then step=1 is not important, since ODE solvers can take large steps in linear problems. However, there are also nonlinear problems where you traditionally follow the Newton flow locally but do not see further and do not take large steps. It may help in these cases. When I made up the example in the OP, I was not really saying `du` has to be the gradient descent direction, or that the problem has to be linear or quadratic or whatever. It is just shown as an example. We can still incorporate ideas from state-of-the-art existing optimizers here for what `du` can be.
Please don’t take this the wrong way, but numerical optimization is a very well established field. Generations of researchers have studied these problems in detail, characterizing the convergence properties and trade-offs of various solvers. Improving on the current best algorithms may take more than some heuristic idea.
If you want a nice introduction, I would recommend
Nocedal, Jorge, and Stephen Wright. *Numerical Optimization*. Springer, 2006.
Well absolutely. You start with your ODE solver which you tell to follow gradient descent. Then you change the descent direction to a more appropriate one. Maybe you use the Hessian to follow a suitably regularized Newton flow, maybe you incorporate some curvature information with a quasi-Newton algorithm. Then you change the stepsize control to adapt the stepsize based on a more sensible criterion than ODE trajectory accuracy - for instance, the decrease of the objective function. Maybe you can even ensure you don’t take too small steps with a clever line search. Congratulations, you’ve now implemented a state-of-the-art optimizer.
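As a concrete illustration of that step-size point, here is a minimal sketch (names like `descend` are illustrative, not from any package; the usage lines reuse `A`, `B`, and `U` from the OP’s MWE) of gradient descent whose "step size controller" watches the decrease of the objective (Armijo backtracking) rather than ODE trajectory accuracy:

```julia
using LinearAlgebra

# Gradient descent with Armijo backtracking: the step adapts to objective
# decrease, not to how accurately we track the gradient-flow trajectory.
function descend(f, grad, u0; t0=1.0, c=1e-4, shrink=0.5, tol=1e-8, maxiter=10_000)
    u = copy(u0)
    for _ in 1:maxiter
        g = grad(u)
        norm(g) < tol && break
        fu, gg = f(u), sum(abs2, g)
        t = t0
        # shrink the step until the sufficient-decrease condition holds
        while f(u .- t .* g) > fu - c * t * gg
            t *= shrink
        end
        u .-= t .* g
    end
    return u
end

# Reusing the OP's least-squares example:
f(u) = 0.5 * sum(abs2, A*u .- B)
grad(u) = A' * (A*u .- B)
descend(f, grad, rand(size(U)...))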
I’m not saying that what you suggest is intrinsically a bad idea; it’s just that this is a thing that has been studied to death over decades, and just plugging an ODE solver into an optimization problem is not by itself a sufficiently new idea to beat standard schemes. However, you never know: sometimes sufficiently different ideas get you something useful; for instance, http://users.jyu.fi/~pekkosk/resources/pdf/FIRE.pdf is supposedly better than standard algorithms at certain types of optimization problems.
I suppose I am somewhat cynical, but arXiv is full of papers describing optimization algorithms that happen to beat existing methods for a few specific test problems. I have implemented quite a few myself and found that it is worth being a bit conservative, especially if one cares about worst-case scenarios in a broad class of nonlinear problems.
In addition to the good points mentioned already, following the surface in high dimensions is going to lead you to a local optimum. Even something like LBFGS from the Optim package was blown out of the water by BlackBoxOptim, using differential evolution strategies, on a problem I was working on yesterday, because the problem had all sorts of local minima.
Dismissing the idea upfront as if repurposing is a bad idea, and then saying sufficiently different ideas will get you somewhere, is not constructive. I even got a book recommendation. What a great day to start a discussion. I was not expecting such cynicism here. With this type of cynicism, Julia would never have been invented when there is a plethora of other languages, Python being the most prominent. I know I should not expect just encouragement. However, I expected some improvements on the idea at least. @dlakelan, if the problem is nonconvex you’ll get local minima in LBFGS too. That’s not the problem here. I am not claiming it is a panacea for all problems.
Well, I mean, you asked for feedback, and I’ve given you my feedback respectfully, as have other people here. I won’t claim to be an optimization or diff eq expert, but I did implement a few of these methods, taught them, and even have a few papers in related areas. Unbridled optimism is good for blue-sky fields; it’s less useful for something that has been around as long as numerical optimization. Again, feel free to dismiss what we’re saying completely: you’ll certainly learn a lot more this way, at least. Just don’t post on forums in this case.
I just want to mention that of all the projects you could have picked to illustrate your point, you’ve picked one of the absolute worst. Julia was invented with laser-sharp focus to solve a very specific problem (the two-language problem in numerical computing), and generalized from there.
@Tamas_Papp Yeah, I share your skepticism; that’s why I said "supposedly".
In addition to what @antoine-levitt said: if you want people to buy into a proposal to the extent that they invest time on improving it, you may want to consider rigging up at least a demonstration. You can find a collection of problems at e.g.
I should say that turning a problem into a differential equation, and then looking at the properties of the paths that the differential equation takes to discover facts about a mathematical object, is a super effective strategy for sampling from many Bayesian posterior distributions. This is known as Hamiltonian Monte Carlo, and for many problems it’s the best thing going. So you might want to look at that field.
A key difference there is that you actually do care about the function at all the points along the path, as opposed to optimization, where ideally you’d just consult an oracle and it would tell you that the minimum occurs at location x.
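For readers unfamiliar with HMC, here is a rough sketch of a single proposal (everything here is hypothetical: `hmc_step`, `logp`, and `gradlogp` are stand-in names for the log-density and its gradient, and the step size `ϵ` and path length `L` are unspecified tuning parameters):

```julia
# One HMC proposal: simulate Hamiltonian dynamics with a leapfrog integrator
# along the path, then accept/reject to correct for integration error.
function hmc_step(logp, gradlogp, q; ϵ=0.1, L=20)
    p = randn(length(q))                   # resample momentum
    q_new, p_new = copy(q), copy(p)
    p_new .+= (ϵ/2) .* gradlogp(q_new)     # half step in momentum
    for _ in 1:L-1
        q_new .+= ϵ .* p_new               # full step in position
        p_new .+= ϵ .* gradlogp(q_new)     # full step in momentum
    end
    q_new .+= ϵ .* p_new
    p_new .+= (ϵ/2) .* gradlogp(q_new)     # final half step in momentum
    # Metropolis correction: compare total energies -logp(q) + |p|^2/2
    log_accept = (logp(q_new) - 0.5*sum(abs2, p_new)) -
                 (logp(q)     - 0.5*sum(abs2, p))
    return log(rand()) < log_accept ? q_new : q
end
```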
FWIW, that’s just using a method for stiff ODEs on the same gradient flow, since the Jacobian of the gradient gives you the Hessian which is then used to step. This means Rosenbrock-W methods are quite close to doing exactly what you described. But yes, the issue is that their adaptivity is doing the wrong thing, since it assumes you really care about following the trajectory instead of finding the minimum value.
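As a minimal sketch of that (reusing `prb`, `A`, and `B` from the OP’s MWE; `Rosenbrock23` is just one readily available Rosenbrock-type method in OrdinaryDiffEq, not necessarily the best choice here):

```julia
# Same gradient flow, but with a Rosenbrock method: the Jacobian of the
# right-hand side is -A'A, i.e. minus the Hessian of the objective, so each
# internal linear solve is Newton-like.
sol_stiff = solve(prb, Rosenbrock23(), save_everystep=false)
sum(abs2, B .- A*sol_stiff.u[end])
```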
There are papers showing that standard adaptive schemes will not reproduce the correct long-time dynamics and, in particular, will not converge to an equilibrium. I can’t remember the names :(. If somebody happens to know them, I’d be grateful for the references.
I sent this thread to a friend, who wrote:
> possibly relevant: this work ([1805.07199] Explicit Stabilised Gradient Descent for Faster Strongly Convex Optimisation) constructs a generalisation of gradient descent by discretising the gradient-flow ODE using a Runge-Kutta-Chebyshev method. The point is not that this lets you track the flow more accurately, but that this choice of discretisation (designed for solving stiff ODEs with an explicit integrator) lets you use a larger time step, and converge more quickly. Some similar ideas are explored in ([1805.00521] Direct Runge-Kutta Discretization Achieves Acceleration).
and I thought I’d pass on the message in case those ideas are of interest.
You can do this by trying ROCK2 or ROCK4. It’s a fun research project.
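For anyone who wants to try, a minimal sketch (again reusing `prb`, `A`, and `B` from the OP’s MWE):

```julia
# Stabilized explicit (Runge-Kutta-Chebyshev) methods: still explicit, but
# with an extended stability region along the negative real axis, which
# suits the dissipative gradient flow.
sol_rock = solve(prb, ROCK2(), save_everystep=false)
sum(abs2, B .- A*sol_rock.u[end])
```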
Thanks a ton. I’ll look into them.
It’s not cynicism, it’s simply people telling you that you are a newcomer to the field and that the basic idea you are proposing was tried and rejected decades ago. This happens a lot when anyone tries to propose new ideas in relatively mature fields.
It’s possible that something was missed in this area, of course, and that those old ideas should be revisited! But as a basic rule of thumb, if you want to resurrect an old idea, you should first go back and understand why people stopped using it (some of the reasons were summarized by @antoine-levitt for you). (Maybe those old reasons for rejecting it are no longer valid due to new developments, or maybe you can invent a variant that circumvents the problem!) If you don’t do that, then you are likely to waste a lot of time rediscovering the same difficulties.
PS. In density functional theory (DFT) for solid-state physics, the steepest-descent ODE approach is called “molecular dynamics” energy minimization, and it was largely abandoned by the 1990s in favor of nonlinear conjugate-gradient methods; see e.g. this review article. There are “damped” dynamics variants or “momentum” corrections (as they are called for stochastic gradient descent algorithms like Adam), however, that turn out to be closely related to conjugate gradient, but IIRC they tend to have more specialized uses (e.g. optimizing noisy functions).
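On that "damped"/"momentum" connection, here is a minimal sketch (the name `heavy_ball`, the `grad` function, and the parameter values are all hypothetical) of heavy-ball momentum, which can be read as a crude discretization of damped second-order dynamics:

```julia
# Heavy-ball / momentum gradient descent: a discretization of the damped
# dynamics u'' = -∇f(u) - γ u', with damping folded into β.
function heavy_ball(grad, u0; η=1e-2, β=0.9, iters=10_000)
    u, v = copy(u0), zero(u0)
    for _ in 1:iters
        v .= β .* v .- η .* grad(u)   # velocity: damping via β, forcing via -∇f
        u .+= v
    end
    return u
end
```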