Parallel newton-raphson optim

parallel
optim

#1

Anyone has implemented newton-raphson (BFGS or L-BFGS) in parallel in Julia? The flow is super simple but somehow I don’t see how to do it with Optim.
It is as follows given x of dimension N. Use N+1 processors, one processor evaluates the function at the f(x) and the remaining calculate each element of the gradient.
thanks


#2

This doesn’t seem to have much to do with Optim, but rather how he objective is evaluated?


#3

No, this is suggesting serial evaluation of f, but doing that in parallel with the gradient calculation(s).


#4

If you calculate each element of a gradient separately, and it is so expensive that you want to parallelize it, you are doing it wrong.

Computing all the elements of the gradient should be about as expensive as computing the objective function once. Google “adjoint method”.


#5

Thanks. and interesting. I want to spend more time to understand. Now, It is very common to calculate a gradient numerically by taking a small perturbation of the function around the current guess. And It is also very common for fortran/c/matlab users to parallelise Newton-Rapson type of method by sending a perturbation to each processor. We might all be wrong doing it that way but that I would appreciate some more explanation.


#6

This is called a finite-difference approximation. Some people do this when for various reasons they don’t want to spend the effort to implement an analytical derivative. However, in my experience you are usually better off with a “derivative-free” or “black box” optimization algorithm (to which you only supply the objective function, and typically it figures out internally how to approximate the gradient) than trying to use a gradient-based algorithm with finite-difference approximations.

But definitely if it is so expensive that you are parallelizing the finite-difference gradient approximation, you should consider computing the gradient analytically. If you have only a few variables, almost any analytical derivative technique should be fine, and the ForwardDiff.jl package can even do it automatically for you (for sufficiently generic pure-Julia code). If you have lots of variables, you should definitely use an adjoint method. Again, in some cases the adjoint method can be implemented automatically by “reverse mode” autodifferentiation, e.g. the ReverseDiff.jl package.

I posted some notes on adjoint methods for one of my courses: they are actually quite simple to use, once you get used to it (and in essence is just an ordinary chain rule with re-arranged parentheses): http://math.mit.edu/~stevenj/18.336/adjoint.pdf

(For example, many people, including me, used these techniques to do PDE-constrained “topology optimization”, optimizing over hundreds of thousands of variables where the objective function is solving a complicated PDE like Maxwell’s equations. With an adjoint method, you solve the PDE once to get the objective value, and then solve only one more “adjoint” PDE problem to get the derivative with respect to all the variables simultaneously.)


#7

This is a great article thanks for sharing. But isn’t this adjoint method related to solving equality constrained optimization problems with fewer degrees of freedom than variables? Basically it simplifies the derivative computation with respect to the degrees of freedom, given that we know exactly which variables are to be decided, which ones are physical state variables derived from our decisions, and the equations connecting them to each other. So how is this related to unconstrained optimization by Newton-like methods where essentially all variables are degrees of freedom?


#8

when I said “objective” I mean f, gradient, hessian, whatever you want to evaluate about f.


#9

Sorry this is a little off-topic now but interesting for me: Would you say this isreally true in general or typically? I am thinking e.g of quantum chemistry I agree with the statement provided the objective admits a variational principle i.e. you have Hellman-Feynman forces; but there are also other objectives that do not satisfy a variational principle and in those cases one goes from O(N^3) (diagonalise hamiltonian) to O(N^4) (perturbation theory w.r.t. \sim N variables).

It is possible of course the alternative methods exist that I am not aware of and which avoid perturbation theory somehow or use it in a more clever way.


#10

For concreteness, let’s say we want to differentiate I(x) = E(x,y(x)), where y satisfies F(x,y) = 0. Then I’ = dE/dx + <dE/dy, dy/dx> = dE/dx - <dE/dy, (dF/dy)^-1 dF/dx> = dE/dx - <(dF/dy)^-T dE/dy, dF/dx>. This is much better because you only have to invert a system of size ny once (which is about as expensive as a Newton step to solve F(x,y) = 0, and therefore about as expensive as an evaluation of I).

A special case is when F = dE/dy, which corresponds to the case where E is minimized wrt y. Then the second term vanishes and I’ = dE/dx, which is known in electronic structure as Hellman-Feynman.

@cortner, I guess you’re thinking of a case like DFPT where y is the first n eigenvectors of a (say) 100n x 100n matrix? In that case, computing dF/dy would amount to computing something like the polarizability operator using the Adler-Wiser sum-over-states formula, which needs the eigenvectors of the full 100n x 100n matrix. So you can indeed compute forces in O(n^3), but the prefactor is huge. I think Lin has a paper on this.


#11

Hi Antoine - thanks for the connection; this makes sense - maybe we can continue this over email.


#12

The adjoint method is equally applicable to constrained and unconstrained optimization, and in fact to any case where you have N variables and M << N scalar functions whose gradients you want (e.g. a single real objective function for unconstrained optimization). It is a standard technique in large-scale optimization.

And yes, it is generally true that it allows you to efficiently compute the gradient of any function. It has no requirement of a variational principle. This is explained in detail in the notes I linked (and I also have a companion set of notes for recurrence relations, and the Cai reference cited in the notes explains “time-evolution” via differential-algebraic equations). There is a general mathematical theory of “reverse-mode autodifferentiation” (also called “backpropagation” in machine learning) that underlies this.

Some familiar results of perturbation theory from undergraduate physics, like the Hellman–Feynman theorem, fall out as special cases of adjoint methods, but there are other cases that are less familiar.


#13

Interesting. I did not make the connection between the adjoint method and backpropagation before but perhaps the latter could be thought of as a special case of the adjoint method, I will give it a second look, thanks.


#14

PS. Adjoint methods work for functions of eigenvectors too, as explained in my notes. Contrary to what you may have learned in undergrad quantum perturbation theory, you do not need formulas involving sums over “all states” (getting all the eigenvectors is impractical in large problems). The key is that you are only computing the derivatives of a scalar function of the eigenvectors, so you can circumvent explicitly differentiating the eigenvectors themselves.

PPS. One caveat is that eigenvectors and eigenvalues cease to be differentiable at eigenvalue crossings. There are ways to deal with this using generalized derivatives (related to degenerate perturbation theory). Alternatively, in one of my papers we used a different technique to transform an eigenvalue-optimization problem (maximizing band gaps in 3d Maxwell problems with 10^6 degrees of freedom) into a sequence of SDPs: https://doi.org/10.1364/OE.22.022632


#15

@stevengj agreed that the method works on eigenvectors. You can either think of it as a root of (Ax = lambda x, |x| = 1) as you do, or as a critical point of <x,Ax>, but defined on the sphere.

I thought that it didn’t work well in the context we’re interested in because we have a function of many eigenvectors and I thought you’d have to either differentiate those or compute the full eigenvector decomposition, but now I think I was mistaken. Thanks, this might prove useful!

When there are crossings (between occupied and unoccupied states) you have bigger problems anyway, so this is not much of an issue.