Optim: Hessian or Hessian-free?

Hi,

In the documentation for the Optim package I note the following statement with regards to the choice of algorithm:

When you do not have an explicit Hessian or when the dimension becomes large enough that the linear solve in the Newton method becomes the bottleneck, first order methods should be preferred.

Just wondering if anyone is able to quantify a little more (at least approximately) the circumstances under which the dimension is considered large? I’m solving a concave optimization problem (a probit regression model in the log domain) and currently using the Hessian-free L-BFGS algorithm. I’m currently working with a dimension of 100 but hoping to scale it up to 1000, 5000, or more, depending on the optimization performance.

Thanks for any input

The nice modular API of Optim.jl should allow you to simply try out (and benchmark) both options at practically zero cost.
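
For concreteness, here is a minimal sketch of what “just try both” could look like; the Rosenbrock objective is only a placeholder for your probit log-likelihood, and passing autodiff = :forward makes Optim use ForwardDiff for the gradient (and Hessian) instead of finite differences:

using Optim

f(x) = sum(100 .* (x[2:end] .- x[1:end-1].^2).^2 .+ (1 .- x[1:end-1]).^2)  # placeholder objective
x0 = zeros(100)

res_lbfgs  = optimize(f, x0, LBFGS(); autodiff = :forward)             # first order, Hessian-free
res_cg     = optimize(f, x0, ConjugateGradient(); autodiff = :forward) # first order, even lighter
res_newton = optimize(f, x0, Newton(); autodiff = :forward)            # second order, dense Hessian + linear solve

Timing each call (e.g. with BenchmarkTools.@btime) at your target dimension is the quickest way to see where the Newton linear solve starts to dominate.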

1 Like

Thanks @Tamas_Papp, yes you are right. I will have to do that at some point. But I was wondering if there are rules of thumb we can follow when prototyping, instead of having to derive and code the Hessian.

Any thoughts?

You can use forward-mode automatic differentiation, via ForwardDiff.jl, to calculate the Hessian and the gradient.
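
For example (a minimal sketch; the objective below is just a stand-in for the actual log-likelihood):

using ForwardDiff

f(x) = sum(abs2, x) + sum(sin, x)   # placeholder objective
x = rand(100)

g = ForwardDiff.gradient(f, x)      # length-100 gradient
H = ForwardDiff.hessian(f, x)       # 100×100 dense Hessian

These can be wrapped into functions you hand to Optim, or you can let Optim call ForwardDiff for you via the autodiff = :forward keyword.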

2 Likes

If you do not have an explicit Hessian, it is usually not worth coding one. Quasi-Newton methods don’t need one anyway and are rather efficient in real-life problems. Theory characterizes local convergence (quadratic vs superlinear), but that is not the primary concern in practice unless your problem is super well-behaved.

Generally, the curvature variation in your problem is at least as important as the dimension, if not more. A simple probit model in 10³–10⁴ dimensions should be manageable with either BFGS or even CG.

1 Like

You can just run it through ModelingToolkit.jl to automatically generate a symbolic version of your numerical function and then just compute the Hessian on that.

3 Likes

Just curious: it sounds like ModelingToolkit.jl provides the same functionality as ForwardDiff.jl when it comes to auto-differentiation, is that correct?

I really like the interface of ForwardDiff.jl, but it’s my (poorly informed) understanding that forward-mode differentiation is not very computationally efficient (assuming I read the documentation correctly) for the common stats problem of optimizing R^p \rightarrow R^1, which is a bit discouraging. Should we expect better performance out of ModelingToolkit.jl?

Small note: when optimizing high-dimensional problems, it is advised to use L-BFGS rather than BFGS. This is because BFGS stores a dense approximation of the Hessian, which becomes very large and slow to work with in high dimensions. L-BFGS stores only a handful of recent step/gradient pairs, which implicitly define a low-rank approximation of the inverse Hessian; this saves both memory and computation.
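
In Optim.jl the history length is the m parameter of LBFGS. A rough sketch (the objective is again a placeholder):

using Optim

f(x) = sum(abs2, x)        # placeholder objective
x0 = randn(1_000)

# m (default 10) is the number of recent step/gradient pairs L-BFGS keeps,
# so storage grows like m*n instead of the n^2 dense matrix that full BFGS carries.
res = optimize(f, x0, LBFGS(m = 20); autodiff = :forward)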

2 Likes

No, ModelingToolkit.jl isn’t doing automatic differentiation, it is doing symbolic differentiation. It’s weird, because you never wrote the code in the symbolic framework, but what it’s doing is automatically converting numerical Julia code into symbolic Julia code through AST desugaring and then simply using it as if it had been written in a CAS. It’s best shown by an example:

http://juliadiffeq.org/DiffEqTutorials.jl/html/ode_extras/01-ModelingToolkit.html

This is definitely one of those “only in Julia” things :slight_smile:

2 Likes

Oh interesting. If I understand what you mean by that, it’s reading your code, translating it into something that a tool like Mathematica/Wolfram Alpha can read as an equation and then having the Mathematica-like tool take the derivative of that. Is that roughly correct?

I don’t really know how the internals of symbolic differentiation works, but my first guess would be that it does something like apply the chain rule a whole lot, which I would think is functionally very similar to auto-differentiation (although all these topics I only vaguely understand from the outside). If I’m wrong about that, then the real answer is “you just need to learn more about CAS systems” and I assume you’re not going to put up a full intro to CAS in a Julia Discourse thread :slight_smile:

@HarrisonGrodin is optimizing the crap out of this, and it’ll be a nice paper some day.

2 Likes

Symbolic differentiation is actually quite straightforward, once you see the trick; in fact, you can think of it as a special case of symbolic simplification.

Suppose you want to take the derivative of sin(x*x) with respect to x, which I’ll denote diff(sin(x*x), x). How would you start? The chain rule - we now have diff(x*x, x) * cos(x*x), since we know that \frac{d \sin y}{dy} := \cos y. Now, let’s take out diff(x*x, x) - treating * as a binary function and applying the chain rule, we get x*diff(x,x) + diff(x,x)*x. Finally, we take out each diff(x,x), replacing with 1. Our result now looks something like (x*1 + 1*x) * cos(x*x). There’s a bit more simplification to do here - removing the unnecessary *1s, we get (x + x) * cos(x*x).

How do we automate this, though? The answer is surprisingly straightforward; we just encode the rules we implicitly used here a bit more formally:

diff(sin(x), t) => diff(x, t) * cos(x)
diff(x*y, t)    => diff(x,t)*y + x*diff(y,t)
diff(x, x)      => 1
x*1             => x
1*x             => x

This clearly isn’t exhaustive, but by listing more and more rules, we can take symbolic derivatives on arbitrary terms.

Conveniently enough, there’s a general strategy we can use to check if the left-hand-side for a rule applies at a location of a concrete term: pattern matching.

Given an arbitrary symbolic term, we run through and find instantiations of a given left-hand side, and then replace each with the corresponding right-hand side until there are no more rules left to apply. This is precisely how symbolic differentiation in Rewrite.jl works.
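
If it helps to see the recursion spelled out, here is a hypothetical, stripped-down version of those rules written directly against Julia Exprs (Rewrite.jl itself uses a general pattern-matching engine rather than hand-written dispatch like this):

# Toy rule-based symbolic differentiation over Julia expressions.
diffexpr(e::Number, x::Symbol) = 0
diffexpr(e::Symbol, x::Symbol) = e === x ? 1 : 0

function diffexpr(e::Expr, x::Symbol)
    f, args = e.args[1], e.args[2:end]
    if f === :sin                          # diff(sin(u), t) => diff(u, t) * cos(u)
        u = args[1]
        return :($(diffexpr(u, x)) * cos($u))
    elseif f === :* && length(args) == 2   # diff(a*b, t) => diff(a,t)*b + a*diff(b,t)
        a, b = args
        return :($(diffexpr(a, x)) * $b + $a * $(diffexpr(b, x)))
    elseif f === :+                        # the derivative of a sum is the sum of derivatives
        return Expr(:call, :+, (diffexpr(a, x) for a in args)...)
    else
        error("no differentiation rule for $f")
    end
end

diffexpr(:(sin(x * x)), :x)
# returns an expression equivalent to (1*x + x*1) * cos(x*x); applying the
# simplification rules x*1 => x and 1*x => x then gives (x + x) * cos(x*x)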


As a quick side note, you may be disappointed with the fact that we have two rules for the multiplicative identity, x*1 => x and 1*x => x. Shouldn’t this symmetry be taken care of, since we know * is commutative? Yes, I’d say so. If you’re interested, I’ll be presenting this in more depth at JuliaCon in a couple weeks.


(Thanks for the ping, @ChrisRackauckas!)

3 Likes

Also, Yingbo and I were discussing how the symmetry of the Hessian would allow for a ton of CSE to occur if you created the symbolic function for Hessian*vector computations. I think that would destroy something like forward-over-reverse implementations of AD for Hessians because those can have a bit of overhead.

Oh BTW @pistacliffcho, we have implemented the matrix-free Hessian*vector computations over in SparseDiffTools.jl. You can choose from every reasonable mixture of numerical differentiation, ForwardDiff, and Zygote for the calculation.

I think the current tests showed that numerical + Zygote was the fastest, but you can time it yourself with this script:

using SparseDiffTools, BenchmarkTools
using ForwardDiff            # needed for the Dual caches and GradientConfig below
using Zygote                 # the *back_hesvec! variants use Zygote for the reverse-mode pass

x = rand(300)
v = rand(300)
f(u) = sum(abs2, u)
du = similar(x)

# preallocated work buffers, so the benchmarks measure only the Hessian*vector product
c1 = similar(x); c2 = similar(x); c3 = similar(x); c4 = similar(x)
cache1 = ForwardDiff.Dual{SparseDiffTools.DeivVecTag}.(x, v)
cache2 = ForwardDiff.Dual{SparseDiffTools.DeivVecTag}.(x, v)
config = ForwardDiff.GradientConfig(f, x)

@btime num_hesvec!($du, $f, $x, $v, $c1, $c3, $c4)
@btime numauto_hesvec!($du, $f, $x, $v, $config, $c1, $c2)
@btime autonum_hesvec!($du, $f, $x, $v, $c1, $cache1, $cache2)
@btime numback_hesvec!($du, $f, $x, $v, $c1, $c2)
@btime autoback_hesvec!($du, $f, $x, $v, $cache1, $cache2)

But if you want to make it really fast, compute it all symbolically with ModelingToolkit and just compile the most optimized function. That will always be faster than autodiff because the overhead is effectively less than zero: common subexpression elimination (CSE) means that any expression repeated inside the symbolic version can be factored out and computed only once, which is really hard to do if all you have is a code version of a Hessian*vector computation.
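
A rough sketch of that workflow, assuming the symbolic layer exposes @variables, hessian, and build_function roughly as below (names and return conventions have shifted between ModelingToolkit versions, and newer releases split the symbolic core out into Symbolics.jl):

using ModelingToolkit

g(u) = u[1]^2 + u[2]^2 + sin(u[1] * u[2])   # placeholder objective

@variables x y
ex = g([x, y])                              # trace the numerical code symbolically
H  = ModelingToolkit.hessian(ex, [x, y])    # symbolic Hessian, ready for CSE
hess_oop, hess_ip! = build_function(H, [x, y]; expression = Val{false})

hess_oop(rand(2))                           # dense 2×2 Hessian, evaluated numerically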

2 Likes

Can you post some simple code that mimics your problem?

So… ModelingToolkit.jl looks interesting. Some questions:

  • Can one define differentials in both time and space, and use it to describe PDEs, which can then be automatically discretized (…in the future)?
  • How does it relate to an extension like Modia? [Which allows for building and re-using libraries.]
  • Most models that relate to “reality” are systems with inputs [control inputs, disturbances] and outputs [sensor signals, controlled variables]. ODE solvers, etc. often neglect such inputs and outputs; handling them is an integral part of tools such as Modia. Any thoughts on this?
  • Symbolic differentiation… I used Mathematica to develop a symbolic model of a robot (Euler-Lagrange formalism) way back when Mathematica was relatively new. Then I generated FORTRAN code. The model became extremely complicated. My impression is that AD is more efficient. But maybe I misunderstand the difference (or the use?)?

The prototype works:

Finishing that in August.

ModelingToolkit is a compiler. Modia is a DSL. A user cannot define a new compilation process for how a model order reduction algorithm works in Modia, but they can in ModelingToolkit. My vision is to just take out the guts of Modia and replace them with ModelingToolkit, i.e. get the language of Modia but with the internal computations using the common ModelingToolkit system, which is now being used in a bunch of other DSLs.

That’s for the modeling language to decide. ModelingToolkit isn’t a language, it’s an IR.

AD is not more efficient, it has overhead. Symbolic reduces or outright removes that overhead, if it’s applicable. In some cases the full symbolic expression may be too large to be represented, in which case AD is “better” because it’s actually possible. But ModelingToolkit isn’t a language, it’s an IR, so you can define the derivative of some function to use AD and it will generate code with that AD in there.

2 Likes