Optim: Hessian or Hessian-free?

Hi,

In the documentation for the Optim package I note the following statement with regards to the choice of algorithm:

When you do not have an explicit Hessian or when the dimension becomes large enough that the linear solve in the Newton method becomes the bottleneck, first order methods should be preferred.

Just wondering if anyone is able to quantify a little more (at least approximately) the circumstances under which the dimension is considered large? I’m solving a concave optimization problem (a probit regression model in the log domain) and currently using the Hessian-free L-BFGS algorithm. I’m currently working with a dimension of 100 but hoping to scale it up to 1000, 5000, or more, depending on the optimization performance.

Thanks for any input

The nice modular API of Optim.jl should allow you to simply try out (and benchmark) both options at practically zero cost.
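
For concreteness, here is a minimal sketch of what “just try both” could look like; the Rosenbrock objective is only a placeholder for your probit log-likelihood, and passing autodiff = :forward makes Optim use ForwardDiff for the gradient (and Hessian) instead of finite differences:

using Optim

f(x) = sum(100 .* (x[2:end] .- x[1:end-1].^2).^2 .+ (1 .- x[1:end-1]).^2)  # placeholder objective
x0 = zeros(100)

res_lbfgs  = optimize(f, x0, LBFGS(); autodiff = :forward)             # first order, Hessian-free
res_cg     = optimize(f, x0, ConjugateGradient(); autodiff = :forward) # first order, even lighter
res_newton = optimize(f, x0, Newton(); autodiff = :forward)            # second order, dense Hessian + linear solve

Timing each call (e.g. with BenchmarkTools.@btime) at your target dimension is the quickest way to see where the Newton linear solve starts to dominate.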

1 Like

Thanks @Tamas_Papp, yes you are right. I will have to do that at some point. But I was wondering if there are rules of thumb we can follow when prototyping, instead of having to derive and code the Hessian.

Any thoughts?

You can use forward-mode automatic differentiation, via ForwardDiff.jl, to calculate the Hessian and the gradient.
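
For example (a minimal sketch; the objective below is just a stand-in for the actual log-likelihood):

using ForwardDiff

f(x) = sum(abs2, x) + sum(sin, x)   # placeholder objective
x = rand(100)

g = ForwardDiff.gradient(f, x)      # length-100 gradient
H = ForwardDiff.hessian(f, x)       # 100×100 dense Hessian

These can be wrapped into functions you hand to Optim, or you can let Optim call ForwardDiff for you via the autodiff = :forward keyword.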

2 Likes

If you do not have an explicit Hessian, it is usually not worth coding one. Quasi-Newton methods don’t need one anyway and are rather efficient in real-life problems. Theory characterizes local convergence (quadratic vs superlinear), but that is not the primary concern in practice unless your problem is super well-behaved.

Generally, the curvature variation in your problem is at least as important as the dimension, if not more. A simple probit model in 10³–10⁴ dimensions should be manageable with either BFGS or even CG.

1 Like

You can just run it through ModelingToolkit.jl to automatically generate a symbolic version of your numerical function and then just compute the Hessian on that.

3 Likes

Just curious: it sounds like ModelingToolkit.jl provides the same functionality as ForwardDiff.jl when it comes to auto-differentiation, is that correct?

I really like the interface of ForwardDiff.jl, but it’s my (poorly informed) understanding that forward-mode differentiation is not very computationally efficient (assuming I read the documentation correctly) for the common stats problem of optimizing R^p \rightarrow R^1, which is a bit discouraging. Should we expect better performance out of ModelingToolkit.jl?

Small note: when optimizing high-dimensional problems, it is advised to use L-BFGS rather than BFGS. This is because BFGS stores a dense approximation of the Hessian, which becomes very large and slow to work with in high dimensions. L-BFGS stores only a handful of recent step/gradient pairs, which implicitly define a low-rank approximation of the inverse Hessian; this saves both memory and computation.
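
In Optim.jl the history length is the m parameter of LBFGS. A rough sketch (the objective is again a placeholder):

using Optim

f(x) = sum(abs2, x)        # placeholder objective
x0 = randn(1_000)

# m (default 10) is the number of recent step/gradient pairs L-BFGS keeps,
# so storage grows like m*n instead of the n^2 dense matrix that full BFGS carries.
res = optimize(f, x0, LBFGS(m = 20); autodiff = :forward)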

2 Likes

No, ModelingToolkit.jl isn’t doing automatic differentiation, it is doing symbolic differentiation. It’s weird, because you never wrote the code in the symbolic framework, but what it’s doing is automatically converting numerical Julia code into symbolic Julia code through AST desugaring and then simply using it as if it had been written in a CAS. It’s best shown by an example:

http://juliadiffeq.org/DiffEqTutorials.jl/html/ode_extras/01-ModelingToolkit.html

This is definitely one of those “only in Julia” things :slight_smile:

2 Likes

Oh interesting. If I understand what you mean by that, it’s reading your code, translating it into something that a tool like Mathematica/Wolfram Alpha can read as an equation and then having the Mathematica-like tool take the derivative of that. Is that roughly correct?

I don’t really know how the internals of symbolic differentiation works, but my first guess would be that it does something like apply the chain rule a whole lot, which I would think is functionally very similar to auto-differentiation (although all these topics I only vaguely understand from the outside). If I’m wrong about that, then the real answer is “you just need to learn more about CAS systems” and I assume you’re not going to put up a full intro to CAS in a Julia Discourse thread :slight_smile:

@HarrisonGrodin is optimizing the crap out of this, and it’ll be a nice paper some day.

2 Likes

Symbolic differentiation is actually quite straightforward, once you see the trick; in fact, you can think of it as a special case of symbolic simplification.

Suppose you want to take the derivative of sin(x*x) with respect to x, which I’ll denote diff(sin(x*x), x). How would you start? The chain rule - we now have diff(x*x, x) * cos(x*x), since we know that \frac{d \sin y}{dy} := \cos y. Now, let’s take out diff(x*x, x) - treating * as a binary function and applying the chain rule, we get x*diff(x,x) + diff(x,x)*x. Finally, we take out each diff(x,x), replacing with 1. Our result now looks something like (x*1 + 1*x) * cos(x*x). There’s a bit more simplification to do here - removing the unnecessary *1s, we get (x + x) * cos(x*x).

How do we automate this, though? The answer is surprisingly straightforward; we just encode the rules we implicitly used here a bit more formally:

diff(sin(x), t) => diff(x, t) * cos(x)
diff(x*y, t)    => diff(x,t)*y + x*diff(y,t)
diff(x, x)      => 1
x*1             => x
1*x             => x

This clearly isn’t exhaustive, but by listing more and more rules, we can take symbolic derivatives on arbitrary terms.

Conveniently enough, there’s a general strategy we can use to check if the left-hand-side for a rule applies at a location of a concrete term: pattern matching.

Given an arbitrary symbolic term, we run through and find instantiations of a given left-hand side, and then replace each with the corresponding right-hand side until there are no more rules left to apply. This is precisely how symbolic differentiation in Rewrite.jl works.
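
If it helps to see the recursion spelled out, here is a hypothetical, stripped-down version of those rules written directly against Julia Exprs (Rewrite.jl itself uses a general pattern-matching engine rather than hand-written dispatch like this):

# Toy rule-based symbolic differentiation over Julia expressions.
diffexpr(e::Number, x::Symbol) = 0
diffexpr(e::Symbol, x::Symbol) = e === x ? 1 : 0

function diffexpr(e::Expr, x::Symbol)
    f, args = e.args[1], e.args[2:end]
    if f === :sin                          # diff(sin(u), t) => diff(u, t) * cos(u)
        u = args[1]
        return :($(diffexpr(u, x)) * cos($u))
    elseif f === :* && length(args) == 2   # diff(a*b, t) => diff(a,t)*b + a*diff(b,t)
        a, b = args
        return :($(diffexpr(a, x)) * $b + $a * $(diffexpr(b, x)))
    elseif f === :+                        # the derivative of a sum is the sum of derivatives
        return Expr(:call, :+, (diffexpr(a, x) for a in args)...)
    else
        error("no differentiation rule for $f")
    end
end

diffexpr(:(sin(x * x)), :x)
# returns an expression equivalent to (1*x + x*1) * cos(x*x); applying the
# simplification rules x*1 => x and 1*x => x then gives (x + x) * cos(x*x)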


As a quick side note, you may be disappointed with the fact that we have two rules for the multiplicative identity, x*1 => x and 1*x => x. Shouldn’t this symmetry be taken care of, since we know * is commutative? Yes, I’d say so. If you’re interested, I’ll be presenting this in more depth at JuliaCon in a couple weeks.


(Thanks for the ping, @ChrisRackauckas!)

3 Likes

Also, Yingbo and I were discussing how the symmetry of the Hessian would allow for a ton of CSE to occur if you created the symbolic function for Hessian*vector computations. I think that would destroy something like forward-over-reverse implementations of AD for Hessians because those can have a bit of overhead.

Oh BTW @pistacliffcho, we have implemented the matrix-free Hessian*vector computations over in SparseDiffTools.jl. You can choose from every reasonable mixture of numerical differentiation, ForwardDiff, and Zygote for the calculation.

I think the current tests showed that numerical + Zygote was the fastest, but you can time it yourself with this script:

using SparseDiffTools, BenchmarkTools
using ForwardDiff            # needed for the Dual caches and GradientConfig below
using Zygote                 # the *back_hesvec! variants use Zygote for the reverse-mode pass

x = rand(300)
v = rand(300)
f(u) = sum(abs2, u)
du = similar(x)

# preallocated work buffers, so the benchmarks measure only the Hessian*vector product
c1 = similar(x); c2 = similar(x); c3 = similar(x); c4 = similar(x)
cache1 = ForwardDiff.Dual{SparseDiffTools.DeivVecTag}.(x, v)
cache2 = ForwardDiff.Dual{SparseDiffTools.DeivVecTag}.(x, v)
config = ForwardDiff.GradientConfig(f, x)

@btime num_hesvec!($du, $f, $x, $v, $c1, $c3, $c4)
@btime numauto_hesvec!($du, $f, $x, $v, $config, $c1, $c2)
@btime autonum_hesvec!($du, $f, $x, $v, $c1, $cache1, $cache2)
@btime numback_hesvec!($du, $f, $x, $v, $c1, $c2)
@btime autoback_hesvec!($du, $f, $x, $v, $cache1, $cache2)

But if you want to make it really fast, compute it all symbolically with ModelingToolkit and just compile the most optimized function. That will always be faster than autodiff because the overhead is effectively less than zero: common subexpression elimination (CSE) means that any expression repeated inside the symbolic version can be factored out and computed only once, which is really hard to do if all you have is a code version of a Hessian*vector computation.
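
A rough sketch of that workflow, assuming the symbolic layer exposes @variables, hessian, and build_function roughly as below (names and return conventions have shifted between ModelingToolkit versions, and newer releases split the symbolic core out into Symbolics.jl):

using ModelingToolkit

g(u) = u[1]^2 + u[2]^2 + sin(u[1] * u[2])   # placeholder objective

@variables x y
ex = g([x, y])                              # trace the numerical code symbolically
H  = ModelingToolkit.hessian(ex, [x, y])    # symbolic Hessian, ready for CSE
hess_oop, hess_ip! = build_function(H, [x, y]; expression = Val{false})

hess_oop(rand(2))                           # dense 2×2 Hessian, evaluated numerically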

2 Likes

Can you post some simple code that mimics your problem?

So… ModelingToolkit.jl looks interesting. Some questions:

  • Can one define differentials in both time and space, and use it to describe PDEs, which can then be automatically discretized (…in the future)?
  • How does it relate to an extension like Modia? [Which allows for building and re-using libraries.]
  • Most models that relate to “reality” are systems with inputs [control inputs, disturbances] and outputs [sensor signals, controlled variables]. ODE solvers, etc. often neglect such inputs and outputs; handling them is an integral part of tools such as Modia. Any thoughts on this?
  • Symbolic differentiation… I used Mathematica to develop a symbolic model of a robot (Euler-Lagrange formalism) way back when Mathematica was relatively new. Then I generated FORTRAN code. The model became extremely complicated. My impression is that AD is more efficient. But maybe I misunderstand the difference (or the use?)?

The prototype works:

Finishing that in August.

ModelingToolkit is a compiler. Modia is a DSL. A user cannot define a new compilation process for how a model order reduction algorithm works in Modia, but they can in ModelingToolkit. My vision is to just take out the guts of Modia and replace them with ModelingToolkit, i.e. get the language of Modia but with the internal computations using the common ModelingToolkit system, which is now being used in a bunch of other DSLs.

That’s for the modeling language to decide. ModelingToolkit isn’t a language, it’s an IR.

AD is not more efficient, it has overhead. Symbolic reduces or outright removes that overhead, if it’s applicable. In some cases the full symbolic expression may be too large to be represented, in which case AD is “better” because it’s actually possible. But ModelingToolkit isn’t a language, it’s an IR, so you can define the derivative of some function to use AD and it will generate code with that AD in there.

2 Likes