Feasibility of nested forward diff with sparse partials

Hi,

I will first describe my use case and the issues I have with existing tools, then propose a solution, and finally ask my questions (to people with a deep knowledge of Julia) about whether my proposed solution might fly. Here goes.

Use-case
I am currently working on a finite element (FE) framework, specifically for optimisation problems: I want to optimise some cost (or maximise a probability) under the constraint that the FE model be exactly satisfied. “Framework” here means that I need to think very carefully about making the work of adding elements to this system as simple as possible. After much consideration, I arrived at a simple solution: the element programmer must provide a function lagrangian that takes O(10)-O(100) arguments and returns a scalar (an additive contribution to the Lagrangian of the constrained optimisation problem).
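
As a purely illustrative sketch (the element, the variable names and the "physics" below are all made up), the contract looks something like this:

```julia
using StaticArrays

# Toy bar-like element: takes nodal coordinates `x`, the element's degrees of
# freedom `u` and a stiffness `k`, and returns one scalar contribution to the
# Lagrangian. Real elements would take many more arguments.
function lagrangian(x::SVector{N}, u::SVector{N}, k) where {N}
    strain_energy = k / 2 * sum(i -> abs2(u[i + 1] - u[i]), 1:N-1)
    external_work = sum(u)          # stand-in for a load term
    return strain_energy - external_work
end

lagrangian(SVector(0.0, 1.0, 2.0), SVector(0.1, 0.0, -0.1), 100.0)
```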

Hence my requirement: I need to compute the Hessian of a scalar-valued function of (typically) 10 to 100 inputs (degrees of freedom), and this needs to be very fast.
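
Concretely, the hot computation looks like this (with a toy function standing in for a real element):

```julia
using ForwardDiff, StaticArrays

f(u) = sum(abs2, u) + u[1] * u[end]   # toy stand-in for an element lagrangian
u = @SVector randn(12)
H = ForwardDiff.hessian(f, u)         # 12×12 Hessian of the toy function
```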

Issues with existing tools
Both ForwardDiff and Zygote (reverse diff++) provide the functionality needed. However, forward automatic differentiation, nested to the second order (Hessian) and operating on a scalar-valued function, will have low performance: it lugs around a large number of zeros. I have been testing Zygote, and there it seems that the computational overheads become large relative to the rather small number of inputs (see the thread “Zygote and StaticArrays”). I suspect Zygote is designed for considerably larger numbers of input variables.
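
To make the “lugging zeros” point concrete: with a single chunk, ForwardDiff's nested duals carry a full set of partials per scalar, whatever the sparsity of the Hessian actually is:

```julia
using ForwardDiff

N = 10
Inner = ForwardDiff.Dual{:inner, Float64, N}   # value + N first-order partials
Outer = ForwardDiff.Dual{:outer, Inner, N}     # plus N inner duals on top of that
sizeof(Outer)                                  # (N + 1)^2 * 8 = 968 bytes per scalar
```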

Proposed solution
I think that, in terms of the number of flops, we could achieve very decent performance using forward differentiation, nested to the second order, with sparse partials. Forward differentiation, to be performant, must not allocate: the partials must be stored on the stack. In Julia, only immutable variables of known size can be stored on the stack (immutability is not a problem here). As a consequence, a sparse partial would need to have its sparsity structure as part of its type. As a further consequence, “each” number within lagrangian would have a different type, so that arrays of such numbers would be heterogeneous, and stack allocation would again be ruled out.
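
A minimal sketch of what I have in mind (purely hypothetical, not an existing package): the nonzero indices are type parameters, so the stored data is a plain tuple of bits values and the whole number can live on the stack.

```julia
struct SparseHessDual{I1, N1, I2, N2, T}
    value::T
    grad::NTuple{N1, T}    # nonzero first-order partials, at indices I1
    hess::NTuple{N2, T}    # nonzero second-order partials, at index pairs I2
end

isbitstype(SparseHessDual{(1, 4), 2, ((1, 1), (1, 4)), 2, Float64})   # true
```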

Questions
Long ago I programmed my own nested forward differentiation, which does essentially the same as ForwardDiff.jl, so I am pretty sure I would know how to create sparse partials and fit them into forward diff.

Partials need to support multiplication by scalars (I am deliberately not saying of what type!) and addition with each other. My intention is that the compiler should create a specialised method instance for each combination of sparsity structures appearing in these operations (a sketch follows the questions below). Clearly this is going to result in long compilation times; let us say that I accept that.

  1. Will the compiler indeed create specialised method instances, or will it fall back on creating a smaller number of type-unstable instances (unacceptable)?
  2. Will the specialised method instances be efficient - or will the sparse operations sabotage things I know little about like pipelining/vectorisation/BLAS?
  3. Will the method instances be statically linked into the code or will there be some longish table lookup at run time to decide which instance to call?
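
To make questions 1) and 3) concrete, here is a toy sketch (SPartials and the union-of-patterns rule are my own invention, not an existing package): a @generated addition whose generator runs once per pair of sparsity patterns, so the emitted code is straight-line tuple arithmetic with no runtime pattern lookup. Whether that code then pipelines/vectorises well (question 2) only benchmarking will tell.

```julia
struct SPartials{Idx, N, T}     # Idx: tuple of nonzero indices, part of the type
    nz::NTuple{N, T}
end

@generated function Base.:+(a::SPartials{A, NA, T}, b::SPartials{B, NB, T}) where {A, NA, B, NB, T}
    merged = Tuple(sort(unique([A..., B...])))   # union pattern, computed at compile time
    terms = Expr[]
    for i in merged
        ia = findfirst(isequal(i), A)
        ib = findfirst(isequal(i), B)
        if ia === nothing
            push!(terms, :(b.nz[$ib]))
        elseif ib === nothing
            push!(terms, :(a.nz[$ia]))
        else
            push!(terms, :(a.nz[$ia] + b.nz[$ib]))
        end
    end
    return :(SPartials{$merged, $(length(merged)), $T}(tuple($(terms...))))
end

x = SPartials{(1, 3), 2, Float64}((1.0, 2.0))
y = SPartials{(3, 7), 2, Float64}((10.0, 20.0))
x + y    # SPartials{(1, 3, 7), 3, Float64}((1.0, 12.0, 20.0))
```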

But that’s just the beginning:
Now each differentiated number has its own data type (because its first- and second-order partials have their own sparsity patterns). For FE, my programming paradigm has been to use linear algebra operations on StaticArrays (neat code, and it keeps you on the stack). Obviously, StaticArrays does not handle arrays whose elements have different types, so a new array type must be created, with questions 1), 2) and 3) just as relevant (a sketch follows the question below). At a minimum, I want to be able to do additions and tensor operations on such arrays.

  1. Does this imply a huge job (rewriting the whole of StaticArrays), or is there some minimal interface to implement that would give access to already-developed functionality and result in efficient compiled code?
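
A minimal sketch of the kind of array I mean (a toy type of my own, not an interface StaticArrays is guaranteed to offer): a fixed-length container backed by a plain Tuple, so each element keeps its own type and the operations unroll over the tuple.

```julia
struct HeteroVec{T <: Tuple}
    data::T
end

Base.length(v::HeteroVec) = length(v.data)
Base.getindex(v::HeteroVec, i::Int) = v.data[i]

# Elementwise addition and a contraction, written over the backing tuples;
# the result element types are whatever promotion of the inputs gives.
vadd(a::HeteroVec, b::HeteroVec) = HeteroVec(map(+, a.data, b.data))
vdot(a::HeteroVec, b::HeteroVec) = sum(map(*, a.data, b.data))

u = HeteroVec((1.0, 2, 3.0f0))     # three different element types
v = HeteroVec((4, 5.0, 6))
vdot(u, v)                          # 32.0
```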

If this does not scare you, you do not understand what I am proposing to do :laughing: .

There is some support for this in JuliaDiff/SparseDiffTools.jl (https://github.com/JuliaDiff/SparseDiffTools.jl), but for Hessians, finite differences are used for the outer differentiation. Maybe you could build upon the tools there by providing sparse ForwardDiff over ForwardDiff.
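
For reference, plain (dense) ForwardDiff-over-ForwardDiff is just the Jacobian of the gradient; the suggestion above would amount to applying SparseDiffTools' coloring machinery to that outer Jacobian. A toy example of the dense version:

```julia
using ForwardDiff

f(x) = sum(abs2, x) + x[1] * x[end]      # toy scalar function

x = randn(10)
H = ForwardDiff.jacobian(y -> ForwardDiff.gradient(f, y), x)
H ≈ ForwardDiff.hessian(f, x)            # true: Hessian = Jacobian of the gradient
```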

Are the gradients sparse BTW? If so, would it be more efficient/easier to calculate the Hessian of the entire lagrangian directly instead of summing Hessians of individual terms? That way you’d have a dense gradient and would not have to worry about sparsity for the inner gradient calculation, only the outer Jacobian of the gradient would be sparse.

Hi,

I will certainly study SparseDiffTools.jl, thank you for the tip-off. The first thing I will look for when reading about it is “are we on the stack?”.

My gradients are full. This is true whether you consider a single element (10-100 degrees of freedom) or the whole system (1e3-1e6 degrees of freedom). As for the Hessians, those of the elements are “somewhat sparse”, and that of the model is “very sparse”.

You mention finite differences. I’m not a fan: for multiphysical models with variables of different kinds, just finding a good increment is a challenge. That said, I will look up FiniteDiff.jl and FiniteDifferences.jl (mentioned in SparseDiffTools.jl) to see how they work.

If the gradient is sparse, how will any gradient-based solver ever change the parameters corresponding to zero gradient?

I am indeed beginning to see a possible solution:
Instead of having automatic differentiation operate at the element level, as has been my default thinking so far, I operate at the model level.

  1. I use reverse autodiff (possibly Zygote) to get the gradient. I found Zygote inefficient at element level (10-100 variables), but that need not be true at model level (1e3-1e6 variables).
  2. I use matrix coloring (I never knew that existed!) and forward diff to obtain the Hessian, with only 10-100 partials needed (a rough sketch follows below).
    I can skip the “joys” of programming the whole sparse matrix assembly!
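
Roughly what I have in mind, as far as I understand the package APIs (the toy model function, the hand-written tridiagonal sparsity pattern and the exact keyword names should all be double-checked against the SparseDiffTools docs, and whether Zygote's pullback composes cleanly with ForwardDiff duals here is something I still have to verify):

```julia
using Zygote, SparseDiffTools, SparseArrays

# Toy stand-in for the model-level Lagrangian: gradient entry i depends on
# x[i-1], x[i], x[i+1], so the Hessian is tridiagonal.
f(x) = sum(abs2, x[2:end] .- x[1:end-1])

x = randn(100)
grad!(dy, y) = (dy .= Zygote.gradient(f, y)[1])   # 1. dense gradient, reverse mode

# 2. known Hessian sparsity + matrix coloring: a few dual partials instead of 100
H = spdiagm(-1 => ones(99), 0 => ones(100), 1 => ones(99))
colors = matrix_colors(H)
forwarddiff_color_jacobian!(H, grad!, x, colorvec = colors)   # H now holds the Hessian
```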

Give me a few days…


Sorry, I now corrected my typo: gradients are full. :crazy_face:

That would only take an hour to fix if someone really cared. And that’s only one method call.

Just change that to have the same dual handling as https://github.com/JuliaDiff/SparseDiffTools.jl/blob/master/src/differentiation/compute_jacobian_ad.jl

The first version of autoauto_color_hessian can be done single chunk pretty quickly.