How to deal with sparse Jacobians?

I believe I’ve been solving large and sparse linear systems most of my career, both as the end solution of a problem or in the calculation of non-linear least-squares iterations. I imagine a lot of people share my experience, and I am looking for suggestions for improving my work, and potentially do some collaboration developing tools for this kind of problem.

The greatest techniques I have learned in the past decades to deal with such problems must have been to use PCG, sparse matrices, and AD. Now all of those seem to be already well-supported in Julia, and I’m having a great time using the available tools. I may be just overlooking some project, but there seems to be something missing to help you actually deal with large models that result in large but sparse Jacobian matrices.

I imagine most people solving sparse systems are working with FEM. Myself, I work mostly with mobile robotics and problems such as SLAM and SFM and large regularization problems.

In robotics, we have seen the development of tools such as g2o, that allow you to represent your problem as a graph from which a gradient and Hessian matrix can be calculated. Some people have been giving this the special name of graph optimization even though in my perhaps limited understanding it is “nothing but” a way to help you assemble that sparse Jacobian. And I am not sure building such a graph is the optimal way of obtaining the Jacobian-updating-procedure that is what you are really interested in having.

There seems to be a nice group of people doing robotics in Julia. Is RoME.jl the project I should look for to help me? Is JuMP adequate for such a problem?

One of the reasons I wanted to open the question to the general community is because I am curious to hear what people from other areas have to say about these problems. AD is great, but there is a lot of extra work necessary to update a sparse Jacobian, and do it optimally. What do people in FEM usually do, for instance?

Also, I’ve been only using ForwardDiff, but there’s a lot of talking out there about other approaches to AD, not only alternatives to dual numbers (like Cassette) but also alternatives to forward-mode AD. I am curious to hear, what possibilities could I try to explore in this kind of sparse-Jacobian optimization problems?

How could do we best go from an abstract, declarative description of a problem with a model that has “N” control points and “M” observations, and come up with an efficient procedure to update a sparse Jacobian matrix? Is this not perhaps a great opportunity for metaprogramming?

2 Likes

Nope, we’re missing some matrix coloring schemes for efficiently numerically/ADing sparse Jacobians. Basically, these are the algorithms that help divy up the work given some known sparsity pattern. Then whatever domain-specific framework (like an FEM package), just needs to know how to generate the sparsity pattern as an input and the rest is covered by the coloring algorithm. Some discussion:

3 Likes

I must thank you for answering my most interesting question: what is out there used in other areas that are relevant to this problem. I must confess my ignorance of this technique of Jacobian compression. It is not surprising, though, given that not even preconditioning is widely used in my area. I would love if you could point me out some basic references about this technique.

I did try to follow a couple of articles I found, and I have a couple of questions. First of all, wouldn’t this be just a performance enhancement on top of using sparse matrices that might not even be equally drastic? I mean, using SparseArrayCSC instead of a dense matrix is a make-it-or-break-it thing for me. Performance improvements are always nice, but it looks to me like this may not be something fundamental for my applications. I am looking mostly for something that will give me convenience handling large models. To use Jacobian compression, or even a sparse array, to begin with, might be perhaps just options in a kind of “back-end” that computes the Jacobian…

And second: would I be wrong to say that this graph-coloring application is only required to support general models, but domains such as FEM and bundle-adjustment have a specific nature where this problem can be solved in a simple way, if not even trivial?

In my problems I have, for instance, 6 * N parameters and 6 * M observations. The Jacobian is calculated in blocks of 24x6 parameters-by-observations, with only 24 values per row. N=100 and M=1000. My current problem is just that writing the for-loops that produce these small Jacobian blocks is a drag, and hard to adapt to new models. Calculating all those indices is making me crazy! Also writing very high-level code with SparseArrayCSC seems to easily result in performance losses. Not to mention that for-loops are the new go-to, we should get rid of them.

I can be easily convinced there must be a better way to write this kind of sparse Jacobian update application, but I can’t see how the graph coloring algorithm would be necessary in my case. What exactly is a project I could contribute to in order to make this kind of thing easier for everybody?

Yes, it’s just a performance enhancement and won’t be as drastic as dense → sparse. But it can be a pretty big performance enhancement if your previous seeding techniques were naive (sounds like they weren’t entirely naive). In fact, a graph coloring algorithm will tell you the optimal number of computations required for computing the Jacobian. You can’t get better than optimal :wink:.

The purpose of a graph coloring algorithm is that you supply it a sparsity pattern and it comes with with the strategy. the 24x6 strategy that you have: you have to know/prove it works, which can be difficult if your sparsity pattern is irregular. The graph coloring algorithm realizes that you don’t have to + epsilon one at a time, and can instead + [epsilon,0,0,epsilon,...] (either numerical diff or partials of a dual number) as long as the parts of the Jacobian are disconnected. i.e., if it came from FEM, they have to be not touching, so the minimal number of non-touching colors is the minimal number of function calculations (+1) to calculate the full Jacobian. For things like polynomial elements, you can get cases with like <10 seeds necessary to populate a 1000x1000 Jacobian if there is a strong pattern to it, and the coloring algorithm find that itself.

FEM can have general sparsity patterns. In fact, there are fully connected Galerkin expansions as well. Any pattern can come out! So yes, if you have a banded matrix, it’s easy to figure this out yourself. Block banded? It’s easy to come up with some good (but maybe not optimal) schemes. Some random sparse matrix? You might need an algorithm.

Don’t use the indexing. Use the (I,J,V) representation directly when you can. Again, a coloring algorithm works abstractly on the (i,j) sparsity pattern and so it can easily build seeds on the linear representation.

Note that a completely other direction to go is matrix structures. For example, banded and block banded matrices have extra structure that make them not just a sparse matrix. Using BandedMatrices.jl and BlockBandedMatrices.jl can speed up calculations on them. Finding structures and algorithms to exploit them is worthwhile given Julia’s multiple dispatch architecture.

3 Likes

With really large-scale problems (typically PDE-constrained optimization, in my case), in any language, I usually don’t employ automatic differentiation, to be honest. There is often too much special structure I want to take advantage of in a really large problem for AD tools to reliably handle. It’s not all that hard to work out adjoint-method differentiation yourself. (Or you can use a Jacobian-free method like Anderson-accelerated fixed-point iterations.)

2 Likes

Regarding large scale problems, for a typical FEM (Finite Element Modeling) setting, you do the AD on the element level where the result is a dense local sensitivity matrix that gets assembled into a large sparse matrix. The local matrix is typically of a reasonable size (3x3 - 30x30) and forward mode AD has in my experience been very efficient.

3 Likes

Optimal in what sense? Figuring out the optimal type of AD to use to evaluate the derivative of a function is NP [1]. The distance-k graph coloring problem (which comes up often for FEM Jacobians) is also NP hard [2].

In my benchmarks (2D system of PDEs with 4 equations), computing the Jacobian via coloring is 50-100x slower than explicitly computing it. Coloring is definitely advantageous over computing one column at a time, and (after the coloring algorithm is debugged) requires no additional human effort after the residual evaluation has been written, but it doesn’t perform nearly as well as explicitly computing the Jacobian.

[1] Griewank and Walther, Evaluating Derivatives, 2008
[2] Gebremedhin et al. What Color is Your Jacobian? Graph Coloring for Computing Derivatives, 2005

1 Like

I invariably need reverse-mode/adjoint differentiation for my problems. For adjoint differentiation of PDEs with FEM or other methods, often you want to apply the adjoint analysis to the PDE (or weak form) before discretization, rather than differentiating the discretized problem. For FEniCS, there is a system called dolfin-adjoint that does this automatically to the symbolic weak form. See also this presentation.

Optimal in the sense that I wrote down: it finds the minimal number of seeds required to perform numerical or automatic differentiation, and thus the minimal number of function evaluations. That said, yes the graph coloring algorithm can be expensive, which is why approximate colors are used in practice instead of attempting to find the optimal ones. But if you did want to find out exactly how many function evaluations you need to populate a sparse Jacobian, there is a number you can get for it.

There’s also a trade off here. The Jacobian may need to be populated hundreds of times during the solution of an ODE (for MOL integration), so a slow graph coloring computation in order to speed up the Jacobian calculations can be beneficial.

Well yes, of course. I would never argue that a numerical method for computing the Jacobian is better than writing down the analytical solution to it… but if we’re already talking about AD I would think that the analytical Jacobian isn’t being considered here. So one way to say the statement is, if you had to use Dual numbers, what is the minimal dimension (number of seeds) can you use to get the Jacobian? That’s what’s answered by coloring algorithms.

That sounds like it would very naturally make use of the sparsity to a good degree, but is FEM specific. That’s a nice trick!

The adjoint method has a vector Jacobian product in there though, so unless the adjoint equations are built symbolically (or by hand) then there’s a sparse Jacobian to handle in there. It looks like dolfin-adjoint is handling it by symbolically building the adjoint equation and then directly solving the resulting PDE?

1 Like

I don’t know of any algorithm that can compute the minimum number of seeds. The problem is not that graph coloring is expensive, its that no algorithms exist to compute the result in polynomial time.

I think this is the main point of confusion. Your initial post said:

but you are not intending to find the optimal coloring, right?

Yes. That statement is correct: an optimal coloring will tell you the minimal number of computations to do and how to do it. The question it was answering was whether using a coloring algorithm was just a performance enhancement. The answer is that not only is a it performance enhancement, coloring algorithms can tell you exactly how much of enhancement you can get if you want to numerically/ADly build the sparse Jacobian.

None of the statements in this thread are contradictory, just keep them clearly delineated:

  • If you wanted to find out the minimal number of computations required to calculate a sparse matrix via numerical differentiation (or the minimal Dual number dimension for AD), an optimal graph coloring will give you that.
  • Optimal graph colorings are expensive to calculate, so in practice fast sparse Jacobian calculation algorithms utilize an approximate graph coloring, reducing the number of required function calls but not necessarily to the minimum (but with a much faster calculation of the coloring partition).
  • However, in any case doing this is more expensive than the analytical Jacobian, so you shouldn’t do any of this if you have access to an expression for the Jacobian. This should only be done when you do not have that.
  • In specific applications you know more information than just “here’s f(u), find df/du”. Like in FEM, you can AD the matrix assembly code since local assembly is only applied to adjoining elements, meaning this naturally uses the sparsity of the discretization.
  • In other applications you can remove the need to calculate sparse Jacobians at all. If you have a PDE you want derivatives of, you can write out the adjoint equations. If you are doing integration with an implicit method (or just solving a nonlinear system), the implicit equations can make use of something like Anderson Acceleration so that way Jacobians (and Jacobian inversions) are not necessary.

Cool, I think I definitely want to check out block matrices.

I hope I can reach a better understanding of the relationship between the specific problems I’m working on and other applications with sparse Jacobians. I feel like I’m about to do a lot of work that might go into a library that could help other people. But if the next steps in the evolution of these derivative tools go through these coloring techniques, then I have a lot of reading to do before I’m able to contribute, and I’ll probably just stick to writing my own for-loops instead.

This definitely sounds a lot like my case. The smaller matrix I can even compute using just static arrays. The problem I’m interested in is at a higher level: how do you assemble them into the full matrix? It’s really more about the coding than the math. Do you find yourself often writing for loops that encode information about which indices represent each cell, for instance? Or are there FEM frameworks that help you with that?

In my problem I must go from a group of vectors p_n, and apply a bunch of (vector) functions resulting in something like

r_n = h(g(f(p_{n+1})), f(p_{n+2})) - h(g(f(p_n)), f(p_{n+1}))

That means if you just calculate the r_n straight from the p_{n:n+2} there will be a lot of repeated calculations. This can be avoided by storing some state in the for-loop that goes through all of these functions. I am looking for techniques that can help me not only just assembling that matrix and also take care of boundaries, but also hopefully do something smart about those repeated calculations.

I’m actually using AD mostly for convenience, and the promise that it often results in calculations that are both faster and more accurate. How much of that is true, I can’t tell yet. But in my applications, I might indeed implement an analytical Jacobian instead. I’m just too lazy to work the equations out, and I’m not even sure I already got the right ones already!