No. Stencil compilers definitely make use of sparsity. What they are really doing is specializing the iteration order to take caching and prefetching behavior into account. If you’re not familiar with BLAS internals, see things like:
https://nbviewer.jupyter.org/github/mitmath/18335/blob/master/notes/Memory-and-Matrices.ipynb
So it matters a lot in matrix multiplication. Stencil compilers can do similar tricks:
https://dspace.mit.edu/handle/1721.1/72122
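To make the cache-blocking idea concrete, here’s a minimal, deliberately naive sketch of loop tiling for a dense matmul. Real BLAS kernels add packing, microkernels, and SIMD on top of this, and the block size here is just an illustrative guess, but the principle is the same:

```julia
# Naive cache-blocked (tiled) matrix multiply: C = A*B for square matrices.
# The point is only that iterating in blocks keeps the working set in cache;
# the block size `bs` is illustrative, not tuned.
function blocked_matmul!(C, A, B; bs = 64)
    n = size(A, 1)
    fill!(C, zero(eltype(C)))
    for jj in 1:bs:n, kk in 1:bs:n, ii in 1:bs:n
        for j in jj:min(jj + bs - 1, n), k in kk:min(kk + bs - 1, n)
            @inbounds bkj = B[k, j]
            @inbounds for i in ii:min(ii + bs - 1, n)
                C[i, j] += A[i, k] * bkj
            end
        end
    end
    return C
end
```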
However, notice that all of this is on simple trivial linear operations: if you are only doing `*` and `+`, and you’re only doing a single one of them, then this can matter a lot. So this shows up in linear PDEs because they boil down to `A*x` for some stencil `A`.
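For concreteness, here’s a rough sketch of what “boils down to `A*x`” means for a 1D Laplacian (boundaries ignored, everything illustrative); the loop form is what a stencil compiler would tile and vectorize:

```julia
using SparseArrays

# The interior of a 1D Laplacian stencil applied as a loop...
function laplacian!(y, x, dx)
    n = length(x)
    @inbounds for i in 2:n-1
        y[i] = (x[i-1] - 2x[i] + x[i+1]) / dx^2
    end
    return y
end

# ...is the same operation as y = A*x for a tridiagonal stencil matrix A:
n = 128; dx = 1 / (n - 1)
A = spdiagm(-1 => ones(n - 1), 0 => fill(-2.0, n), 1 => ones(n - 1)) ./ dx^2
```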
But when you have a nonlinear equation like `A*x .+ f.(x)`, things change. If you have a single `exp` or something in `f`, then `f` will already dominate the cost and so you’re wasting your time optimizing the `*` and `+` part. Remember, even floating point division is costly in comparison, so with a single `/` you’re already in the zone where optimizing the stencil part isn’t your biggest worry. For reaction-diffusion, this means any case with Michaelis-Menten or Hill functions.
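For example, a sketch of the kind of reaction term meant here (the rate function, `Vmax`, and `Km` are placeholders, not anything specific): every grid point now pays for a division, which already outweighs the handful of `*` and `+` in the stencil.

```julia
using LinearAlgebra

# Illustrative Michaelis-Menten rate; Vmax and Km are placeholder values.
mm(u; Vmax = 1.0, Km = 0.5) = Vmax * u / (Km + u)

# Semilinear RHS: the cheap stencil part plus a pointwise division per entry.
function rhs!(du, u, A)
    mul!(du, A, u)        # the part a stencil compiler would optimize
    @. du += mm(u)        # the division here dominates the per-point cost
    return du
end
```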
So assume `f` is “trivial enough” that you still want to optimize the stencil part all of the way (`f(x) = -x^2` for example). There are two ways to approach this (a sketch of both follows the list):

- Stencil compilers are fast, so calculate `y = A*x` and then do `y .+ f.(x)`
- Write it out in dot products, fusing `f` in and going linearly down `x`
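A rough sketch of both options for the `f(x) = -x^2` case, using a 1D Laplacian for the stencil (interior points only; everything here is illustrative):

```julia
using LinearAlgebra

f(x) = -x^2

# Approach 1: let the optimized kernel compute A*x, then do a second sweep for f.
function approach1!(du, x, A)
    mul!(du, A, x)     # stencil compiler / BLAS territory
    @. du += f(x)      # separate pass over the array
    return du
end

# Approach 2: write out the dot products and fuse f into the same sweep,
# so x is streamed through the cache only once.
function approach2!(du, x, dx)
    n = length(x)
    @inbounds for i in 2:n-1
        du[i] = (x[i-1] - 2x[i] + x[i+1]) / dx^2 + f(x[i])
    end
    return du
end
```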
In the case of one semilinear PDE it really comes down to implementation as to which will be faster. But now what about when you have systems of semilinear PDEs, like in reaction-diffusion? Now you have something like `A*x + f.(x[:,:,1], x[:,:,2])` as your standard mathematical operation. You need two very different parts of the array at all times in every line, so some of the assumptions of the cache-oblivious algorithms are now broken: if you have a system of `n` dependent variables, the cache space each of them can effectively use is now `1/n` of what you had before, and very quickly the loop sizes that fit in L1 or L2 cache start to drop off. So when you have big enough systems of PDEs, or if `f` is costly enough in any of them, the stencil compiler is not the big savior you’d hope it to be.
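As a rough sketch of the systems case (the reaction terms and the diffusion constants `D1`, `D2` are made up for illustration): every point of the fused loop needs both slices, so each species effectively gets half the cache it had in the single-variable case.

```julia
# Two-species reaction-diffusion RHS with u = x[:,:,1] and v = x[:,:,2].
function twospecies_rhs!(dx, x, D1, D2, h)
    u  = @view x[:, :, 1]
    v  = @view x[:, :, 2]
    du = @view dx[:, :, 1]
    dv = @view dx[:, :, 2]
    n, m = size(u)
    @inbounds for j in 2:m-1, i in 2:n-1
        lap_u = (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4u[i, j]) / h^2
        lap_v = (v[i-1, j] + v[i+1, j] + v[i, j-1] + v[i, j+1] - 4v[i, j]) / h^2
        # both slices are needed at every point, halving the effective cache
        du[i, j] = D1 * lap_u - u[i, j] * v[i, j]^2
        dv[i, j] = D2 * lap_v + u[i, j] * v[i, j]^2
    end
    return dx
end
```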
So then at the end of the day, stencil compilers are amazing tools, but in so many of these different edge cases the standard case that they optimize becomes rounding error, and sometimes simply fusing `f` into the rest of the loop becomes the better idea.
We show an example in the tutorials of devectorizing for manual fusing:
https://tutorials.sciml.ai/html/introduction/03-optimizing_diffeq_code.html
We’re playing around with ideas for super SLP-vectorization and using reinforcement learning to reorder equations into better schedules on the fused code, and also parallelizing the fused code (like in Automated Sparse Parallelism of Julia Functions via Tracing · Symbolics.jl), but this all requires the symbolic analysis of Symbolics.jl. That is what leads to the direction of performing the lowering in this way.
There’s still lots of research to do here, and some of this is based on quick calculations that probably should get benchmarked in more detail, but hopefully that gives the full picture as to why, months later, we’ve switched from worrying about stencil compilers to worrying about symbolic computing.