Compiler specialisation in automatic differentiation

I am working with finite element method code. In this context, it is (very!!!) useful to use forward automatic differentiation to differentiate the function that computes, from the degrees of freedom, the element's contribution to the residual. In other words, a function of the form

r = foo(x)

where x, r and everything in between are static arrays.
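
To fix ideas, here is a minimal sketch of what I mean (the foo below is a made-up stand-in, not my actual element code):

using StaticArrays, ForwardDiff

# toy stand-in for an element residual: 3 dofs in, 3 residual components out
foo(x) = SVector(x[1]^2 + x[2], x[2]*x[3], sin(x[3]))

x = SVector(1.0, 2.0, 3.0)
r = foo(x)                          # the residual contribution
K = ForwardDiff.jacobian(foo, x)    # its 3×3 Jacobian, computed with dual numbers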

I started early: ForwardDiff.jl did not exist yet, so I have my own homebrew package which does essentially the same - I'll refer to it as Adiff.

At the start of a differentiation process, the partials of a dual number are populated with zeros and a single one (an SVector in Adiff, a Tuple in ForwardDiff), and some of those zeros can be propagated quite far into the computations, depending on the structure of foo. Things only get worse when differentiating to higher order. I don't know about you, but I feel adding zeros and multiplying by ones is a waste of time :smiley: . So I messed around a little:

struct Nix end # structurally zero
const nix = Nix()
@inline Base.:*(::Nix, ::Number) = nix
@inline Base.:*(::Number, ::Nix) = nix
@inline Base.:*(::Nix, ::Nix)    = nix
@inline Base.:+(::Nix, b::Number) = b
@inline Base.:+(a::Number, ::Nix) = a
@inline Base.:+(::Nix, ::Nix)    = nix
#... you get the gist (-, /, etc.)
Base.show(io::IO, ::Nix) = print(io, "⋅")

# make a few "sparse static" vectors
a = (nix,nix,2,nix,nix,1,nix,nix,nix,nix,nix,nix)
b = (1,nix,nix,2,nix,3,nix,4,nix,5,nix,6)

e = sum(a.*b)
#...

Both @code_native and BenchmarkTools suggest that this technique reduces the number of operations and computing times, and there is no penalty when full vectors are involved in the computations.
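
For what it's worth, this is roughly how I compared the two cases (the dense tuples below are just zero-filled counterparts of a and b):

using BenchmarkTools

ad = (0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0)
bd = (1, 0, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6)

@btime sum($a  .* $b)    # structural zeros folded away at compile time
@btime sum($ad .* $bd)   # full chain of multiplications and additions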

The flip side is that the compiler must produce specialised machine code for every combination of sparsity patterns appearing in the operations, which will obviously require longer compilation times.
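
To illustrate: every distinct pattern of Nix/number positions is a distinct Tuple type, so each combination gets its own method instance:

a1 = (nix, 2.0, nix);  b1 = (1.0, nix, 3.0)
a2 = (1.0, nix, nix);  b2 = (nix, 2.0, nix)

typeof(a1) == typeof(a2)   # false: Tuple{Nix,Float64,Nix} vs Tuple{Float64,Nix,Nix},
                           # so sum(a1 .* b1) and sum(a2 .* b2) compile separately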

But my concern, and my question, is this: faced with generating a large number of specialised instances, will the compiler end up throwing in the towel and generating generic code - with much worse performance than just leaving well alone, as in Adiff and ForwardDiff? Of course the best thing is just to try, but at least in Adiff that would be some work, so I am checking whether we can tell outright that this is not a good idea.

Generally, no. The compiler will happily compile as many specific method instances as the user asks it to, until it runs out of memory.

However, I think you can likely do better than your current approach. Since you probably need to compile a lot of methods, your program is likely to be dominated by compilation time (but there's a chance that it's not - so measure!). For example, you could instead create a single type that just stores the index of the only non-zero element. In fact, that already exists in FillArrays.jl under the name OneElement. I don't know how it behaves under AD, though. With this approach you only need to compile once, but you can still save the multiplications.
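
Something like this (untested, going from memory of the FillArrays API):

using FillArrays

e2 = OneElement(2, 5)   # length-5 vector: zero everywhere except a 1 at index 2
collect(e2)             # [0, 1, 0, 0, 0]

Only the index and length are stored, so every seed vector has the same type regardless of where the non-zero sits.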

Alternatively, you could just try regular sparse arrays from the standard library.
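
Something along these lines (sparsevec is part of the SparseArrays standard library):

using SparseArrays

# a length-12 vector storing only the two non-zero entries (heap-allocated)
s = sparsevec([3, 6], [2.0, 1.0], 12)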

If you can come up with a realistic benchmark, then feel free to post it here. That will likely give you the best feedback :slight_smile:


Thanks for the answer. Nice to know about “limitless specialisation”.

Some remarks from my side:

  1. It has to be on the stack (so no SparseArrays), or execution times will be atrocious - or I would have to work on preallocated memory, but expressing maths in imperative code with pre-allocation… thanks but no thanks.
  2. It’s not just vectors with a single non-zero: that’s just what a “seed” looks like in forward automatic differentiation. As partials combine, new partials with arbitrary sparsity patterns emerge (see the sketch after this list). Hence:
  3. Compilation times are going to be atrocious under my scheme. I will have to learn about precompilation and caching of machine code.
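
To make point 2 concrete, here is a minimal sketch using the Nix tuples from my first post (the values are made up):

x1, x2 = 2.0, 3.0
∂x1 = (1.0, nix, nix)            # seed of the first dof
∂x2 = (nix, 1.0, nix)            # seed of the second dof
# product rule for x1*x2: the resulting partials mix both seed patterns
∂prod = x2 .* ∂x1 .+ x1 .* ∂x2   # -> (3.0, 2.0, ⋅)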

I have an idea on how to create a compilation-time benchmark. I’ll come back with that in a few days.

:slight_smile:
