Background: I have an implementation of tricubic interpolation that works, but is ~20x slower than my implementation of trilinear interpolation despite only requiring (in theory) 8x as many memory accesses.

Essentially I’m looking for some tips as to how I can quickly multiply constant sparse matrices by Static Vectors. I have a constant, precomputed 64x64 sparse matrix (with some additional (nontrivial) structure) with about 1000 non-zero entries (for exact values, see https://github.com/CoherentStructures/CopernicusUtils.jl/blob/master/src/coeff.jl )

Let’s call this matrix A

I have a second sparse 64x64 matrix (with about 300 nonzero entries) representing various finite-difference stencils, call this matrix F (has a typo somewhere, but essentially: https://gist.github.com/natschil/45637c801d918b344b2a7d9102794c4b).

I have two 64-entry @SVectors, denoted by u,w.

I now wish to compute:

```w’ * A* F * u``

Note that actually I never have w as a @SVector, as it turns out it is cheaper to just compute the entries when I need them (they are of the form x^i y^j z^k )). I’ve tried the following, where I do all multiplications “by hand” (to avoid allocations) by iterating over the nonzero elements:

a) Compute AF in advance. Compute w’AF, then multiply that by u

b) Compute a value of Fu by a different explicit formula for each row of u then the corresponding one of w’A , multiply them and sum while going along.

c) Same as (b), but use a sparse matrix representation of Fu

It turns out that (b) is faster than ( c) which is faster than (a). Note that AF has about 1300 stored entries.

The problem with approach (b) is that it is ugly, and really obscures what is going on, as I essentially have mix the code for generating F with that for multiplying by w’A. You can see something similar to approach (b) in the function `base_tricubic_interpolation`

at https://github.com/CoherentStructures/CopernicusUtils.jl/blob/master/src/interpolation.jl .

My question now is: is there some trick I’ve missed for multiplying a constant sparse matrix (known at compile time, never changes) by a static vector? If I wrote a macro that really writes out the multiplication column by column, should I expect this to be faster than a loop?