Fastest way of doing (constant sparse) * SVector multiplication

Background: I have an implementation of tricubic interpolation that works, but is ~20x slower than my implementation of trilinear interpolation despite only requiring (in theory) 8x as many memory accesses.

Essentially I’m looking for some tips as to how I can quickly multiply constant sparse matrices by Static Vectors. I have a constant, precomputed 64x64 sparse matrix (with some additional (nontrivial) structure) with about 1000 non-zero entries (for exact values, see )

Let’s call this matrix A

I have a second sparse 64x64 matrix (with about 300 nonzero entries) representing various finite-difference stencils, call this matrix F (has a typo somewhere, but essentially:

I have two 64-entry @SVectors, denoted by u,w.

I now wish to compute:

```w’ * A* F * u``

Note that actually I never have w as a @SVector, as it turns out it is cheaper to just compute the entries when I need them (they are of the form x^i y^j z^k )). I’ve tried the following, where I do all multiplications “by hand” (to avoid allocations) by iterating over the nonzero elements:
a) Compute AF in advance. Compute w’AF, then multiply that by u
b) Compute a value of Fu by a different explicit formula for each row of u then the corresponding one of w’A , multiply them and sum while going along.
c) Same as (b), but use a sparse matrix representation of Fu

It turns out that (b) is faster than ( c) which is faster than (a). Note that AF has about 1300 stored entries.

The problem with approach (b) is that it is ugly, and really obscures what is going on, as I essentially have mix the code for generating F with that for multiplying by w’A. You can see something similar to approach (b) in the function base_tricubic_interpolation at .

My question now is: is there some trick I’ve missed for multiplying a constant sparse matrix (known at compile time, never changes) by a static vector? If I wrote a macro that really writes out the multiplication column by column, should I expect this to be faster than a loop?

You could use a generated function to fully unroll the computation. The following is just for the problem of computing A * x, but you can extend it to your specific use case. Note that what I’m doing in the generated function isn’t exactly kosher (calling non-pure functions), but just as a quick and dirty evaluation:

using StaticArrays
using SparseArrays
using Random
using BenchmarkTools
using Test

# function to create generated function body given A
function sparse_times_svec_body(A::SparseMatrixCSC{TA}, x::Symbol) where TA
    n, m = size(A)
    coeff_initializers = Expr(:block, [:($(Symbol("y", i)) = zero(T)) for i = 1 : n]...)
    coeff_updates = Expr(:block)
    rows = rowvals(A)
    vals = nonzeros(A)
    for col = 1 : m
        for j in nzrange(A, col)
            row = rows[j]
            val = vals[j]
            push!(coeff_updates.args, :($(Symbol("y", row)) += $val * $x[$col]))
        T = promote_type($TA, eltype($x))
        $(:(return SVector(tuple($([Symbol("y", i) for i = 1 : n]...)))))

# Create a random 64x64 A matrix with approximately 1000 entries
const A = let
    rng = MersenneTwister(1)
    n, m = 64, 64
    p = 1000 / (n * m)
    sprand(rng, n, m, p)

# generated function
@generated function A_times_svec(x::SVector{m}) where m
    sparse_times_svec_body(A, :x)

First, to make sure the result is correct:

let x = rand(SVector{64})
    @test A * x ≈ A_times_svec(x)


Benchmark results:

@btime A_times_svec(x) setup = x = rand(SVector{64});

results in 271.064 ns (0 allocations: 0 bytes) on my machine, compared to the baseline

@btime A * x setup = x = rand(SVector{64});

which results in 42.020 μs (1 allocation: 624 bytes).

To see what’s going on, you can get the code that’s being generated for a toy version using e.g.:

sparse_times_svec_body(sparse([1, 2, 3], [2, 3, 4], [1.0, 2.0, 3.0]), :x)

which gives

    #= In[10]:15 =#
    T = promote_type(Float64, eltype(x))
    #= In[10]:16 =#
        y1 = zero(T)
        y2 = zero(T)
        y3 = zero(T)
    #= In[10]:17 =#
        y1 += 1.0 * x[2]
        y2 += 2.0 * x[3]
        y3 += 3.0 * x[4]
    #= In[10]:18 =#
    return SVector(tuple(y1, y2, y3))

Don’t push this too far; compile times will blow up pretty quickly.

Perhaps you already did this, but have you considered not using sparse matrices in this case? As they are not that sparse, a well-optimized BLAS (eg MKL) could be a simple and competitive option.

1 Like

This is a good point. What has stopped me from doing this is that I don’t see a clear way of doing this without introducing additional allocations. Multiplying a (dense or sparse) matrix by a SVector gives back an Array, and I’m not sure whether there is any way of getting around this, as julia doesn’t allow for objects on the stack to be modified as far as I know.

Thanks a lot, this is very nice! I will compare this to what I currently have (which is probably a bit faster than doing A*x as that allocates something), but I already now suspect that what you have will be a lot faster!

Are the extra allocations an actual problem if the code would end up significantly faster?

1 Like

Consider preallocated output, eg LinearAlgebra.mul!. What @baggepinnen says also applies.

StaticArrays is really amazing, but above a certain size (which, unfortunately, is problem-specific and can only be determined by benchmarking), the advantage is not that great.

@baggepinnen: No, the extra allocations are not really a problem, but so far I’ve found that whenever I reduce them, my code gets a lot faster. This is a function I want to be able to call hundreds of millions of times in a different function, so I don’t know if I want to allocate an array on each call.

@Tamas_Papp: you mean like a global preallocated output?

On my machine, the generated function is a bit faster than just treating A as dense and using mul!:

using BenchmarkTools, LinearAlgebra
A = rand(64, 64)
y = zeros(64)
@btime mul!($y, $A, x) setup = x = rand(64)

results in 364.075 ns (0 allocations: 0 bytes), compared to 271.064 ns (0 allocations: 0 bytes) for the generated function.

I would refrain from globals. I usually define a composite type (struct) for problems I solve anyway; I would just create a buffer there to reuse.