Fastest way of doing (constant sparse) * SVector multiplication

natschil · October 15, 2018, 9:41pm

Background: I have an implementation of tricubic interpolation that works, but is ~20x slower than my implementation of trilinear interpolation despite only requiring (in theory) 8x as many memory accesses.

Essentially I’m looking for some tips as to how I can quickly multiply constant sparse matrices by Static Vectors. I have a constant, precomputed 64x64 sparse matrix (with some additional (nontrivial) structure) with about 1000 non-zero entries (for exact values, see CopernicusUtils.jl/coeff.jl at master · CoherentStructures/CopernicusUtils.jl · GitHub )

Let’s call this matrix A

I have a second sparse 64x64 matrix (with about 300 nonzero entries) representing various finite-difference stencils, call this matrix F (has a typo somewhere, but essentially: gist:45637c801d918b344b2a7d9102794c4b · GitHub).

I have two 64-entry @SVectors, denoted by u,w.

I now wish to compute:

```w’ * A* F * u``

Note that actually I never have w as a @SVector, as it turns out it is cheaper to just compute the entries when I need them (they are of the form x^i y^j z^k )). I’ve tried the following, where I do all multiplications “by hand” (to avoid allocations) by iterating over the nonzero elements:
a) Compute AF in advance. Compute w’AF, then multiply that by u
b) Compute a value of Fu by a different explicit formula for each row of u then the corresponding one of w’A , multiply them and sum while going along.
c) Same as (b), but use a sparse matrix representation of Fu

It turns out that (b) is faster than ( c) which is faster than (a). Note that AF has about 1300 stored entries.

The problem with approach (b) is that it is ugly, and really obscures what is going on, as I essentially have mix the code for generating F with that for multiplying by w’A. You can see something similar to approach (b) in the function base_tricubic_interpolation at CopernicusUtils.jl/interpolation.jl at master · CoherentStructures/CopernicusUtils.jl · GitHub .

My question now is: is there some trick I’ve missed for multiplying a constant sparse matrix (known at compile time, never changes) by a static vector? If I wrote a macro that really writes out the multiplication column by column, should I expect this to be faster than a loop?

tkoolen · October 15, 2018, 11:12pm

You could use a generated function to fully unroll the computation. The following is just for the problem of computing A * x, but you can extend it to your specific use case. Note that what I’m doing in the generated function isn’t exactly kosher (calling non-pure functions), but just as a quick and dirty evaluation:

using StaticArrays
using SparseArrays
using Random
using BenchmarkTools
using Test

# function to create generated function body given A
function sparse_times_svec_body(A::SparseMatrixCSC{TA}, x::Symbol) where TA
    n, m = size(A)
    coeff_initializers = Expr(:block, [:($(Symbol("y", i)) = zero(T)) for i = 1 : n]...)
    coeff_updates = Expr(:block)
    rows = rowvals(A)
    vals = nonzeros(A)
    for col = 1 : m
        for j in nzrange(A, col)
            row = rows[j]
            val = vals[j]
            push!(coeff_updates.args, :($(Symbol("y", row)) += $val * $x[$col]))
        end
    end
    quote
        T = promote_type($TA, eltype($x))
        $coeff_initializers
        $coeff_updates
        $(:(return SVector(tuple($([Symbol("y", i) for i = 1 : n]...)))))
    end
end

# Create a random 64x64 A matrix with approximately 1000 entries
const A = let
    rng = MersenneTwister(1)
    n, m = 64, 64
    p = 1000 / (n * m)
    sprand(rng, n, m, p)
end

# generated function
@generated function A_times_svec(x::SVector{m}) where m
    sparse_times_svec_body(A, :x)
end

First, to make sure the result is correct:

let x = rand(SVector{64})
    @test A * x ≈ A_times_svec(x)
end

passes.

Benchmark results:

@btime A_times_svec(x) setup = x = rand(SVector{64});

results in 271.064 ns (0 allocations: 0 bytes) on my machine, compared to the baseline

@btime A * x setup = x = rand(SVector{64});

which results in 42.020 μs (1 allocation: 624 bytes).

To see what’s going on, you can get the code that’s being generated for a toy version using e.g.:

sparse_times_svec_body(sparse([1, 2, 3], [2, 3, 4], [1.0, 2.0, 3.0]), :x)

which gives

quote
    #= In[10]:15 =#
    T = promote_type(Float64, eltype(x))
    #= In[10]:16 =#
    begin
        y1 = zero(T)
        y2 = zero(T)
        y3 = zero(T)
    end
    #= In[10]:17 =#
    begin
        y1 += 1.0 * x[2]
        y2 += 2.0 * x[3]
        y3 += 3.0 * x[4]
    end
    #= In[10]:18 =#
    return SVector(tuple(y1, y2, y3))
end

Don’t push this too far; compile times will blow up pretty quickly.

Tamas_Papp · October 16, 2018, 7:46am

Perhaps you already did this, but have you considered not using sparse matrices in this case? As they are not that sparse, a well-optimized BLAS (eg MKL) could be a simple and competitive option.

natschil · October 16, 2018, 9:35am

This is a good point. What has stopped me from doing this is that I don’t see a clear way of doing this without introducing additional allocations. Multiplying a (dense or sparse) matrix by a SVector gives back an Array, and I’m not sure whether there is any way of getting around this, as julia doesn’t allow for objects on the stack to be modified as far as I know.

natschil · October 16, 2018, 9:37am

Thanks a lot, this is very nice! I will compare this to what I currently have (which is probably a bit faster than doing A*x as that allocates something), but I already now suspect that what you have will be a lot faster!

baggepinnen · October 16, 2018, 9:52am

Are the extra allocations an actual problem if the code would end up significantly faster?

Tamas_Papp · October 16, 2018, 10:04am

Consider preallocated output, eg LinearAlgebra.mul!. What @baggepinnen says also applies.

StaticArrays is really amazing, but above a certain size (which, unfortunately, is problem-specific and can only be determined by benchmarking), the advantage is not that great.

natschil · October 16, 2018, 10:12am

@baggepinnen: No, the extra allocations are not really a problem, but so far I’ve found that whenever I reduce them, my code gets a lot faster. This is a function I want to be able to call hundreds of millions of times in a different function, so I don’t know if I want to allocate an array on each call.

@Tamas_Papp: you mean like a global preallocated output?

tkoolen · October 16, 2018, 10:33am

On my machine, the generated function is a bit faster than just treating A as dense and using mul!:

using BenchmarkTools, LinearAlgebra
A = rand(64, 64)
y = zeros(64)
@btime mul!($y, $A, x) setup = x = rand(64)

results in 364.075 ns (0 allocations: 0 bytes), compared to 271.064 ns (0 allocations: 0 bytes) for the generated function.

Tamas_Papp · October 16, 2018, 10:40am

I would refrain from globals. I usually define a composite type (struct) for problems I solve anyway; I would just create a buffer there to reuse.

Topic		Replies	Views
Fastest way to multiply a constant sparse matrix to a vector Performance	7	1899	October 22, 2018
Taking advantage of rowwise-constant sparse matrix Performance linearalgebra , sparse	7	523	August 30, 2021
Fast lookup into (almost) constant array General Usage	6	1008	May 23, 2018
Extend for vector-valued matrix multiplication? General Usage question , array	14	1150	March 25, 2020
Matrix of SVectors allocating Performance staticarrays	7	297	August 2, 2024

Fastest way of doing (constant sparse) * SVector multiplication

Related topics