Sparse matrix vector faster than dense matrix vector?

As I was teaching sparse matrices in class, I was humbled by a theory-practice gap.

The idea in theory: don’t use sparse matrices when you really have a dense matrix.

The idea in practice: here, let me show you in Julia!


using BenchmarkTools, SparseArrays
n = 10

## dense matrix-vector product
@benchmark A*v setup=begin
    A = randn(n,n)
    v = randn(n)
end

## same data, stored as a sparse matrix
@benchmark A*v setup=begin
    A = sparse(randn(n,n))
    v = randn(n)
end

And then this shows:

Dense case

BenchmarkTools.Trial: 10000 samples with 961 evaluations.
 Range (min … max):  86.672 ns … 798.907 ns  ┊ GC (min … max): 0.00% … 85.88%
 Time  (median):     89.663 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   92.006 ns ±  23.905 ns  ┊ GC (mean ± σ):  1.33% ±  4.33%

Sparse case

BenchmarkTools.Trial: 10000 samples with 965 evaluations.
 Range (min … max):  81.649 ns … 804.188 ns  ┊ GC (min … max): 0.00% … 86.96%
 Time  (median):     86.010 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   89.411 ns ±  31.736 ns  ┊ GC (mean ± σ):  1.85% ±  4.61%

So the sparse case is 3 ns faster than the dense case?!?

As I said, I was humbled in lecture :slight_smile:


Notes

  • Yes, the difference goes away for larger n and sparse is about 3x slower than dense.
  • I haven’t seen this on any Intel/AMD x86-64 architectures.
  • This does not happen on Julia 1.8, so something changed in 1.9 that made the sparse case faster.
  • The following code shows that the dense case really should be faster…
## Copy-pasted sparse matrix vector product without allocations
using BenchmarkTools, SparseArrays, LinearAlgebra
function _spmatmul!(C, A, B, α, β)
    size(A, 2) == size(B, 1) || throw(DimensionMismatch())
    size(A, 1) == size(C, 1) || throw(DimensionMismatch())
    size(B, 2) == size(C, 2) || throw(DimensionMismatch())
    nzv = nonzeros(A)
    rv = rowvals(A)
    β != one(β) && LinearAlgebra._rmul_or_fill!(C, β)
    for k in 1:size(C, 2)
        @inbounds for col in 1:size(A, 2)
            αxj = B[col,k] * α
            for j in nzrange(A, col)
                C[rv[j], k] += nzv[j]*αxj
            end
        end
    end
    C
end
@benchmark _spmatmul!(x,A,v,1.0,0.0) setup=begin
    A = sparse(randn(n,n))
    v = randn(n)
    x = zeros(n) 
end
## Equivalent dense-matrix vector product 
using BenchmarkTools, SparseArrays, LinearAlgebra
function _matmul!(C, A, B, α, β)
    size(A, 2) == size(B, 1) || throw(DimensionMismatch())
    size(A, 1) == size(C, 1) || throw(DimensionMismatch())
    size(B, 2) == size(C, 2) || throw(DimensionMismatch())
    β != one(β) && LinearAlgebra._rmul_or_fill!(C, β)
    for k in 1:size(C, 2)
        @inbounds for col in 1:size(A, 2)
            αxj = B[col,k] * α
            for i in 1:size(A,1)
                C[i, k] += A[i,col]*αxj
            end
        end
    end
    C
end
@benchmark _matmul!(x,A,v,1.0,0.0) setup=begin
    A = randn(n,n)
    v = randn(n)
    x = zeros(n) 
end
@benchmark mul!(x,A,v,1.0,0.0) setup=begin
    A = randn(n,n)
    v = randn(n)
    x = zeros(n) 
end

Let’s do our bakeoff…

Julia 1.8.4 (MacBook Air M2)

dense _matmul! - 44.7 ns
dense LinearAlgebra.mul! - 73.8 ns
sparse _spmatmul! - 74.3 ns

Julia 1.9.2 (MacBook Air M2)

dense _matmul! - 43.8 ns
dense LinearAlgebra.mul! - 73.0 ns
sparse _spmatmul! - 68.2 ns


The question: does anyone know what might have changed in 1.9 that would have made the sparse method faster?

Also, anyone know why the dense code for small n is slower than expected?

Some observations:

  • It doesn’t look like your sparse array is actually sparse.
    The fill factor looks greater than 0.95 (at least)

  • Standard deviation in the measurements is >24 nsec
    so a 3-5 nsec difference is in the noise (not significant)

  • With matrix multiply routines, hardware/processor differences
    can affect the results.

I believe what’s happening here is that the M1/M2 chips are really good at prefetching memory through pointers, which makes the sparse code work better. Also, with a completely dense sparse matrix, the processor is probably able to fully predict all the branches about where the next data will be. The really bad case for sparse arrays is between ~10% and ~90% full, since that is where the access pattern is very unpredictable.
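
To make the fill-factor point concrete: sparse(randn(n,n)) stores essentially every entry, because randn almost never produces exact zeros. Here is a minimal sketch (not from the thread; the 10% density is an arbitrary illustrative choice) that checks the fill factor and benchmarks a genuinely sparse matrix for comparison:

## Check the fill factor, then benchmark at a controlled density
using BenchmarkTools, SparseArrays
n = 10
A = sparse(randn(n,n))
nnz(A) / length(A)   # fill factor of the matrix above; expect 1.0
@benchmark A*v setup=begin
    A = sprandn(n, n, 0.1)   # ~10% nonzeros instead of a fully dense pattern
    v = randn(n)
end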

@dgleich I wonder what the situation with multithreading is? The dense matrix multiplication would run on multiple threads by default, I believe? Now, that is probably moot for the small matrices, but at some point it will mix things up, won’t it?

The prefetching on the M1/M2 doesn’t explain why this wasn’t as fast on Julia 1.8, though. It also doesn’t explain why the dense case is slower than you might expect.

@PetrKryslUCSD Oh – good idea to check that!

Threads.nthreads() = 1
LinearAlgebra.BLAS.get_num_threads() = 4

so I re-ran after LinearAlgebra.BLAS.set_num_threads(1)

and got the same 73 ns result. So that doesn’t help explain it. I will have to double-check with bigger matrices to see whether threading explains the factor of 3 vs. sparse.

The dense case is slower because there is a pretty high O(1) overhead for calling BLAS (due to calling conventions and the like). I’m also not that surprised about Julia 1.8 since we had pretty poor support for M1 in 1.8.
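
A rough way to see that fixed overhead is to compare mul! with the hand-written _matmul! from above at a few sizes (a sketch, not from the thread; the sizes are arbitrary and the numbers will be machine-dependent):

## Sketch: fixed BLAS call overhead vs. the simple loop at a few sizes
using BenchmarkTools, LinearAlgebra
for m in (4, 10, 50, 200)
    A = randn(m, m); v = randn(m); x = zeros(m)
    t_blas = @belapsed mul!($x, $A, $v, 1.0, 0.0)       # dispatches to BLAS
    t_loop = @belapsed _matmul!($x, $A, $v, 1.0, 0.0)   # hand-written loop from above
    println("n = $m: mul! $(round(t_blas*1e9, digits=1)) ns, _matmul! $(round(t_loop*1e9, digits=1)) ns")
end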

Okay, thanks – I think that’s the other missing piece (the high BLAS overhead). Would it be fair to say this is most likely due to the following?

  • BLAS overhead is higher on AppleARM than x86-64?
  • AppleARM gets better perf for simple codes than x86-64?

So these combine to give a different trade-off point between simple loops and BLAS for small n?

E.g., I was finally able to replicate similar behavior on an Intel machine with an n=6 matrix.

Thanks!

If you have such small matrices and the size is a fixed compile-time constant and you care a lot about performance, you are often better off telling the compiler the size of the array with StaticArrays.jl so that it just unrolls everything.
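
A minimal sketch of what that might look like for the n = 10 case above (StaticArrays.jl is not part of the original benchmark code; the size is hard-coded here):

## Sketch: static arrays with the size baked into the type
using BenchmarkTools, StaticArrays
@benchmark A*v setup=begin
    A = @SMatrix randn(10, 10)   # the size is part of the type, so the product fully unrolls
    v = @SVector randn(10)
end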


Thanks for pointing this out!

Agree that StaticArrays would be better for a fixed small size. In this case, it runs in 6 ns :slight_smile:
