[ANN] Fast SpMv with CompressedSparseBlocks.jl

If you have an iterative computation whose run time is dominated by large sparse matrix-vector multiplications,

julia> using LinearAlgebra, SparseArrays, BenchmarkTools

julia> n = 2^22; d = 10; A = sprand(n,n,d/n); x = rand(n);

julia> y = @btime $A*$x;
  909.738 ms (2 allocations: 32.00 MiB)

julia> yt = @btime $(transpose(A))*$x;
  640.637 ms (2 allocations: 32.00 MiB)

you may want to consider the CompressedSparseBlocks.jl package, a Julia wrapper around the CSB library.

julia> using CompressedSparseBlocks

Transforming a SparseMatrixCSC into a SparseMatrixCSB is straightforward, though it might take a few seconds for very large matrices.

julia> Ac = SparseMatrixCSB(A);

The transformation cost is quickly amortized by the speedup from CSB, as the break-even estimate below illustrates.

julia> yc = @btime $Ac*$x;
  352.766 ms (2 allocations: 32.00 MiB)

julia> yc ≈ y
true

julia> yct = @btime $(transpose(Ac))*$x;
  379.569 ms (3 allocations: 32.00 MiB)

julia> yct ≈ yt
true
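
How many products does it take to recoup the conversion? A back-of-the-envelope sketch (the hard-coded per-product saving comes from the two A*x timings above; your numbers will vary):

julia> tconv = @elapsed SparseMatrixCSB(A);  # one-time conversion cost

julia> saving = 0.909738 - 0.352766;  # seconds saved per product, from the timings above

julia> ceil(Int, tconv / saving)  # number of products needed to break even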

Enjoy!


Thanks, this looks nice! Which scalar types does this package support?

Could you compare to the case where one is using sparse MKL?

The figure in the README compares CSB to MKLSparse.

Currently, it supports Float64. It should be straightforward to add additional scalar types (e.g., Bool or Float32) in the C/C++ interface.
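
Until then, one possible workaround is to widen the matrix to Float64 before wrapping it. A minimal sketch (at the cost of larger storage for the values):

julia> using SparseArrays, CompressedSparseBlocks

julia> A32 = sprand(Float32, 10^6, 10^6, 1e-5);

julia> Ac = SparseMatrixCSB(SparseMatrixCSC{Float64,Int}(A32));  # widen eltype to Float64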

It looks like CSB uses threads by default (which Julia's SparseArrays does not), so a fairer comparison would be:

julia> using LinearAlgebra, SparseArrays, BenchmarkTools,
             CompressedSparseBlocks, ThreadedSparseArrays

julia> n = 2^22; d = 10; A = sprand(n,n,d/n); x = rand(n);

# Regular A' * x (no threading)
julia> @btime $(transpose(A)) * $x;
  466.139 ms (2 allocations: 32.00 MiB)

# SparseMatrixCSB (8 threads)
julia> @btime $(transpose(SparseMatrixCSB(A))) * $x;
  155.475 ms (2 allocations: 32.00 MiB)

# ThreadedSparseMatrixCSC (8 threads)
julia> @btime $(transpose(ThreadedSparseMatrixCSC(A))) * $x;
  218.281 ms (63 allocations: 32.01 MiB)

(Still some speedup, though not as drastic as in the OP.)


That is correct. The figure on the README compares against MKLSparse, which is also multithreaded.

One additional advantage of CSB is that multiplication with A does not suffer from higher latency than multiplication with the transposed matrix. This symmetric performance eliminates the need to keep an additional copy in a different layout (as with CSR alongside CSC) to close the speed gap at the cost of doubling memory consumption; see the sketch after the timings below.

julia> using LinearAlgebra, SparseArrays, BenchmarkTools,
             CompressedSparseBlocks, ThreadedSparseArrays

julia> n = 2^22; d = 10; A = sprand(n,n,d/n); x = rand(n);

julia> Threads.nthreads()
10

julia> CompressedSparseBlocks.getWorkers()
10

# Regular A * x (no threading)
julia> @btime $(A) * $x;
  796.632 ms (2 allocations: 32.00 MiB)

# SparseMatrixCSB (10 threads)
julia> @btime $(SparseMatrixCSB(A)) * $x;
  102.310 ms (2 allocations: 32.00 MiB)

# ThreadedSparseMatrixCSC (10 threads)
julia> @btime $(ThreadedSparseMatrixCSC(A)) * $x;
  795.302 ms (20 allocations: 32.00 MiB)
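
To make that concrete: with CSC one typically stores a second, explicitly transposed copy to get fast products with both A and A', whereas a single CSB matrix serves both. A sketch reusing A and x from above:

julia> At = sparse(transpose(A));  # CSC workaround: materialized copy of A', doubles memory

julia> Ac = SparseMatrixCSB(A);  # CSB: one matrix for both products

julia> y1 = Ac * x;  # A * x

julia> y2 = transpose(Ac) * x;  # A' * x, no extra copy needed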


The text below the figure says MKL but the legend says otherwise, so I'm not sure I understand the relative performance compared to MKL.

Thank you for your interest in the package!
Each plot shows the relative speedup (in wall-clock execution time) of three operations:

  • CSC transp. (via MKL) A' * x
  • CSB (CompressedSparseBlocks.jl) Acsb * x
  • CSB transp. (CompressedSparseBlocks.jl) Acsb' * x

with respect to CSC (via MKL) A * x.

The different plots correspond to different densities (d) and numbers of right-hand-side vectors (RHS).

The environment and the script for generating the figure are in the repository under benchmarks/run_benchmarks.jl.
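
For reference, here is a minimal sketch of how such relative speedups can be measured, using stock SparseArrays (rather than MKL) as the baseline; the full setup is in the script above:

julia> using LinearAlgebra, SparseArrays, BenchmarkTools, CompressedSparseBlocks

julia> Acsb = SparseMatrixCSB(A);

julia> t0 = @belapsed $A * $x;  # baseline: CSC A * x

julia> t1 = @belapsed $(transpose(A)) * $x;  # CSC transp.

julia> t2 = @belapsed $Acsb * $x;  # CSB

julia> t3 = @belapsed $(transpose(Acsb)) * $x;  # CSB transp.

julia> t0 ./ [t1, t2, t3]  # relative speedups with respect to CSC A * x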

We will add more comments to the README to clarify these points.
