If you have an iterative computation whose time is dominated by a large sparse matrix-vector multiplication,
julia> using LinearAlgebra, SparseArrays, BenchmarkTools
julia> n = 2^22; d = 10; A = sprand(n,n,d/n); x = rand(n);
julia> y = @btime $A*$x;
909.738 ms (2 allocations: 32.00 MiB)
julia> yt = @btime $(transpose(A))*$x;
640.637 ms (2 allocations: 32.00 MiB)
you may want to consider the CompressedSparseBlocks package, a Julia wrapper around the CSB library.
julia> using CompressedSparseBlocks
Transforming a SparseMatrixCSC into a SparseMatrixCSB is straightforward, though it might take a few seconds for very large matrices.
julia> Ac = SparseMatrixCSB(A);
but the one-time transformation cost is quickly amortized by the speedup from CSB.
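If you want to gauge the conversion cost on your machine, you can time it directly (a minimal sketch; the number depends on the matrix size and sparsity):
julia> @btime SparseMatrixCSB($A);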
julia> yc = @btime $Ac*$x;
352.766 ms (2 allocations: 32.00 MiB)
julia> yc ≈ y
true
julia> yct = @btime $(transpose(Ac))*$x;
379.569 ms (3 allocations: 32.00 MiB)
julia> yct ≈ yt
true
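As a sketch of drop-in use (a hypothetical example, not from the package docs), here is a plain power iteration where the SparseMatrixCSB simply replaces the SparseMatrixCSC:
julia> function poweriter(A, x, k)
           for _ in 1:k
               x = A * x          # dominant cost: sparse mat-vec
               x ./= norm(x)      # normalize (norm is from LinearAlgebra)
           end
           return x
       end
julia> v = poweriter(Ac, rand(n), 10);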
Enjoy!
Thanks, this looks nice! Which scalar types does this package support?
Could you compare it to the case where one uses sparse MKL?
The figure in the README compares CSB to MKLSparse.
Currently, it supports Float64. It should be straightforward to add additional scalar types (e.g., Bool, Float32) in the C/C++ interface.
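In the meantime, a minimal workaround sketch is to widen other element types to Float64 before wrapping (the conversion below is plain SparseArrays functionality):
julia> B = sprand(Float32, 10^4, 10^4, 1e-3);
julia> Bc = SparseMatrixCSB(SparseMatrixCSC{Float64, Int}(B));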
It looks like it uses threads by default (which Julia does not) so a more fair comparison would be:
julia> using LinearAlgebra, SparseArrays, BenchmarkTools, CompressedSparseBlocks, ThreadedSparseArrays
julia> n = 2^22; d = 10; A = sprand(n,n,d/n); x = rand(n);
# Regular A' * x (no threading)
julia> @btime $(transpose(A)) * $x;
466.139 ms (2 allocations: 32.00 MiB)
# SparseMatrixCSB (8 threads)
julia> @btime $(transpose(SparseMatrixCSB(A))) * $x;
155.475 ms (2 allocations: 32.00 MiB)
# ThreadedSparseMatrixCSC (8 threads)
julia> @btime $(transpose(ThreadedSparseMatrixCSC(A))) * $x;
218.281 ms (63 allocations: 32.01 MiB)
(Still some speedup, though not as drastic as in the OP.)
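For completeness, both thread pools can be inspected; the Julia side is set when starting Julia (e.g., julia -t 8 or the JULIA_NUM_THREADS environment variable), and getWorkers appears in the reply below:
julia> Threads.nthreads()                   # Julia threads (used by ThreadedSparseArrays)
julia> CompressedSparseBlocks.getWorkers()  # workers used by the CSB library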
That is correct. The figure on the README compares against MKLSparse, which is also multithreaded.
One additional advantage of CSB is that multiplication with A does not suffer a longer latency than multiplication with the transposed matrix. This symmetric performance eliminates the need to keep an additional copy in a different layout (CSR alongside CSC) to close the speed gap, which would double the memory consumption.
julia> using LinearAlgebra, SparseArrays, BenchmarkTools, CompressedSparseBlocks, ThreadedSparseArrays
julia> n = 2^22; d = 10; A = sprand(n,n,d/n); x = rand(n);
julia> Threads.nthreads()
10
julia> CompressedSparseBlocks.getWorkers()
10
# Regular A * x (no threading)
julia> @btime $(A) * $x;
796.632 ms (2 allocations: 32.00 MiB)
# SparseMatrixCSB (10 threads)
julia> @btime $(SparseMatrixCSB(A)) * $x;
102.310 ms (2 allocations: 32.00 MiB)
# ThreadedSparseMatrixCSC (10 threads)
julia> @btime $(ThreadedSparseMatrixCSC(A)) * $x;
795.302 ms (20 allocations: 32.00 MiB)
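For contrast, here is a sketch of the usual CSC workaround that this symmetric performance avoids: materializing a second copy in the transposed layout so the slow direction can go through the fast transposed kernel, at the cost of doubling the memory:
julia> At = sparse(transpose(A));  # second copy, transposed (CSR-like) layout
julia> y2 = transpose(At) * x;     # computes A * x via the faster transposed kernel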
The text below the figure says MKL but the legend says differently, so I'm not sure I understand the relative performance compared to MKL.
Thank you for your interest in the package!
Each plot shows the relative speedup (in wall-clock execution time) of 3 operations
- CSC transp. (via MKL): A' * x
- CSB (CompressedSparseBlocks.jl): Acsb * x
- CSB transp. (CompressedSparseBlocks.jl): Acsb' * x
with respect to CSC (via MKL): A * x.
The different plots correspond to different densities (d) and numbers of right-hand-side vectors (RHS).
The environment and the script for generating the figure are committed under benchmarks/run_benchmarks.jl.
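For example, a multi-vector product corresponding to one of the RHS panels would look like this (a sketch; it assumes SparseMatrixCSB supports dense-matrix right-hand sides, as the RHS panels suggest):
julia> Acsb = SparseMatrixCSB(A);
julia> X = rand(n, 4);   # 4 right-hand sides
julia> Y = Acsb * X;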
We will add more comments to the README to clarify these points.