If you have an iterative computation whose time is dominated by a large sparse matrix-vector multiplication,
julia> using LinearAlgebra, SparseArrays, BenchmarkTools
julia> n = 2^22; d = 10; A = sprand(n,n,d/n); x = rand(n);
julia> y = @btime $A*$x;
909.738 ms (2 allocations: 32.00 MiB)
julia> yt = @btime $(transpose(A))*$x;
640.637 ms (2 allocations: 32.00 MiB)
you may want to consider the CompressedSparseBlocks package, a Julia wrapper around the CSB library.
julia> using CompressedSparseBlocks
Transforming a SparseMatrixCSC into a SparseMatrixCSB is straightforward, though it might take a few seconds for very large matrices.
julia> Ac = SparseMatrixCSB(A);
but the one-time transformation cost is quickly amortized by the speedup from CSB.
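If you want to gauge the conversion cost on your machine, you can time it directly (a minimal sketch; the number depends on the matrix size and sparsity):
julia> @btime SparseMatrixCSB($A);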
julia> yc = @btime $Ac*$x;
352.766 ms (2 allocations: 32.00 MiB)
julia> yc ≈ y
true
julia> yct = @btime $(transpose(Ac))*$x;
379.569 ms (3 allocations: 32.00 MiB)
julia> yct ≈ yt
true
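As a sketch of drop-in use (a hypothetical example, not from the package docs), here is a plain power iteration where the SparseMatrixCSB simply replaces the SparseMatrixCSC:
julia> function poweriter(A, x, k)
           for _ in 1:k
               x = A * x          # dominant cost: sparse mat-vec
               x ./= norm(x)      # normalize (norm is from LinearAlgebra)
           end
           return x
       end
julia> v = poweriter(Ac, rand(n), 10);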
Enjoy!
Thanks, this looks nice! Which scalar types does this package support?
Could you compare it to the case where one uses sparse MKL?
The figure in the README compares CSB to MKLSparse.
Currently, it supports Float64. It should be straightforward to add additional scalar types (e.g., Bool, Float32) in the C/C++ interface.
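In the meantime, a minimal workaround sketch is to widen other element types to Float64 before wrapping (the conversion below is plain SparseArrays functionality):
julia> B = sprand(Float32, 10^4, 10^4, 1e-3);
julia> Bc = SparseMatrixCSB(SparseMatrixCSC{Float64, Int}(B));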
It looks like it uses threads by default (which Julia does not) so a more fair comparison would be:
julia> using LinearAlgebra, SparseArrays, BenchmarkTools, CompressedSparseBlocks, ThreadedSparseArrays
julia> n = 2^22; d = 10; A = sprand(n,n,d/n); x = rand(n);
# Regular A' * x (no threading)
julia> @btime $(transpose(A)) * $x;
466.139 ms (2 allocations: 32.00 MiB)
# SparseMatrixCSB (8 threads)
julia> @btime $(transpose(SparseMatrixCSB(A))) * $x;
155.475 ms (2 allocations: 32.00 MiB)
# ThreadedSparseMatrixCSC (8 threads)
julia> @btime $(transpose(ThreadedSparseMatrixCSC(A))) * $x;
218.281 ms (63 allocations: 32.01 MiB)
(Still some speedup, though not as drastic as in the OP.)
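For completeness, both thread pools can be inspected; the Julia side is set when starting Julia (e.g., julia -t 8 or the JULIA_NUM_THREADS environment variable), and getWorkers appears in the reply below:
julia> Threads.nthreads()                   # Julia threads (used by ThreadedSparseArrays)
julia> CompressedSparseBlocks.getWorkers()  # workers used by the CSB library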
That is correct. The figure on the README compares against MKLSparse, which is also multithreaded.
One additional advantage of CSB is that multiplication with A does not suffer a longer latency than multiplication with the transposed matrix. This symmetric performance eliminates the need to keep an additional copy in a different layout (CSR alongside CSC) to close the speed gap, which would double the memory consumption.
julia> using LinearAlgebra, SparseArrays, BenchmarkTools, CompressedSparseBlocks, ThreadedSparseArrays
julia> n = 2^22; d = 10; A = sprand(n,n,d/n); x = rand(n);
julia> Threads.nthreads()
10
julia> CompressedSparseBlocks.getWorkers()
10
# Regular A * x (no threading)
julia> @btime $(A) * $x;
796.632 ms (2 allocations: 32.00 MiB)
# SparseMatrixCSB (10 threads)
julia> @btime $(SparseMatrixCSB(A)) * $x;
102.310 ms (2 allocations: 32.00 MiB)
# ThreadedSparseMatrixCSC (10 threads)
julia> @btime $(ThreadedSparseMatrixCSC(A)) * $x;
795.302 ms (20 allocations: 32.00 MiB)
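For contrast, here is a sketch of the usual CSC workaround that this symmetric performance avoids: materializing a second copy in the transposed layout so the slow direction can go through the fast transposed kernel, at the cost of doubling the memory:
julia> At = sparse(transpose(A));  # second copy, transposed (CSR-like) layout
julia> y2 = transpose(At) * x;     # computes A * x via the faster transposed kernel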
The text below the figure says MKL but the legend says differently, so I'm not sure I understand the relative performance compared to MKL.
Thank you for your interest in the package!
Each plot shows the relative speedup (in wall-clock execution time) of 3 operations
- CSC transp. (via MKL): A' * x
- CSB (CompressedSparseBlocks.jl): Acsb * x
- CSB transp. (CompressedSparseBlocks.jl): Acsb' * x
with respect to CSC (via MKL): A * x.
The different plots correspond to different densities (d) and numbers of right-hand-side vectors (RHS).
The environment and the script for generating the figure are committed under benchmarks/run_benchmarks.jl.
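For example, a multi-vector product corresponding to one of the RHS panels would look like this (a sketch; it assumes SparseMatrixCSB supports dense-matrix right-hand sides, as the RHS panels suggest):
julia> Acsb = SparseMatrixCSB(A);
julia> X = rand(n, 4);   # 4 right-hand sides
julia> Y = Acsb * X;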
We will add more comments to the README to clarify these points.