[ANN] Announcing ThreadedSparseCSR.jl

ThreadedSparseCSR.jl provides multithreaded sparse CSR matrix-vector multiplication in Julia.

The CSR matrix format is implemented in the Julia package SparseMatricesCSR.jl, which must be installed for this package to work.

The package exports the functions:

  • tmul!(y, A, x, [alpha], [beta]): 5-argument (y = alpha*A*x + beta*y) and 3-argument (y = A*x) in-place multithreaded versions of mul!, using Base.Threads (via @spawn)
  • tmul(A, x): multithreaded version of A*x, using Base.Threads (via @spawn)
  • bmul!(y, A, x, [alpha], [beta]): 5-argument (y = alpha*A*x + beta*y) and 3-argument (y = A*x) in-place multithreaded versions of mul!, using Polyester.jl (via @batch)
  • bmul(A, x): multithreaded version of A*x, using Polyester.jl (via @batch)
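
A quick, minimal illustration of this API (a sketch; the matrix and vectors are just placeholders):

using SparseArrays, SparseMatricesCSR, ThreadedSparseCSR

A = sparsecsr([1, 2, 3], [1, 3, 2], [1.0, 2.0, 3.0], 3, 3) # small sparse CSR matrix
x = rand(3)
y = zeros(3)

tmul!(y, A, x)            # y = A*x, threaded with Base.Threads.@spawn
tmul!(y, A, x, 2.0, 1.0)  # y = 2.0*A*x + 1.0*y
y2 = bmul(A, x)           # y2 = A*x, threaded with Polyester.@batch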

It is possible to override * and mul! with their multithreaded versions. This is done using the function:

ThreadedSparseCSR.multithread_matmul(PolyesterThreads())

which overrides * and mul! with bmul and bmul!, respectively;

ThreadedSparseCSR.multithread_matmul(BaseThreads())

which overrides * and mul! with tmul and tmul!, respectively;

ThreadedSparseCSR.multithread_matmul()

which, by default, overrides * and mul! with bmul and bmul!, respectively.
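
After one of these calls, the usual operators dispatch to the threaded kernels for CSR matrices. A minimal sketch of the effect (per the description above):

using LinearAlgebra, SparseArrays, SparseMatricesCSR, ThreadedSparseCSR

csrA = sparsecsr([1, 2], [1, 2], [1.0, 2.0], 2, 2)
x = rand(2)
y = zeros(2)

ThreadedSparseCSR.multithread_matmul()  # default: Polyester backend
csrA * x          # now runs bmul under the hood
mul!(y, csrA, x)  # now runs bmul!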

It is also possible to change the number of threads that are used, using the function:

ThreadedSparseCSR.set_num_threads(4)

The number of threads currently being used is obtained via:

ThreadedSparseCSR.get_num_threads()

By default, ThreadedSparseCSR.get_num_threads()==Threads.nthreads().
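
For example, in a session started with 8 Julia threads, one would expect (illustrative REPL output, assuming the default):

julia> Threads.nthreads()
8

julia> ThreadedSparseCSR.get_num_threads()
8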

Some benchmarks

Let us compare the performance of the multithreaded sparse CSR matrix-vector multiplication implemented in this package with the non-threaded version and with the multithreaded sparse CSC matrix-vector multiplication provided by MKLSparse.jl (for both non-transposed and transposed matrices).

versioninfo()
Julia Version 1.6.3
Commit ae8452a9e0 (2021-09-23 17:34 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 7 PRO 4750G with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver2)
Environment:
  JULIA_NUM_THREADS = 8

Code for the benchmark (and to produce the plot with PyPlot.jl):

using LinearAlgebra, SparseArrays, SparseMatricesCSR, ThreadedSparseCSR
using MKLSparse # to enable multithreaded Sparse CSC Matrix-Vec multiplication
using BenchmarkTools, PyPlot

ThreadedSparseCSR.set_num_threads(Threads.nthreads()) # use all available Julia threads in the mat-vec multiplication

function benchmark_csr_mv(sizes, densities)
    
    times_csc = zeros(length(sizes), length(densities))
    times_csc_transpose = zeros(length(sizes), length(densities))
    times_csr_mul = zeros(length(sizes), length(densities))
    times_csr_bmul = zeros(length(sizes), length(densities))
    times_csr_tmul = zeros(length(sizes), length(densities))
    
    for (j, d) in enumerate(densities)
        for (i, n) in enumerate(sizes)
            num_nzs = floor(Int, n*n*d)
            rows = rand(1:n, num_nzs)
            cols = rand(1:n, num_nzs)
            vals = rand(num_nzs)
            
            cscA = sparse(rows, cols, vals, n, n)
            cscAt = transpose(cscA)
            csrA = sparsecsr(rows, cols, vals, n, n)
            
            x = rand(n)
            y1 = zeros(n)
            y2 = zeros(n)
            y3 = zeros(n)
            y4 = zeros(n)
            y5 = zeros(n)
            
            b_csc = @benchmark mul!($y1, $cscA, $x, true, false)
            times_csc[i, j] = minimum(b_csc).time/1000 # time in microseconds
            
            b_csc_transpose = @benchmark mul!($y2, $cscAt, $x, true, false)
            times_csc_transpose[i, j] = minimum(b_csc_transpose).time/1000 # time in microseconds
            
            b_csr_mul = @benchmark mul!($y3, $csrA, $x, true, false)
            times_csr_mul[i, j] = minimum(b_csr_mul).time/1000 # time in microseconds
            
            b_csr_bmul = @benchmark bmul!($y4, $csrA, $x, true, false)
            times_csr_bmul[i, j] = minimum(b_csr_bmul).time/1000 # time in microseconds
            
            b_csr_tmul = @benchmark tmul!($y5, $csrA, $x, true, false)
            times_csr_tmul[i, j] = minimum(b_csr_tmul).time/1000 # time in microseconds
            
        end
    end
    
    return times_csc, times_csc_transpose, times_csr_mul, times_csr_bmul, times_csr_tmul
    
end

sizes = [1_000, 5_000, 10_000, 50_000, 100_000]
densities = [0.01, 0.001, 0.0001]

times_csc, times_csc_transpose, times_csr_mul, times_csr_bmul, times_csr_tmul = benchmark_csr_mv(sizes, densities)

f, ax = subplots(1, 3, figsize = (13, 5))

for (i, d) in enumerate(densities)
    ax[i].loglog(sizes, times_csc[:, i], marker = "v", label = "MKLSparse, CSC")
    ax[i].loglog(sizes, times_csc_transpose[:, i], marker = "^", label = "MKLSparse, transpose(CSC)")
    ax[i].loglog(sizes, times_csr_mul[:, i], marker = "h", label = "non-threaded mul (CSR)")
    ax[i].loglog(sizes, times_csr_bmul[:, i], marker = "s", label = "bmul (CSR)")
    ax[i].loglog(sizes, times_csr_tmul[:, i], marker = "o", label = "tmul (CSR)")
    
    ax[i].set_title("non-zero density = $(d)")
    ax[i].set_xlabel("matrix size")
    ax[i].set_ylabel("minimum time [μs]")
    ax[i].set_xticks(sizes)
    ax[i].set_xticklabels(sizes)
end

legend()
tight_layout()

savefig("benchmark_csr_matvec.png", dpi = 300)

Code for the benchmark (and to produce the plot with Plots.jl):

using LinearAlgebra, SparseArrays, SparseMatricesCSR, ThreadedSparseCSR
using MKLSparse # to enable multithreaded Sparse CSC Matrix-Vec multiplication
using BenchmarkTools, Plots

ThreadedSparseCSR.set_num_threads(Threads.nthreads()) # use all available Julia threads in the mat-vec multiplication

function benchmark_csr_mv(sizes, densities)
    
    times_csc = zeros(length(sizes), length(densities))
    times_csc_transpose = zeros(length(sizes), length(densities))
    times_csr_mul = zeros(length(sizes), length(densities))
    times_csr_bmul = zeros(length(sizes), length(densities))
    times_csr_tmul = zeros(length(sizes), length(densities))
    
    for (j, d) in enumerate(densities)
        for (i, n) in enumerate(sizes)
            num_nzs = floor(Int, n*n*d)
            rows = rand(1:n, num_nzs)
            cols = rand(1:n, num_nzs)
            vals = rand(num_nzs)
            
            cscA = sparse(rows, cols, vals, n, n)
            cscAt = transpose(cscA)
            csrA = sparsecsr(rows, cols, vals, n, n)
            
            x = rand(n)
            y1 = zeros(n)
            y2 = zeros(n)
            y3 = zeros(n)
            y4 = zeros(n)
            y5 = zeros(n)
            
            b_csc = @benchmark mul!($y1, $cscA, $x, true, false)
            times_csc[i, j] = minimum(b_csc).time/1000 # time in microseconds
            
            b_csc_transpose = @benchmark mul!($y2, $cscAt, $x, true, false)
            times_csc_transpose[i, j] = minimum(b_csc_transpose).time/1000 # time in microseconds
            
            b_csr_mul = @benchmark mul!($y3, $csrA, $x, true, false)
            times_csr_mul[i, j] = minimum(b_csr_mul).time/1000 # time in microseconds
            
            b_csr_bmul = @benchmark bmul!($y4, $csrA, $x, true, false)
            times_csr_bmul[i, j] = minimum(b_csr_bmul).time/1000 # time in microseconds
            
            b_csr_tmul = @benchmark tmul!($y5, $csrA, $x, true, false)
            times_csr_tmul[i, j] = minimum(b_csr_tmul).time/1000 # time in microseconds
            
        end
    end
    
    return times_csc, times_csc_transpose, times_csr_mul, times_csr_bmul, times_csr_tmul
    
end

sizes = [1_000, 5_000, 10_000, 50_000, 100_000]
densities = [0.01, 0.001, 0.0001]

times_csc, times_csc_transpose, times_csr_mul, times_csr_bmul, times_csr_tmul = benchmark_csr_mv(sizes, densities)

p = Any[]

for (i, d) in enumerate(densities)
    pl = plot(sizes, times_csc[:, i], marker = :dtriangle, 
        size = (250, 400), label = "MKLSparse, CSC", legend = :topleft)
    plot!(pl, sizes, times_csc_transpose[:, i], marker = :utriangle, label = "MKLSparse, transpose(CSC)")
    plot!(pl, sizes, times_csr_mul[:, i], marker = :hexagon, label = "non-threaded mul (CSR)")
    plot!(pl, sizes, times_csr_bmul[:, i], marker = :rect, label = "bmul (CSR)")
    plot!(pl, sizes, times_csr_tmul[:, i], marker = :circle, label = "tmul (CSR)")
    
    xaxis!(pl, "matrix size", :log10)
    yaxis!(pl, "minimum time [μs]", :log10)
    title!(pl, "non-zero density = $(d)")
    xticks!(pl, sizes, [string(i) for i in sizes])
    
    push!(p, pl)
    
end

plot(p[1], p[2], p[3], layout = (1, 3), size = (1200, 400))

savefig("benchmark_csr_matvec.png.png")

Example Usage

using SparseArrays, SparseMatricesCSR, ThreadedSparseCSR
using BenchmarkTools

ThreadedSparseCSR.set_num_threads(Threads.nthreads()) # we must set the number of threads beforehand; here we use all available Julia threads in the mat-vec multiplication

m, n = 1_000, 1_000
d = 0.01
num_nzs = floor(Int, m*n*d)
rows = rand(1:m, num_nzs)
cols = rand(1:n, num_nzs)
vals = rand(num_nzs)

cscA = sparse(rows, cols, vals, m, n)
csrA = sparsecsr(rows, cols, vals, m, n)
x = rand(n)

y1 = rand(n)
y2 = copy(y1)
y3 = copy(y1)


@btime mul!($y1, $csrA, $x, true, false) # non-threaded version
@btime bmul!($y2, $csrA, $x, true, false) # multithreaded version using Polyester.@batch
@btime tmul!($y3, $csrA, $x, true, false) # multithreaded version using Base.Threads.@spawn

ThreadedSparseCSR.multithread_matmul()
@btime mul!($y1, $csrA, $x, true, false) # multithreaded version using Polyester.@batch

ThreadedSparseCSR.multithread_matmul(BaseThreads())
@btime mul!($y1, $csrA, $x, true, false) # multithreaded version using Base.Threads.@spawn

ThreadedSparseCSR.multithread_matmul(PolyesterThreads())
@btime mul!($y1, $csrA, $x, true, false) # multithreaded version using Polyester.@batch


# We can later change the number of threads used:
ThreadedSparseCSR.get_num_threads()
ThreadedSparseCSR.set_num_threads(4)
@btime mul!($y1, $csrA, $x, true, false) # multithreaded version using Polyester.@batch

Acknowledgments

This package was influenced and inspired by:

I hope people find this useful!

EDIT:

  1. Added benchmark code, which uses Plots.jl instead of PyPlot.jl (which has problems in Julia 1.7).
  2. Added a line to set the number of threads to Threads.nthreads() in all codes. This was supposed to be the default, but it seems that ThreadedSparseCSR.get_num_threads() gets stuck at the value Threads.nthreads() had when the package was first precompiled.
  3. In the patch version 0.1.1, the number of used threads is automatically set to the number of Julia threads, when the package is loaded.
13 Likes

This seems cool! Can you say a few words about what situations someone should reach for this package in? Should it be used by anyone who wants sparse matrix multiplications to go faster? When should it be used instead of the built in CSC? What are the relevant differences? Etc.

This package should be useful in iterative methods for solving linear systems or eigenproblems.

As far as I understand, the CSR format is more efficient for the mat-vec product than CSC, especially for multithreading (as shown in the plot comparing MKL CSC vs MKL transpose(CSC)). So for any application based on mat-vec products, using SparseMatricesCSR.jl together with this package should be a good fit.
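
The intuition: in CSR, each output entry y[row] depends only on a contiguous slice of the stored values, so rows can be split across threads with no write conflicts; a parallel CSC kernel must instead scatter partial results into shared entries of y. A minimal row-parallel sketch (not this package's actual kernel; assumes the nzrange/colvals/nonzeros accessors of SparseMatricesCSR.jl):

using SparseMatricesCSR
using Base.Threads: @threads

function csr_matvec_threaded!(y, A::SparseMatrixCSR, x)
    nzv = nonzeros(A)
    cols = colvals(A)
    @threads for row in 1:size(A, 1)
        acc = zero(eltype(y))
        for k in nzrange(A, row)  # nonzeros of this row are stored contiguously
            acc += nzv[k] * x[cols[k]]
        end
        y[row] = acc              # each thread writes disjoint rows of y
    end
    return y
end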

1 Like

If other people could run the benchmark on their computer and share the results that would be great!

I am particularly interested in the comparison against MKLSparse.jl on Intel machines and on other generations of AMD processors (Zen 1 and Zen 3).

2 Likes

I ran this benchmark using Julia 1.7 and 32 threads: julia -O3 --check-bounds=no -t32. As I do not have PyPlot installed (it is not working, to be honest), I will give you the numeric outputs.

versioninfo()

Commit f23fc0d27a* (2021-10-20 12:45 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: AMD Ryzen Threadripper 3970X 32-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, znver2)

julia> times_csc
5×3 Matrix{Float64}:
    18.857    15.644    14.178
    73.824    51.125    41.207
   206.246    87.513    70.053
  7795.23    715.401   403.064
 32161.9    3607.18   1012.51
julia> times_csc_transpose
5×3 Matrix{Float64}:
     6.8166     6.4394    7.089
    16.064      8.35767   7.869
    44.001     11.314     8.218
  7021.16     114.472    23.118
 28261.9     2207.59     63.836

julia> times_csr_mul
5×3 Matrix{Float64}:
     8.68367        1.4667     1.1734
   168.391         39.322      8.381
  1055.54         107.837     37.436
 46545.5         5454.24     459.846
     2.22207e5  21504.3     2034.88

julia> times_csr_bmul
5×3 Matrix{Float64}:
     3.1255     2.204      2.09533
    25.003      7.19375    3.13425
    85.139     15.575      5.20333
  7540.31     301.372     56.782
 29627.4     2205.22     166.994
julia> times_csr_tmul
5×3 Matrix{Float64}:
    16.552    18.509   16.553
    68.166    29.753   14.597
   163.712    54.338   27.379
  7016.56    528.99   122.574
 28895.4    2985.01   247.943
1 Like

Nicely done @Bruno_Amorim !

My results using an i7 MacBook Pro:

julia> times_csc
5×3 Matrix{Float64}:
     5.368     3.538    3.21338
    72.009    19.329    8.584
   317.631    52.429   17.472
 17536.3    1393.88   211.075
 74044.0    7970.42   785.862
julia> times_csc_transpose
5×3 Matrix{Float64}:
     2.55489     1.4975    1.2155
    44.296       7.562     3.56367
   232.293      26.924     7.82225
 17405.6      1045.4     164.356
 72842.3      7149.32    571.99
julia> times_csr_mul
5×3 Matrix{Float64}:
     9.788          1.9144     1.4889
   201.339         43.974     15.611
  1077.84         119.3       56.444
 35855.4         3654.65     495.058
     1.64171e5  25176.6     1955.39
julia> times_csr_bmul
5×3 Matrix{Float64}:
    10.303          1.7151     1.2746
   192.569         40.837     12.591
  1063.45         104.774     49.154
 35598.3         3560.87     444.183
     1.66869e5  26005.9     1894.56
julia> times_csr_tmul
5×3 Matrix{Float64}:
    28.104        18.232    17.218
   217.532        65.618    33.606
  1092.86        142.145    76.21
 36700.2        3661.0     548.611
     1.6493e5  23094.2    2045.93
julia> versioninfo()
Julia Version 1.7.0-rc1
Commit 9eade6195e (2021-09-12 06:45 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.6.0)
  CPU: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code

So it seems (unsurprisingly) that the comparative advantage of MKL is still stronger on Intel hardware, despite reports that MKL had begun to support Zen architectures.

1 Like

Thank you for running the benchmark @CodeLenz! There appears to be a problem with PyPlot in Julia 1.7.

I took the liberty of producing the plots from your timings:

versioninfo()

Commit f23fc0d27a* (2021-10-20 12:45 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: AMD Ryzen Threadripper 3970X 32-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, znver2)
1 Like

Thank you @pablosanjose for trying this on your machine. Below I show the plot produced from your data:

julia> versioninfo()
Julia Version 1.7.0-rc1
Commit 9eade6195e (2021-09-12 06:45 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.6.0)
  CPU: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code

The weirdest thing to my eyes is that there is hardly any gain between times_csr_mul (which uses the default non-threaded mul! exported by SparseMatricesCSR.jl) and times_csr_bmul (which is multithreaded).

Any idea of what might be going on?

Uhm, it might be that I didn’t set the number of threads? Will try later…

1 Like

Since PyPlot seems not to be working in 1.7, I have added a version of the benchmark code that produces the same plot with Plots.jl. I am not very good with Plots.jl, but the plots produced should be close enough to the PyPlot ones.

Here are the results starting Julia with julia -t 4 (4 threads, one per core in my Intel i7). They are very impressive!

By the way, starting like this, Threads.nthreads() == 4 but ThreadedSparseCSR.get_num_threads() == 1, which is unintended according to your post. I had to do ThreadedSparseCSR.set_num_threads(4).

julia> times_csc
5×3 Matrix{Float64}:
     5.52967     3.58575    3.43275
    60.365      19.796      8.848
   273.243      43.523     17.355
 17152.3      1245.45     187.393
 75440.2      7654.11     698.591
julia> times_csc_transpose
5×3 Matrix{Float64}:
     2.94822     1.4767    1.2294
    43.505       7.5904    3.00267
   212.542      24.474     7.60025
 16140.1       950.387   151.934
 67185.0      7013.24    525.274
julia> times_csr_mul
5×3 Matrix{Float64}:
     9.91           1.95089     1.4712
   201.058         44.019      15.546
  1077.71         119.358      57.592
 35901.3         3678.85      495.329
     1.73359e5  24148.3      1937.85
julia> times_csr_bmul
5×3 Matrix{Float64}:
     3.7035     0.918679    0.744261
    62.031     15.692       3.08637
   287.894     41.742       9.989
 16198.6     1065.2       174.091
 68232.1     6457.15      551.48
julia> times_csr_tmul
5×3 Matrix{Float64}:
    15.536     4.25329    2.98871
    75.322    20.351     16.797
   285.905    53.267     22.335
 16394.3    1039.09     161.025
 69065.1    7003.6      597.733
2 Likes

Indeed, those results look quite competitive! (I am also surprised that for MKL, mat*vec is so close to transpose(mat)*vec.)

Thanks for pointing that out!
Hmm… It seems that ThreadedSparseCSR.get_num_threads() gets stuck at the value Threads.nthreads() had when ThreadedSparseCSR was first precompiled. In my case, I get ThreadedSparseCSR.get_num_threads() == 8, even when I start Julia with a single thread.
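
The pitfall can be reproduced with a toy module (hypothetical names): a value computed at the top level of a module is evaluated during precompilation and serialized into the cache, while an __init__ function runs again on every load:

module PitfallDemo

# evaluated during precompilation; the result is serialized into the cache,
# so it keeps the thread count of the session that precompiled the package
const baked_in = Ref(Base.Threads.nthreads())

# __init__ runs on every `using PitfallDemo`, so it sees the current session
const at_load = Ref(0)
__init__() = (at_load[] = Base.Threads.nthreads())

end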

This was not the intended behaviour! I will update the benchmark and example usage codes with a line ThreadedSparseCSR.set_num_threads(Threads.nthreads()). But in a future version, I will try to make this automatic (which was my original intention).

You could make

const matmul_num_threads = Ref{Union{Nothing, Int64}}(nothing)

and write

function get_num_threads()
    if isnothing(matmul_num_threads[])
        matmul_num_threads[] = Threads.nthreads()
    end
    return matmul_num_threads[]
end

This way, the lookup would only happen the first time get_num_threads gets called. Otherwise, you could probably put the matmul_num_threads assignment into the package’s __init__ function.
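
For completeness, a matching setter in this sketch could simply write to the same Ref (hypothetical, mirroring the getter above):

function set_num_threads(n::Integer)
    matmul_num_threads[] = n
    return n
end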

2 Likes

Yeah, I noticed that some time back after a comment by @kristoffer.carlsson in a thread. There is some magic going on there. I think OpenBLAS doesn’t behave like that (but I might be wrong, I didn’t check recently).

Thank you for your suggestion, @carstenbauer !

That sounds like what I would want to do, yes! But I confess my ignorance regarding what a package’s __init__ function is. Guess I will have to learn about that.

Defining an __init__() function as:

const matmul_num_threads = Ref(1)

function __init__()
    set_num_threads(Threads.nthreads())
end

does the trick! I will make a new release with this fix. Thank you once again @carstenbauer !

1 Like

The new patch version 0.1.1 includes this fix!

1 Like