[ANN] ThreadedDenseSparseMul.jl


Threaded implementation of dense-sparse matrix multiplication, built on top of Polyester.jl.

Hello all, I’ve recently had a need to do C \leftarrow C - D * S very fast, where D and S are dense and sparse matrices, respectively. However:

  • The SparseArrays.jl package doesn’t support threaded multiplication.
  • The IntelMKL.jl package doesn’t seem to support dense sparsecsc multiplication, although one can get similar performance using that package and transposing appropriately. It also comes with possible licensing issues and is vendor-specific.
  • The ThreadedSparseCSR.jl package also just supports sparsecsr dense.
  • The ThreadedSparseArrays.jl package also just supports ThreadedSparseMatrixCSC dense, and also doesn’t install for me currently.

I haven’t found an implementation for that, so made one myself. In fact, the package Polyester.jl makes this super easy, the entire code is basically

import SparseArrays: SparseMatrixCSC, mul!; import SparseArrays
import Polyester: @batch

function SparseArrays.mul!(C::AbstractMatrix, A::AbstractMatrix, B::SparseMatrixCSC, α::Number, β::Number)
    @batch for j in axes(B, 2)
        C[:, j] .*= β
        C[:, j] .+= A * (α.*B[:, j])
    return C

and therefore 95% of the credit should go to @Elrod .

However, this simple implementation beats SparseArrays.jl by about 2x (or even 3-4x on a good day) on my problem sizes, and significantly outperforms MKLSparse.jl, which only supports sparse x dense and therefore has to use transposes. See some more benchmark results in the Readme.

This package is currently performing type piracy only for the dispatch above. I’ve considered wrapping the computation into another type, but actually I believe it’s fine to load this package just like one would load MKLSparse.jl etc, see some discussion and here and here. However, I’m happy to hear other thoughts.

I hope this package will be useful to others!