Double checking there is no multi-threaded dense x sparse matmul implementation

I need very efficient dense × sparse matmul code, but it seems there is no threaded implementation of this available (only sparse × dense). I therefore wrote my own version, which achieves about a 3x speedup over SparseArrays.jl (which is single-threaded), and will register it soon. However, I wanted to double-check that I didn’t miss anything!

The packages I have considered (see Readme) are:

  • The SparseArrays.jl package doesn’t support threaded multiplication.
  • The IntelMKL.jl package doesn’t seem to support dense × sparse CSC multiplication, although one can get similar performance using that package and transposing appropriately. It also comes with possible licensing issues and is vendor-specific.
  • The ThreadedSparseCSR.jl package also only supports sparse CSR × dense.
  • The ThreadedSparseArrays.jl package also only supports transposed ThreadedSparseMatrixCSC
    × dense, and also doesn’t install for me currently.
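
For context, the core idea behind such a threaded implementation is simple: in a dense × SparseMatrixCSC product, each column of the result depends on exactly one column of the sparse matrix, so columns can be processed in parallel without locks. A minimal sketch of the approach (my own illustration, not the actual registered code) could look like:

```julia
using SparseArrays

# Sketch of a threaded dense × sparse-CSC product C = A * B.
# Each thread writes to a disjoint set of columns of C, so no locking is needed.
function tdsmul!(C::Matrix{T}, A::Matrix{T}, B::SparseMatrixCSC{T}) where {T}
    fill!(C, zero(T))
    Threads.@threads for j in 1:size(B, 2)
        for k in nzrange(B, j)            # nonzeros in column j of B
            i = rowvals(B)[k]
            v = nonzeros(B)[k]
            @views C[:, j] .+= v .* A[:, i]   # accumulate column of A scaled by B[i, j]
        end
    end
    return C
end

A = rand(100, 50); B = sprand(50, 80, 0.1); C = zeros(100, 80)
tdsmul!(C, A, B)
C ≈ A * Matrix(B)  # agrees with the dense reference product
```

Since the inner accumulation walks contiguous columns of both `A` and `C`, this plays nicely with Julia’s column-major storage.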

In the plot, is StaticArrays supposed to be SparseArrays?

Also, overloading mul! like that is type piracy, unfortunately.

One option to try is @willow’s Finch.jl. I tried to give it a go with

using Finch
using SparseArrays  # for sprand

A = Fiber!(Dense(Dense(Element(0.0))))
B = Fiber!(Dense(SparseList(Element(0.0))))
C = Fiber!(Dense(Dense(Element(0.0))))

def = @finch_kernel function finch_mul!(C, A, B)
    C .= 0
    for col = _, j = _, i = _
        C[i, col] += A[i, j] * B[j, col]
    end
end
eval(def)

function fm!(C, A, B)
    A = fiber(A)
    B = fiber(B)
    C = fiber!(C)

    finch_mul!(C, A, B)
    return C
end

# from the ThreadedDenseSparseMul.jl benchmarks
N, K, M = 1_000, 2_000, 30_000
A = rand(N, K); B = sprand(K, M, 0.05); C = similar(A, (N, M))
fm!(C, A, B)

which indeed seems to do something, but it seemed a little slower than regular mul! (~0.8 s vs ~0.7 s), and I couldn’t figure out how to convert C back into a regular Matrix (and thus couldn’t check that it gives the same answer as mul!).

BTW, according to Finch’s note on benchmarking it would be better to use MatrixDepot matrices rather than sprand in order to have more typical sparsity patterns.

Thanks for this package! How does it compare with MKLSparse.jl, if you use the transpose?
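
For anyone curious, the transpose trick referred to here is just computing the sparse × dense product that MKLSparse does accelerate and transposing back, using the identity (A·B)ᵀ = Bᵀ·Aᵀ. A hedged sketch (the materialized copies are my own choice, to make sure the fast sparse × dense path is hit):

```julia
using SparseArrays, LinearAlgebra

# Goal: C = A * B with A dense (N×K) and B sparse (K×M).
A = rand(100, 50)
B = sprand(50, 80, 0.1)

At = Matrix(transpose(A))       # K×N dense copy
Bt = sparse(transpose(B))       # M×K sparse CSC copy
# Bt * At is a sparse × dense product; with MKLSparse loaded it would
# dispatch to MKL's threaded kernel. Transpose back to get A * B.
C = Matrix(transpose(Bt * At))  # N×M

C ≈ A * Matrix(B)  # same result, up to floating-point rounding
```

The extra transposed copies cost memory and bandwidth, which is presumably part of why a native dense × sparse kernel can still win.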

Recently I also had to roll my own version of multithreaded sparse CSC × dense matrix multiplication. On the one hand, it is really cool, and it illustrates the power of Julia and its ecosystem, that one can roll their own version of routines that are competitive with more established libraries with minimal effort. On the other hand, it would be nice if these multithreaded versions were included in Julia Base, or at least consolidated into a single package.

Maybe this is one of the cases when one should “be the change you want to see in the world”.

Thanks for the feedback.

@ericphanson sorry for the silly typo, of course I meant comparing against SparseArrays.jl, not StaticArrays.jl (somehow my mind interchanges these – in fact I typo’d it again just now in this answer).

I wasn’t aware of Finch.jl – it looks super interesting! I might get back to that if I have time. However, I do like the simplicity of my solution (where 99% of the credit goes to @Elrod of course).

The matrix sizes and densities came from my own problem, but I agree that if this gets used more widely, more thorough benchmarking, e.g. with MatrixDepot.jl, would be good – thanks for the pointer.

@Bruno_Amorim I have now included benchmarks against MKLSparse.jl, which we outperform significantly, likely due to the column-major storage of Julia.

@ericphanson On the topic of type piracy: as far as I understand, type piracy should be fine in this case, even if other libraries use this function. Quoting Jeff Bezanson:

Right, type piracy is defining function f(x::T) where you “own” neither f nor T. If f is your own function, or T is a type you defined, there’s no issue at all. But even if neither is true, it still might be OK. It’s just a smell that something bad might be happening, such as:

  • The author/designer of f really did not want it to support type T, so you’re misunderstanding what the function is supposed to mean.
  • The code is in the wrong place, and should be moved to where f or T is defined.

But if everything seems to be in order, pirate away.

Of course in this case there might be many other packages using dense × sparse mul, but if (let’s imagine for a second this to be true) my implementation is IEEE compliant and the threading is safe even when nested inside other threaded code, then I should be fine pirating the mul! function. In fact, I believe other packages like MKLSparse.jl do the same.

Happy to hear your thoughts, however.

Yeah, this is definitely one of the more benign forms of type piracy, where you are computing the same thing and should be coming up with the same answer, only faster. However, type piracy still has a couple problems:

  1. It causes invalidations, meaning code elsewhere (that maybe has already been compiled) needs to be recompiled. This contributes to latency running functions for the first time (or the first time after loading your package).
  2. It is non-compositional. Only one method will be executed in the end, so if there are two pirates of the same function, which one gets run might depend on package load order, which can be confusing/problematic.
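
To illustrate point 2 with a toy example (nothing to do with mul! – pirating `+` on strings is just a classic, harmless demonstration), the last pirate evaluated silently wins:

```julia
module PirateA
# Type piracy: we own neither `+` nor AbstractString.
Base.:+(a::AbstractString, b::AbstractString) = string(a, b)
end

module PirateB
# A second pirate of the exact same signature: this definition
# overwrites PirateA's, with only a method-overwrite warning.
Base.:+(a::AbstractString, b::AbstractString) = string(b, a)  # note: reversed!
end

# Only one entry survives in the method table; which behavior you get
# depends on which module was evaluated last.
"ab" + "cd"  # here: "cdab", from PirateB
```

In a real session the two pirates would live in separate packages, and the winner would be determined by load order rather than by anything visible in your code.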

I agree, MKLSparse is doing the same.

In my opinion, packages like this are fine but should not be dependencies of other packages, and only loaded by the end-user or application. That way, the user gets to be the one to make the choice of which fast-threaded-spmv they want (by choosing which package to load). And since there can only be 1, it does not seem fair for a library to make that choice for a user.

I feel the same about MKL.jl, AppleAccelerate.jl, etc. – they can be great, but you can only choose one BLAS vendor, and it should be up to the user to choose, since they know their hardware and what they are executing.


Thanks for trying out the Finch package. I think we’ve fixed a lot of these issues; you should now be able to call SparseMatrixCSC on a Finch fiber to convert back. Note that the new constructor for Finch tensors is called Tensor, and it mirrors the Base Array constructor but with an additional lvl storage argument. Let me know if you run into anything! Finch isn’t multithreaded by default, so I wouldn’t be surprised that mul! performs similarly in this case, as Finch is generating similar code.
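
Based on that description (and with the caveat that the exact constructor signature may differ between Finch versions), the round trip could look something like:

```julia
using Finch, SparseArrays

B = sprand(10, 10, 0.3)

# New-style constructor: Tensor(lvl, data) instead of Fiber!(...)
Bt = Tensor(Dense(SparseList(Element(0.0))), B)

# Convert back to a SparseArrays type to check against the mul! result
B2 = SparseMatrixCSC(Bt)
B2 == B
```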
