(I’m on Julia 0.6.2)
julia> A = sprand(100,100,0.01);
julia> B = rand(100,100);
julia> using BenchmarkTools
julia> @btime $C = $A*$B;
40.106 μs (2 allocations: 78.20 KiB)
julia> @btime $C = $B*$A; # other way around
42.666 μs (2 allocations: 78.20 KiB) # similarly fast
julia> @btime A_mul_B!($C, $A, $B); # in-place version
34.560 μs (0 allocations: 0 bytes) # faster (as expected)
julia> @btime A_mul_B!($C, $B, $A); # other way around
2.118 ms (6 allocations: 336 bytes) # much slower! (unexpected)
Questions:
- Why is the speed of `A_mul_B!` so asymmetric?
- How can I do the multiplication `B*A` in-place?
My guesses:
- I guess this might be related to the CSC format, is that right? I found out (via `@edit`; see also the `@which` check below) that Julia dispatches `A*B` to `A_mul_B!`, but `B*A` to `(*)(X::StridedMatrix{TX}, A::SparseMatrixCSC{TvA,TiA}) where {TX,TvA,TiA}`, in which it allocates `Y = zeros(promote_type(TX,TvA), mX, A.n)`.
- Based on my guess for 1), I assume the question is basically whether there is a version of the `*` implementation mentioned in 1) that takes a preallocated `Y`.
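For reference, a quick way to confirm which method each product dispatches to (output omitted here, since the printed signatures vary across Julia versions):
julia> @which A*B   # the specialized sparse-times-dense method
julia> @which B*A   # the allocating dense-times-sparse fallback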
Thanks for any comments!
Yes. SparseMatrixCSC is only good when iterating down columns; it is terrible at iterating across rows. For matrix multiplication `A*B`, you use the columns of `A` and the rows of `B`. While `B` is column-major when dense, you still know exactly where that next piece of memory is, since everything is the same size and lined up for an `Array`. For a `SparseMatrixCSC`, you have to do a search to find out how far down in the next row that value would be, so it’s a much more costly operation.
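To make that concrete, here is a small sketch of the three arrays behind a `SparseMatrixCSC` (the matrix `S` is made up for illustration):
using SparseArrays   # Julia ≥ 0.7; on 0.6 these names live in Base

S = sparse([1, 3, 2], [1, 1, 3], [10.0, 20.0, 30.0], 3, 3)

S.colptr   # column j's entries sit at positions colptr[j]:(colptr[j+1]-1)
S.rowval   # the row index of each stored entry
S.nzval    # the stored values themselves

# Walking down a column touches one contiguous slice of rowval/nzval -- cheap:
for j in 1:size(S, 2), k in nzrange(S, j)
    i, v = S.rowval[k], S.nzval[k]   # this is the entry S[i, j]
end
# There is no analogous index for walking along a row: every column would
# have to be searched, which is why row-wise access is so costly.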
SparseMatrixCSC is a great format for `A*v`, since that only uses the columns. It’s terrible for `u*A`, and it can be much better to just create a transposed copy of the sparse matrix if you need it (and Julia supports that), as sketched below.
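For example, one way to route `B*A` through the fast column-wise kernel is to materialize the transpose up front (a sketch in Julia ≥ 0.7 syntax; on 0.6 write `A.'` instead of `copy(transpose(A))`). This pays off mainly when `At` can be reused:
At = copy(transpose(A))        # one-time cost: Aᵀ stored as a new SparseMatrixCSC
Ct = At * copy(transpose(B))   # fast sparse-times-dense product; Ct == (B*A)ᵀ
C  = copy(transpose(Ct))       # transpose back, so C == B*A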
The only problem here is that the in-place method for this combination is not implemented for sparse matrices, so it falls back to the generic one, which is slow.
You can easily fix this yourself by implementing that method, copy-pasting from the Base method:
function Base.A_mul_B!(Y::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    fill!(Y, 0)           # Y is accumulated into below, so zero it first
    rowval = A.rowval     # row index of each stored entry
    nzval = A.nzval       # the stored values
    # Walk down each column of A (cheap in CSC format) for every row of X:
    @inbounds for multivec_row = 1:mX, col = 1:A.n, k = A.colptr[col]:(A.colptr[col+1]-1)
        Y[multivec_row, col] += X[multivec_row, rowval[k]] * nzval[k]
    end
    Y
end
(edited to fill `Y` with 0)
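A quick sanity check (a sketch, using the matrices from the question) that the copied method agrees with the allocating fallback:
C = Matrix{Float64}(100, 100)   # uninitialized output (Julia 0.6 constructor)
A_mul_B!(C, B, A)
@assert C ≈ B*A                 # same result as the allocating B*A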
julia> @btime A_mul_B!($C, $B, $A); # other way around
28.732 μs (0 allocations: 0 bytes)
There is a PR up that implements these in-place methods and improves upon them: https://github.com/JuliaLang/julia/pull/24045/
Thanks for the comments! Nice to see this PR.
That’s exactly what I did. However, I had to add something like
z = zero(TY)   # TY is the element type of Y
@inbounds for i in eachindex(Y)
    Y[i] = z
end
because I can’t assume that Y is initialized with zeros (I’m applying `A_mul_B!` multiple times).
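(That zeroing loop can also be written as a one-liner, equivalent as long as the element type defines `zero`:)
fill!(Y, zero(eltype(Y)))   # same effect as the explicit loop above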
The issue still persists in Julia 1.0:
julia> @btime $C = $A*$B;
18.550 μs (2 allocations: 78.20 KiB)
julia> @btime $C = $B*$A; # other way around
20.097 μs (2 allocations: 78.20 KiB)
julia> @btime mul!($C,$A,$B);
14.531 μs (0 allocations: 0 bytes)
julia> @btime mul!($C,$B,$A);
825.506 μs (6 allocations: 336 bytes)
I hope this gets fixed soon.
For completeness, one can still fix it locally by defining
import LinearAlgebra.mul!
using SparseArrays   # SparseMatrixCSC lives here on Julia ≥ 0.7

function mul!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    fill!(C, zero(eltype(C)))   # C is accumulated into below, so zero it first
    rowval = A.rowval           # row index of each stored entry
    nzval = A.nzval             # the stored values
    @inbounds for multivec_row = 1:mX, col = 1:A.n, k = A.colptr[col]:(A.colptr[col+1]-1)
        C[multivec_row, col] += X[multivec_row, rowval[k]] * nzval[k]
    end
    C
end
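Equivalently, the loop can use the public accessors `rowvals`, `nonzeros`, and `nzrange` from SparseArrays instead of reaching into the struct fields; a functionally identical sketch:
function mul!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    size(X, 2) == A.m || throw(DimensionMismatch())
    fill!(C, zero(eltype(C)))
    rv, nz = rowvals(A), nonzeros(A)   # row indices and stored values of A
    @inbounds for row = 1:size(X, 1), col = 1:A.n, k in nzrange(A, col)
        C[row, col] += X[row, rv[k]] * nz[k]
    end
    C
end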
Benchmark:
julia> @btime $C = $A*$B;
19.478 μs (2 allocations: 78.20 KiB)
julia> @btime $C = $B*$A;
22.261 μs (2 allocations: 78.20 KiB)
julia> @btime mul!($C,$A,$B);
16.077 μs (0 allocations: 0 bytes)
julia> @btime mul!($C,$B,$A);
18.241 μs (0 allocations: 0 bytes)
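(As an aside: on Julia 1.3 and later, the five-argument `mul!(C, X, A, α, β)` computes `C = α*X*A + β*C` in place, which avoids the `fill!` when you want to accumulate; I believe SparseArrays provides it for this dense-times-sparse combination, but check on your version:)
mul!(C, B, A, 1.0, 1.0)   # in-place C = B*A + C, no intermediate allocation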
xref: https://github.com/JuliaLang/julia/issues/29956
UPDATE: Just in case someone finds this, an even faster version has been suggested here.