Asymmetric speed of in-place `sparse*dense` matrix product

(I’m on Julia 0.6.2)

julia> A = sprand(100,100,0.01);

julia> B = rand(100,100);

julia> using BenchmarkTools

julia> @btime $C = $A*$B;
  40.106 μs (2 allocations: 78.20 KiB)

julia> @btime $C = $B*$A; # other way around
  42.666 μs (2 allocations: 78.20 KiB) # similarily fast

julia> @btime A_mul_B!($C, $A, $B); # in-place version
  34.560 μs (0 allocations: 0 bytes) # faster (as expected)

julia> @btime A_mul_B!($C, $B, $A); # other way around
  2.118 ms (6 allocations: 336 bytes) # much slower! (unexpected)

Questions:

  1. Why is the speed of A_mul_B! so asymmetric?
  2. How can I do the multiplication B*A in-place?

My guesses:

  1. I guess this might be related to CSC format, is that right? I found out (@edit) that Julia dispatches A*B to A_mul_B! but B*A to (*)(X::StridedMatrix{TX}, A::SparseMatrixCSC{TvA,TiA}) where {TX,TvA,TiA} in which it allocates: Y = zeros(promote_type(TX,TvA), mX, A.n).
  2. Based on my guess for 1) I assume the question is basically if there is a version of the * implementation mentioned in 1) that takes a preallocated Y.

Thanks for any comments!

Yes. SparseMatrixCSC is only good when iterating down columns. It is terrible at iterating across rows. For matrix multiplication, A*B, you use the columns of A and the rows of B. While B is column-major when dense, you still know exactly where the that next piece of memory is since everything is the same size and lined up for an Array. For a SparseMatrixCSC, you have to do a search to find out how far down in the next row that value would be, so it’s a much more costly operation.

SparseMatrixCSC is a great format for A*v since that only uses the columns. It’s terrible for u*A, and it can be much better to just create a new version of the sparse matrix if you need it (and Julia supports it).

2 Likes

The only problem here is that the method that is slow is not implemented for sparse matrices so it falls back to the generic one that is slow.

You can easily fix this yourself by just implementing that method with copy pasting the Base method:

function Base.A_mul_B!(Y::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    fill!(Y, 0)
    rowval = A.rowval
    nzval = A.nzval
    @inbounds for multivec_row=1:mX, col = 1:A.n, k=A.colptr[col]:(A.colptr[col+1]-1)
        Y[multivec_row, col] += X[multivec_row, rowval[k]] * nzval[k]
    end
    Y
end

(edited to fill Y with 0)

julia> @btime A_mul_B!($C, $B, $A); # other way around
  28.732 μs (0 allocations: 0 bytes)

There is a PR up that does implement these in-place method and improve upon them: https://github.com/JuliaLang/julia/pull/24045/

2 Likes

Thanks for the comments! Nice to see this PR.

That’s exactly what I did. However, I had to add something like

    z = zero(TY)
    @inbounds for i in eachindex(Y)
      Y[i] = z
    end

because I can’t assume that Y is initalizied with zeros (I’m applying A_mul_B! multiple times).

Use fill!(Y, 0) instead.

The issue still persists in Julia 1.0:

julia> @btime $C = $A*$B;
  18.550 μs (2 allocations: 78.20 KiB)

julia> @btime $C = $B*$A; # other way around
  20.097 μs (2 allocations: 78.20 KiB)

julia> @btime mul!($C,$A,$B);
  14.531 μs (0 allocations: 0 bytes)

julia> @btime mul!($C,$B,$A);
  825.506 μs (6 allocations: 336 bytes)

I hope this gets fixed soon.

For completeness, one can still fix it locally by defining

import LinearAlgebra.mul!
function mul!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    fill!(C, zero(eltype(C)))
    rowval = A.rowval
    nzval = A.nzval
    @inbounds for multivec_row=1:mX, col = 1:A.n, k=A.colptr[col]:(A.colptr[col+1]-1)
        C[multivec_row, col] += X[multivec_row, rowval[k]] * nzval[k]
    end
    C
end

Benchmark:

julia> @btime $C = $A*$B;
  19.478 μs (2 allocations: 78.20 KiB)

julia> @btime $C = $B*$A;
  22.261 μs (2 allocations: 78.20 KiB)

julia> @btime mul!($C,$A,$B);
  16.077 μs (0 allocations: 0 bytes)

julia> @btime mul!($C,$B,$A);
  18.241 μs (0 allocations: 0 bytes)

xref: https://github.com/JuliaLang/julia/issues/29956

UPDATE: Just in case someone finds this, an even faster version has been suggested here.

1 Like