(I’m on Julia 0.6.2)
julia> A = sprand(100,100,0.01);
julia> B = rand(100,100);
julia> using BenchmarkTools
julia> @btime $C = $A*$B;
40.106 μs (2 allocations: 78.20 KiB)
julia> @btime $C = $B*$A; # other way around
42.666 μs (2 allocations: 78.20 KiB) # similarly fast
julia> @btime A_mul_B!($C, $A, $B); # in-place version
34.560 μs (0 allocations: 0 bytes) # faster (as expected)
julia> @btime A_mul_B!($C, $B, $A); # other way around
2.118 ms (6 allocations: 336 bytes) # much slower! (unexpected)
Questions:
- Why is the speed of `A_mul_B!` so asymmetric?
- How can I do the multiplication `B*A` in-place?
My guesses:
- I guess this might be related to the CSC format, is that right? I found out (via `@edit`; see also the `@which` check below) that Julia dispatches `A*B` to `A_mul_B!`, but `B*A` to `(*)(X::StridedMatrix{TX}, A::SparseMatrixCSC{TvA,TiA}) where {TX,TvA,TiA}`, in which it allocates `Y = zeros(promote_type(TX,TvA), mX, A.n)`.
- Based on my guess for 1), I assume the question is basically whether there is a version of the `*` implementation mentioned in 1) that takes a preallocated `Y`.
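For reference, a quick way to confirm which method each product dispatches to (output omitted here, since the printed signatures vary across Julia versions):
julia> @which A*B   # the specialized sparse-times-dense method
julia> @which B*A   # the allocating dense-times-sparse fallback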
Thanks for any comments!
Yes. SparseMatrixCSC is only good when iterating down columns; it is terrible at iterating across rows. For matrix multiplication `A*B`, you use the columns of `A` and the rows of `B`. While `B` is column-major when dense, you still know exactly where that next piece of memory is, since everything is the same size and lined up for an `Array`. For a `SparseMatrixCSC`, you have to do a search to find out how far down in the next row that value would be, so it’s a much more costly operation.
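To make that concrete, here is a small sketch of the three arrays behind a `SparseMatrixCSC` (the matrix `S` is made up for illustration):
using SparseArrays   # Julia ≥ 0.7; on 0.6 these names live in Base

S = sparse([1, 3, 2], [1, 1, 3], [10.0, 20.0, 30.0], 3, 3)

S.colptr   # column j's entries sit at positions colptr[j]:(colptr[j+1]-1)
S.rowval   # the row index of each stored entry
S.nzval    # the stored values themselves

# Walking down a column touches one contiguous slice of rowval/nzval -- cheap:
for j in 1:size(S, 2), k in nzrange(S, j)
    i, v = S.rowval[k], S.nzval[k]   # this is the entry S[i, j]
end
# There is no analogous index for walking along a row: every column would
# have to be searched, which is why row-wise access is so costly.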
SparseMatrixCSC is a great format for `A*v`, since that only uses the columns. It’s terrible for `u*A`, and it can be much better to just create a transposed copy of the sparse matrix if you need it (and Julia supports that), as sketched below.
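For example, one way to route `B*A` through the fast column-wise kernel is to materialize the transpose up front (a sketch in Julia ≥ 0.7 syntax; on 0.6 write `A.'` instead of `copy(transpose(A))`). This pays off mainly when `At` can be reused:
At = copy(transpose(A))        # one-time cost: Aᵀ stored as a new SparseMatrixCSC
Ct = At * copy(transpose(B))   # fast sparse-times-dense product; Ct == (B*A)ᵀ
C  = copy(transpose(Ct))       # transpose back, so C == B*A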
The only problem here is that the in-place method for this combination is not implemented for sparse matrices, so it falls back to the generic one, which is slow.
You can easily fix this yourself by implementing that method, copy-pasting from the Base method:
function Base.A_mul_B!(Y::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    fill!(Y, 0)           # Y is accumulated into below, so zero it first
    rowval = A.rowval     # row index of each stored entry
    nzval = A.nzval       # the stored values
    # Walk down each column of A (cheap in CSC format) for every row of X:
    @inbounds for multivec_row = 1:mX, col = 1:A.n, k = A.colptr[col]:(A.colptr[col+1]-1)
        Y[multivec_row, col] += X[multivec_row, rowval[k]] * nzval[k]
    end
    Y
end
(edited to fill `Y` with 0)
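A quick sanity check (a sketch, using the matrices from the question) that the copied method agrees with the allocating fallback:
C = Matrix{Float64}(100, 100)   # uninitialized output (Julia 0.6 constructor)
A_mul_B!(C, B, A)
@assert C ≈ B*A                 # same result as the allocating B*A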
julia> @btime A_mul_B!($C, $B, $A); # other way around
28.732 μs (0 allocations: 0 bytes)
There is a PR up that implements these in-place methods and improves upon them: https://github.com/JuliaLang/julia/pull/24045/
Thanks for the comments! Nice to see this PR.
That’s exactly what I did. However, I had to add something like
z = zero(TY)   # TY is the element type of Y
@inbounds for i in eachindex(Y)
    Y[i] = z
end
because I can’t assume that Y is initialized with zeros (I’m applying `A_mul_B!` multiple times).
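(That zeroing loop can also be written as a one-liner, equivalent as long as the element type defines `zero`:)
fill!(Y, zero(eltype(Y)))   # same effect as the explicit loop above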
The issue still persists in Julia 1.0:
julia> @btime $C = $A*$B;
18.550 μs (2 allocations: 78.20 KiB)
julia> @btime $C = $B*$A; # other way around
20.097 μs (2 allocations: 78.20 KiB)
julia> @btime mul!($C,$A,$B);
14.531 μs (0 allocations: 0 bytes)
julia> @btime mul!($C,$B,$A);
825.506 μs (6 allocations: 336 bytes)
I hope this gets fixed soon.
For completeness, one can still fix it locally by defining
import LinearAlgebra.mul!
using SparseArrays   # SparseMatrixCSC lives here on Julia ≥ 0.7

function mul!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    fill!(C, zero(eltype(C)))   # C is accumulated into below, so zero it first
    rowval = A.rowval           # row index of each stored entry
    nzval = A.nzval             # the stored values
    @inbounds for multivec_row = 1:mX, col = 1:A.n, k = A.colptr[col]:(A.colptr[col+1]-1)
        C[multivec_row, col] += X[multivec_row, rowval[k]] * nzval[k]
    end
    C
end
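Equivalently, the loop can use the public accessors `rowvals`, `nonzeros`, and `nzrange` from SparseArrays instead of reaching into the struct fields; a functionally identical sketch:
function mul!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    size(X, 2) == A.m || throw(DimensionMismatch())
    fill!(C, zero(eltype(C)))
    rv, nz = rowvals(A), nonzeros(A)   # row indices and stored values of A
    @inbounds for row = 1:size(X, 1), col = 1:A.n, k in nzrange(A, col)
        C[row, col] += X[row, rv[k]] * nz[k]
    end
    C
end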
Benchmark:
julia> @btime $C = $A*$B;
19.478 μs (2 allocations: 78.20 KiB)
julia> @btime $C = $B*$A;
22.261 μs (2 allocations: 78.20 KiB)
julia> @btime mul!($C,$A,$B);
16.077 μs (0 allocations: 0 bytes)
julia> @btime mul!($C,$B,$A);
18.241 μs (0 allocations: 0 bytes)
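(As an aside: on Julia 1.3 and later, the five-argument `mul!(C, X, A, α, β)` computes `C = α*X*A + β*C` in place, which avoids the `fill!` when you want to accumulate; I believe SparseArrays provides it for this dense-times-sparse combination, but check on your version:)
mul!(C, B, A, 1.0, 1.0)   # in-place C = B*A + C, no intermediate allocation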
xref: https://github.com/JuliaLang/julia/issues/29956
UPDATE: Just in case someone finds this, an even faster version has been suggested here.