For a sparse matrix stored in CSC format, and without using MKLSparse, left multiplying it by a dense matrix (dense-sparse) is faster than right multiplying it by a dense matrix (sparse-dense). This is quite intuitive: the sparse matrix is in CSC and Julia stores dense matrices in column-major order, so the dense-sparse product can walk both the sparse and the dense operand column by column. Here is one benchmark example.

```julia
using LinearAlgebra, SparseArrays
using BenchmarkTools
BLAS.set_num_threads(1)
n = 5000
m = 30000           # target number of nonzeros
r = 20
ratio = m / n^2     # density that gives roughly m nonzeros
B = sprand(n, n, ratio)
SymB = max.(B, B')  # symmetrize the sparsity pattern
R = randn(n, r)
Rt = Matrix(R')     # materialized transpose for the dense-sparse product
C = zeros(n, r)
Ct = zeros(r, n)
```

The result is

```
julia> @btime mul!($C, $SymB, $R)
  2.481 ms (0 allocations: 0 bytes)

julia> @btime mul!($Ct, $Rt, $SymB)
  1.132 ms (0 allocations: 0 bytes)

julia> @which mul!(C, SymB, R)
mul!(C, A, B)
@ LinearAlgebra /u/subspace_s4/software/julia-1.9.0/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:275

julia> @which mul!(C, Rt, SymB)
mul!(C, A, B)
@ LinearAlgebra /u/subspace_s4/software/julia-1.9.0/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:275
```
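To make the access-pattern argument concrete, here is a minimal hand-written kernel for the dense-sparse product `C = A * S` with `S` in CSC. This is a sketch for illustration, not the stdlib implementation, and `dense_times_csc!` is a name introduced here: each stored entry `S[k, j]` contributes `v * A[:, k]` to `C[:, j]`, so both dense matrices are traversed down their columns, exactly what column-major storage favors.

```julia
using SparseArrays

# Illustrative sketch (not the stdlib code): C = A * S with S::SparseMatrixCSC.
function dense_times_csc!(C::AbstractMatrix, A::AbstractMatrix, S::SparseMatrixCSC)
    fill!(C, zero(eltype(C)))
    rows = rowvals(S)
    vals = nonzeros(S)
    @inbounds for j in 1:size(S, 2)   # walk S column by column (cheap in CSC)
        for idx in nzrange(S, j)      # stored entries of column j
            k = rows[idx]
            v = vals[idx]
            for i in 1:size(A, 1)     # C[:, j] += v * A[:, k]: column-major access
                C[i, j] += v * A[i, k]
            end
        end
    end
    return C
end
```

A sparse-dense product `S * D` with the same CSC layout instead scatters into rows of the output for each stored entry, which is the unfavorable direction for column-major storage.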

The speed difference is significant. MKLSparse greatly speeds up sparse-dense matrix multiplication, but dense-sparse performance stays the same because that path is not implemented. Here are some simple results.

```
julia> using MKLSparse

julia> @btime mul!($C, $SymB, $R)
  440.872 μs (0 allocations: 0 bytes)

julia> @btime mul!($Ct, $Rt, $SymB)
  1.118 ms (0 allocations: 0 bytes)

julia> @which mul!(C, SymB, R)
mul!(C::StridedMatrix{Float64}, adjA::SparseMatrixCSC{Float64, Int64}, B::StridedMatrix{Float64})
@ MKLSparse.BLAS ~/.julia/packages/MKLSparse/4LlPk/src/BLAS/level_2_3/matmul.jl:36

julia> @which mul!(Ct, Rt, SymB)
mul!(C, A, B)
@ LinearAlgebra /u/subspace_s4/software/julia-1.9.0/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:275
```

I understand it is simple to speed up dense-sparse multiplication by transposing and then performing a sparse-dense multiplication. I wonder whether there is still any way to make dense-sparse faster than sparse-dense, given that without MKLSparse dense-sparse was about 2x faster.
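For reference, the transpose trick mentioned above can be sketched as follows. It rewrites `Ct = Rᵀ * S` as `(Sᵀ * R)ᵀ` so the fast sparse-dense path is hit; the function name and the scratch buffer `tmp` are introduced here for illustration, and whether MKLSparse also dispatches on the lazily transposed sparse argument is an assumption (if not, one can precompute `sparse(transpose(S))` once).

```julia
using LinearAlgebra, SparseArrays

# Sketch: compute Ct = Rᵀ * S via a sparse-dense product plus a transpose.
# tmp must have size (size(S, 2), size(R, 2)); it holds Sᵀ * R.
function dense_times_sparse_via_transpose!(Ct, R, S::SparseMatrixCSC, tmp)
    mul!(tmp, transpose(S), R)  # sparse-dense product: tmp = Sᵀ * R
    transpose!(Ct, tmp)         # Ct = (Sᵀ R)ᵀ = Rᵀ S
    return Ct
end
```

The extra `transpose!` is an O(size of output) copy, so for the sizes above it should be cheap compared to the multiplication itself.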