How to speed up dense-sparse matrix multiplication where the sparse matrix is in CSC?

For a sparse matrix stored in CSC format, without using MKLSparse, left multiplying it by a dense matrix is faster than right multiplying it by a dense matrix. This is quite intuitive because the sparse matrix is in CSC, and Julia stores dense matrices in the column-major order. Here is one benchmark example.

using LinearAlgebra, SparseArrays
using BenchmarkTools

BLAS.set_num_threads(1)

n = 5000 
m = 30000 
r = 20 

ratio = m / n^2
B = sprand(n, n, ratio)
SymB = max.(B, B')

R = randn(n, r)
Rt = Matrix(R')

C = zeros(n, r)
Ct = zeros(r, n)

The result is

julia> @btime mul!($C, $SymB, $R)
       @btime mul!($Ct, $Rt, $SymB)
  2.481 ms (0 allocations: 0 bytes)
  1.132 ms (0 allocations: 0 bytes)

julia> @which mul!(C, SymB, R)
mul!(C, A, B)
     @ LinearAlgebra /u/subspace_s4/software/julia-1.9.0/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:275

julia> @which mul!(C, Rt, SymB)
mul!(C, A, B)
     @ LinearAlgebra /u/subspace_s4/software/julia-1.9.0/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:275

There is some significant speed difference. MKLSparse greatly speeds up the sparse-dense matrix multiplications, but dense-sparse’s performance stays the same as it’s not implemented. Here are some simple results.

julia> using MKLSparse
       @btime mul!($C, $SymB, $R)
       @btime mul!($Ct, $Rt, $SymB)
  440.872 μs (0 allocations: 0 bytes)
  1.118 ms (0 allocations: 0 bytes)

julia> @which mul!(C, SymB, R)
mul!(C::StridedMatrix{Float64}, adjA::SparseMatrixCSC{Float64, Int64}, B::StridedMatrix{Float64})
     @ MKLSparse.BLAS ~/.julia/packages/MKLSparse/4LlPk/src/BLAS/level_2_3/matmul.jl:36

julia> @which mul!(Ct, Rt, SymB)
mul!(C, A, B)
     @ LinearAlgebra /u/subspace_s4/software/julia-1.9.0/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:275

I understand it’s simple to speed up the dense-sparse multiplications by performing transposes and then sparse-dense multiplications. I wonder whether there is still any possibility to let dense-sparse faster than sparse-dense as we can see previously dense-sparse is about 2x faster.