Slow multiplication of transpose of sparse matrix

I’m having speed issues multiplying the transpose of a sparse matrix with a column vector.

In my code the matrix A is

501×501 SparseMatrixCSC{Float64, Integer} with 1501 stored entries:
⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸
⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸
⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸
⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⠀⠀⢸
⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⠀⠀⢸
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⠀⠀⢸
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⠀⠀⢸
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⣦⡀⢸
⠀⠀⠀⠀⠀⠀⠀⠀⠀⡀⠀⠀⠀⠀⠀⠈⠻⣾

These are the results I get for the multiplication with f = rand(Float64, 501, 1):

Method 1

A_tr = transpose(A)

@benchmark A_tr*f
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max): 350.083 μs … 9.066 ms ┊ GC (min … max): 0.00% … 95.44%
 Time (median): 361.208 μs ┊ GC (median): 0.00%
 Time (mean ± σ): 380.269 μs ± 355.997 μs ┊ GC (mean ± σ): 4.06% ± 4.15%
 Memory estimate: 218.70 KiB, allocs estimate: 11736.

Method 2

A_tr = Matrix(transpose(A))

@benchmark A_tr*f
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max): 87.375 μs … 210.875 μs ┊ GC (min … max): 0.00% … 0.00%
 Time (median): 88.542 μs ┊ GC (median): 0.00%
 Time (mean ± σ): 89.286 μs ± 3.266 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
 Memory estimate: 4.06 KiB, allocs estimate: 1.

Method 3

A_tr = sparse(Matrix(transpose(A)))

@benchmark A_tr*f
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max): 2.102 μs … 1.017 ms ┊ GC (min … max): 0.00% … 99.40%
 Time (median): 2.477 μs ┊ GC (median): 0.00%
 Time (mean ± σ): 2.725 μs ± 13.428 μs ┊ GC (mean ± σ): 6.92% ± 1.41%
 Memory estimate: 4.06 KiB, allocs estimate: 1.

Why doesn’t Method 1 perform similarly to Method 3? I’m probably missing something basic here.

Thank you for your help!

Cross-post: julia - Slow multiplication of transpose of sparse matrix - Stack Overflow

Welcome to the Julia community.

While I am not sure I can help you much, I wanted to comment on what Karpinski wrote in a comment on Stack Overflow:

Not an answer, but transposing a sparse matrix is not a trivial operation. Once the transpose is computed, the multiplication should be fast.

Yes, transposing a sparse (or dense) matrix is non-trivial/costly (if Julia actually did it for you), but one of the things I find most awesome in Julia is the lazy transpose, which is free and (now) the default.

I thought he might be saying a lazy transpose (or adjoint) isn’t available for sparse matrices, but I confirmed it is.

julia> transpose(sparse(A))  # the adjoint is lazy too; for a real matrix you can write sparse(A)'
3×3 Transpose{Float64, SparseMatrixCSC{Float64, Int64}} with 3 stored entries

I thought the point of this free laziness was that subsequent operations could be fast, or at least fast when taken together, but maybe it hasn’t been exploited for all operations, e.g. not yet for multiplication (each such operation needs a special-case implementation for when it is handed the Transpose wrapper).
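In the meantime, if you multiply by the same transposed matrix many times, materializing it once should amortize the conversion cost over all the multiplications. A minimal sketch (using a random stand-in for the matrix in the question):

```julia
using SparseArrays

A  = sprand(501, 501, 0.006)   # stand-in for the matrix in the question
f  = rand(501)
At = sparse(transpose(A))      # pay the O(nnz) conversion once...
y  = At * f                    # ...then every multiply uses the fast CSC path
```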

I’m not really up to speed on SparseMatricesCSR.jl (or whether it can help you): Home · SparseMatricesCSR.jl

Build a 1-based SparseMatrixCSR from the lazy transpose of a SparseMatrixCSC.
[…]
Build a Bi-based SparseMatrixCSR from the lazy transpose of a SparseMatrixCSC.
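Going by the constructors quoted above, a sketch of how that package could be used (assuming SparseMatricesCSR.jl is installed; I haven’t benchmarked this myself):

```julia
using SparseArrays, SparseMatricesCSR  # assumes the SparseMatricesCSR.jl package

A    = sprand(501, 501, 0.006)
Acsr = SparseMatrixCSR(transpose(A))   # 1-based CSR built from the lazy transpose
y    = Acsr * rand(501)                # CSR * vector, no materialized CSC transpose
```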

Which version of Julia are you using? I get these results

julia> using LinearAlgebra, SparseArrays

julia> M = sprand(500, 500, 0.006); # uniform random pattern

julia> DM = sparse(diagm(0 => rand(500), -1 => rand(499), 1 => rand(499))); # tridiagonal pattern

julia> b = rand(500, 1);

julia> tM = transpose(M);

julia> stM = sparse(transpose(M));

julia> tDM = transpose(DM);

julia> stDM = sparse(transpose(DM));

julia> @benchmark $tM*$b
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.498 μs … 244.536 μs  ┊ GC (min … max): 0.00% … 92.85%
 Time  (median):     2.785 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.045 μs ±   5.041 μs  ┊ GC (mean ± σ):  3.40% ±  2.05%

  ▆▇███▇▆▁▁▁▂▁                                                ▂
  ████████████▇▆▇█▇▇▆▇▇██▇▆▇▇██▇▆▆▇█▇▆▆▅▆▅▅▄▂▅▄▄▅▅▄▄▃▂▄▃▂▃▄▄▅ █
  2.5 μs       Histogram: log(frequency) by time      6.96 μs <

 Memory estimate: 4.06 KiB, allocs estimate: 1.

julia> @benchmark $stM*$b
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.831 μs … 263.966 μs  ┊ GC (min … max): 0.00% … 90.33%
 Time  (median):     4.282 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.792 μs ±   6.691 μs  ┊ GC (mean ± σ):  3.38% ±  2.40%

  ▂▅▆▇█▆▄▂▁▁ ▁  ▁▁ ▁ ▂▁▁  ▁ ▁▁▁ ▂▁                            ▂
  █████████████████████████████████▆▅▅▅▄▅▄▅▅▃▆▆▄▅▄▄▄▄▄▃▄▄▄▄▅▅ █
  3.83 μs      Histogram: log(frequency) by time       9.7 μs <

 Memory estimate: 4.06 KiB, allocs estimate: 1.

julia> @benchmark $tDM*$b
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.612 μs … 228.097 μs  ┊ GC (min … max): 0.00% … 90.08%
 Time  (median):     2.779 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.061 μs ±   4.977 μs  ┊ GC (mean ± σ):  3.33% ±  2.04%

    ▇█▇▆▄▃▂                                                   ▂
  ▆███████████▇███▇▆▇▇█▇▇▇▅▆▆▅▆▅▅▅▅▆▆▆█▇▇▇▇▆▇▆▆▇▇▇▇▇▇▇▇▇▆▆▇▅▆ █
  2.61 μs      Histogram: log(frequency) by time      5.14 μs <

 Memory estimate: 4.06 KiB, allocs estimate: 1.

julia> @benchmark $stDM*$b
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  3.071 μs … 254.804 μs  ┊ GC (min … max): 0.00% … 88.67%
 Time  (median):     3.367 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.252 μs ±   5.283 μs  ┊ GC (mean ± σ):  2.45% ±  2.03%

   ▄█▇▆▃▁▁▁▁▂▂      ▁▁▁▁ ▁ ▁▁▁▁▁                  ▁▃▃▂▃▂▁▁    ▂
  ▇██████████████████████████████▇▆▆▅▆▅▄▄▄▅▅▄▄▃▄▆▅████████▇▅▅ █
  3.07 μs      Histogram: log(frequency) by time      8.05 μs <

 Memory estimate: 4.06 KiB, allocs estimate: 1.

julia> @benchmark transpose($M)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.879 ns … 39.480 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.993 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.088 ns ±  0.651 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ ▇▇ ▄▃ ▁▂  ▅                                              ▂
  █▇██▇██▁██▁▄█▆▄▆█▇▆▆▆▆▆▆▆▆▆▆▆▆▆▇▆▆▅▅▆▅▆▆▆▆▆▆▆▆▆▆▆▆▆▇▆▇▆▅▆▅ █
  2.88 ns      Histogram: log(frequency) by time     5.19 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sparse(transpose($M))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.795 μs …  1.440 ms  ┊ GC (min … max): 0.00% … 88.11%
 Time  (median):     10.711 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.932 μs ± 33.937 μs  ┊ GC (mean ± σ):  6.48% ±  2.27%

   ▃▆██▇▆▅▄▃▂▁▂▂▁                                             ▂
  ▆███████████████▇▇▆▇▅▆▇▆▆▆▆▇▅▅▆▅▅▆▅▄▅▆▂▅▄▄▅▃▃▃▄▃▅▂▃▂▄▄▄▄▄▄▃ █
  9.8 μs       Histogram: log(frequency) by time      21.1 μs <

 Memory estimate: 28.69 KiB, allocs estimate: 8.

julia> @benchmark transpose($DM)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.880 ns … 32.195 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.004 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.165 ns ±  0.707 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂▇█▃▅▃▁▅▁ ▁                                               ▂
  █████████████▇████▇██▇█▇█▇█▇█▇▆▇▇▆▇▇▆▆▆▇▆▆▆▆▆▆▅▆▆▅▅▅▅▅▆██▇ █
  2.88 ns      Histogram: log(frequency) by time     5.28 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sparse(transpose($DM))
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.219 μs … 493.087 μs  ┊ GC (min … max): 0.00% … 82.39%
 Time  (median):     7.761 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.171 μs ±  17.514 μs  ┊ GC (mean ± σ):  8.24% ±  4.25%

  ▂▇█▇▆▄▂▂▂▂▁▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁                                  ▂
  █████████████████████████████▇▇▆▆▆▇▆▇▆▅▄▅▃▆▆▆▄▃▅▁▄▄▃▃▄▄▅▅▄▆ █
  7.22 μs      Histogram: log(frequency) by time      17.8 μs <

 Memory estimate: 27.94 KiB, allocs estimate: 8.

with versioninfo being

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4

This looks weird to me, as the timing with the transpose wrapper is better than without it, for both sparsity patterns. :sweat_smile:

Also, you didn’t interpolate the variables; to use @benchmark properly you need to interpolate global variables. You can read the BenchmarkTools documentation about this; the section on interpolating values into benchmark expressions is the important part.

Without this, the benchmarks might not be representative. But on my computer, not interpolating doesn’t seem to make any difference in this case. :thinking:

Integer is an abstract type, so working with it is going to be slow. How did you construct this matrix? (See the performance tip: Avoid containers with abstract types.)

If you use concrete types it should be way faster, with or without the transpose:

julia> using BenchmarkTools, SparseArrays

julia> A = sprand(501,501,0.006); println(summary(A))
501×501 SparseMatrixCSC{Float64, Int64} with 1529 stored entries

julia> x = rand(501);  # note: no need for 501,1 — Julia has 1d arrays

julia> @btime $A * $x;
  2.543 μs (1 allocation: 4.06 KiB)

julia> @btime $(transpose(A)) * $x;
  2.626 μs (1 allocation: 4.06 KiB)

julia> @btime $A' * $x;  # shorter transpose for real A
  2.565 μs (1 allocation: 4.06 KiB)
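For reference, here is one guess at how such a matrix could arise, and the one-time conversion that fixes it (the construction from `Vector{Integer}` index vectors is an assumption, not necessarily how your A was built):

```julia
using SparseArrays

# Hypothetical reconstruction: index vectors with abstract element type Integer
# propagate that abstract type into the resulting sparse matrix.
Is = Vector{Integer}(1:500); Js = Vector{Integer}(1:500); Vs = rand(500)
B  = sparse(Is, Js, Vs)                # SparseMatrixCSC{Float64, Integer}

# One-time conversion to a concrete index type restores fast multiplication:
C  = SparseMatrixCSC{Float64, Int}(B)
```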

PS. Please quote your code blocks with triple backticks.
