What is the fastest way of realizing transpose of a CSR (or CSC) matrix?

I haven’t seen anything dramatic in terms of speedup on random matrices. Sometimes GB is faster, sometimes my own code (serial!) is faster.

I guess it also depends on whether it is CSC or CSR. Are GB matrices limited to CSC?

Why would it depend on this? Transposing CSC to CSC or CSR to CSR seem like exactly equivalent problems to me (because you can interpret a CSR matrix as a CSC matrix representing the transpose and vice versa).
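The duality is easy to see directly in Julia: the stored arrays of a CSC matrix, read column by column, are exactly a row-by-row (CSR) traversal of the transpose. A minimal sketch using only stdlib (the variable names are just for illustration):

```julia
using SparseArrays, LinearAlgebra

# Iterating the CSC arrays of A "as if they were CSR" reconstructs
# transpose(A): column c of A is row c of A'.
A = sprand(3, 5, 0.6)
At = zeros(size(A, 2), size(A, 1))
for c in 1:size(A, 2), k in nzrange(A, c)
    At[c, rowvals(A)[k]] = nonzeros(A)[k]
end
@assert At == Matrix(transpose(A))
```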

No, GB has explicit CSR support.


You are right; it would be, for square matrices. But for rectangular ones the dimensions may matter a lot: the arrays you need to process have very different lengths.

Here are some data points. Let us call the results unintuitive.

pkrysl@samadira sparse_transpose_tests.jl % julia --project t.jl
[ Info: Matrix M = 20133 by N = 20133, sparsity = 0.0004966969651815428, nnz = 201362
[ Info: Benchmarking copy+transpose CSC
  695.834 μs (12 allocations: 3.23 MiB)
[ Info: Benchmarking copy+transpose CSR
  1.893 s (3 allocations: 3.02 GiB)
[ Info: Benchmarking gbtranspose CSC
  823.708 μs (23 allocations: 3.69 MiB)
[ Info: Benchmarking gbtranspose CSR
  1.071 s (18 allocations: 3.02 GiB)
[ Info: Benchmarking csr_transpose
  663.541 μs (15 allocations: 3.53 MiB)
[ Info: Benchmarking csr_transpose_2
  631.500 μs (9 allocations: 3.23 MiB)
pkrysl@samadira sparse_transpose_tests.jl %

If you feel so inclined, test on your particular architecture.
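For reference, the kind of serial two-pass (counting-sort) transpose being benchmarked can be sketched as below. This is the textbook O(nnz) algorithm, not the repo's actual csr_transpose code, and the function name is made up here:

```julia
using SparseArrays, LinearAlgebra

# Two-pass counting transpose of a CSC matrix: count entries per row of A,
# prefix-sum the counts into column pointers of A', then scatter.
function counting_transpose(A::SparseMatrixCSC{Tv,Ti}) where {Tv,Ti}
    m, n = size(A)
    rv, nz = rowvals(A), nonzeros(A)
    colptr = zeros(Ti, m + 1)            # column pointers of A' (= rows of A)
    for k in 1:nnz(A)                    # pass 1: count entries in each row of A
        colptr[rv[k] + 1] += 1
    end
    colptr[1] = 1
    for r in 1:m                         # prefix sum -> start offset per row
        colptr[r + 1] += colptr[r]
    end
    rowval = Vector{Ti}(undef, nnz(A))
    nzval  = Vector{Tv}(undef, nnz(A))
    next = copy(colptr)                  # next free slot in each output column
    for c in 1:n, k in nzrange(A, c)     # pass 2: scatter entries into place
        p = next[rv[k]]
        rowval[p] = c
        nzval[p]  = nz[k]
        next[rv[k]] += 1
    end
    return SparseMatrixCSC(n, m, colptr, rowval, nzval)
end

A = sprand(2000, 1500, 1e-3)
@assert counting_transpose(A) == copy(transpose(A))
```

Because the outer scatter loop walks columns of A in order, each output column comes out with sorted row indices, so no extra sorting pass is needed.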

Some more data points: rectangular matrices.

[ Info: Matrix M = 20133 by N = 60399, sparsity = 0.0004966969651815428, nnz = 604295
[ Info: Benchmarking copy+transpose CSC
  1.987 ms (12 allocations: 9.37 MiB)
[ Info: Benchmarking copy+transpose CSR
  7.601 s (3 allocations: 9.06 GiB)
[ Info: Benchmarking gbtranspose CSC
  2.768 ms (29 allocations: 10.76 MiB)
[ Info: Benchmarking gbtranspose CSR
  3.505 s (18 allocations: 9.06 GiB)
[ Info: Benchmarking csr_transpose
  2.645 ms (15 allocations: 10.60 MiB)
[ Info: Benchmarking csr_transpose_2
  2.305 ms (9 allocations: 9.68 MiB)
pkrysl@samadira sparse_transpose_tests.jl % julia --project t2.jl
[ Info: Matrix M = 60133 by N = 20133, sparsity = 0.0001662980393461161, nnz = 201839
[ Info: Benchmarking copy+transpose CSC
  716.666 μs (12 allocations: 3.54 MiB)
[ Info: Benchmarking copy+transpose CSR
  3.836 s (3 allocations: 9.02 GiB)
[ Info: Benchmarking gbtranspose CSC
  1.047 ms (23 allocations: 4.92 MiB)
[ Info: Benchmarking gbtranspose CSR
  3.233 s (18 allocations: 9.02 GiB)
[ Info: Benchmarking csr_transpose
  784.334 μs (15 allocations: 3.54 MiB)
[ Info: Benchmarking csr_transpose_2
  768.125 μs (9 allocations: 3.23 MiB)

All of the timings were obtained on

julia> versioninfo()
Julia Version 1.11.4
Commit 8561cc3d68d (2025-03-10 11:36 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 24 × Apple M2 Ultra
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

I’ve read the manual and searched it too. I still don’t know how to force the matrix to be stored as either CSC or CSR. I believe the type is actually opaque in the Julia wrapper: it just prints GBMatrix.


[ Info: Matrix M = 2133 by N = 2133, sparsity = 0.0046882325363338025, nnz = 21356
(typeof(Ac), typeof(Ar)) = (SparseMatrixCSC{Float64, Int64}, SparseMatrixCSR{1, Float64, Int64})
(typeof(Gc), typeof(Gr)) = (GBMatrix{Float64, Float64}, GBMatrix{Float64, Float64})
norm(gbtranspose(Gc) - copy(transpose(Ac))) = 0.0
[ Info: Benchmarking copy+transpose CSC
  39.041 μs (12 allocations: 350.81 KiB)
[ Info: Benchmarking copy+transpose CSR
  22.142 ms (3 allocations: 34.71 MiB)
[ Info: Benchmarking gbtranspose CSC
  35.542 μs (21 allocations: 367.80 KiB)
[ Info: Benchmarking gbtranspose CSR
  5.583 ms (14 allocations: 34.71 MiB)
[ Info: Benchmarking csr_transpose
  45.125 μs (15 allocations: 384.31 KiB)
norm(csr_transpose(C) - copy(transpose(Ac))) = 0.0
[ Info: Benchmarking csr_transpose_2
  39.000 μs (9 allocations: 350.69 KiB)
norm(csr_transpose_2(C) - copy(transpose(Ac))) = 0.0

I tried

Gc = GBMatrix(Ac); # GraphBLAS sparse CSC copy
# setstorageorder!(Gc, ColMajor())
gbset(Gc, :format, :bycol)
Gr = GBMatrix(copy(sparsecsr(Ac))); # GraphBLAS sparse CSR copy
# setstorageorder!(Gr, RowMajor())
gbset(Gr, :format, :byrow)
@show typeof(Gc), typeof(Gr)

No change.

I admit to being stymied at the moment. @rayegun Could you please look into it? The results are hard to believe; I may be making silly mistakes.

Also, the SparseMatricesCSR timing is weird. @fverdugo?

Edit:
I believe that copy(Ar') results in a dense matrix here:

Ar = copy(sparsecsr(Ac))
copy(Ar')

I wonder if this also happens in the GraphBLAS case?
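A quick back-of-the-envelope check supports the densification hypothesis: the ~3.02 GiB reported for the CSR copy+transpose benchmark is almost exactly the footprint of a dense 20133 × 20133 Float64 matrix.

```julia
# Size of a dense 20133 x 20133 Float64 matrix in GiB:
# it matches the allocations reported above almost exactly.
M = N = 20133
dense_gib = M * N * sizeof(Float64) / 2^30
@assert isapprox(dense_gib, 3.02; atol = 0.005)
```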

Hi @PetrKryslUCSD,

We never implemented the explicit transpose in SparseMatricesCSR.jl.

OK, I’ll put together a PR then. Thanks.

I’ll respond to the rest in a moment, but random matrices are typically a poor choice for benchmarking; it is rare to come across a truly random matrix in the wild.

But indeed, as mentioned, I don’t think you will find a method that is uniformly fastest. The GraphBLAS transpose may be incorrectly tuned for certain matrices, and I’ll look into that.


You can use setstorageorder! as you did here, but I wouldn’t expect dramatically different results. You can use gbset(:burble, true) to see more information about what is happening internally (such as which method is chosen).

I will try to update your benchmarking code later today or this weekend.

GraphBLAS will never densify a matrix; indeed, the whole point of GraphBLAS is that the structure is always preserved.

In certain instances you may end up with a bitmap matrix, as the exact storage structure is up to the library.

Tested on:

julia> versioninfo()
Julia Version 1.12.5
Commit 5fe89b8ddc1 (2026-02-09 16:05 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 24 × Apple M2 Ultra
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, apple-m2)
  GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 16 virtual cores)

See: GitHub - PetrKryslUCSD/sparse_transpose_tests.jl: Tests of Julia sparse matrices

Test results:

pkrysl@samadira sparse_transpose_tests.jl % julia --project t.jl
[ Info: #############  Testing diffusionmatrix with NREFINE = 6  ##############
[ Info: Matrix M = 1073409 by N = 1073409, nnz = 28533505, sparsity = 2.4764220265962163e-5
(typeof(Ac), typeof(Ar)) = (SparseMatrixCSC{Float64, Int64}, SparseMatrixCSR{1, Float64, Int64})
[ Info: Benchmarking copy+transpose CSC
  131.150 ms (12 allocations: 443.61 MiB)
[ Info: Benchmarking copy+transpose CSR
  131.089 ms (12 allocations: 443.61 MiB)
(typeof(Gc), typeof(Gr)) = (GBMatrix{Float64, Float64}, GBMatrix{Float64, Float64})
[ Info: Benchmarking gbtranspose CSC
  130.197 ms (15 allocations: 451.81 MiB)
[ Info: Benchmarking gbtranspose CSR
  130.118 ms (11 allocations: 451.81 MiB)
[ Info: Benchmarking csr_transpose
  135.728 ms (15 allocations: 460.02 MiB)
[ Info: Benchmarking csr_transpose_2
  131.838 ms (9 allocations: 443.61 MiB)
[ Info: #############  Testing incidencematrix with NREFINE = 6  ##############
[ Info: Matrix M = 1048576 by N = 1073409, nnz = 8388608, sparsity = 7.62939453125e-6
(typeof(Ac), typeof(Ar)) = (SparseMatrixCSC{Int64, Int64}, SparseMatrixCSR{1, Int64, Int64})
[ Info: Benchmarking copy+transpose CSC
  37.382 ms (12 allocations: 136.20 MiB)
[ Info: Benchmarking copy+transpose CSR
  19.695 ms (12 allocations: 136.20 MiB)
(typeof(Gc), typeof(Gr)) = (GBMatrix{Int64, Int64}, GBMatrix{Int64, Int64})
[ Info: Benchmarking gbtranspose CSC
  37.294 ms (15 allocations: 144.03 MiB)
[ Info: Benchmarking gbtranspose CSR
  18.356 ms (11 allocations: 144.41 MiB)
[ Info: Benchmarking csr_transpose
  23.009 ms (15 allocations: 152.61 MiB)
[ Info: Benchmarking csr_transpose_2
  20.213 ms (9 allocations: 136.20 MiB)
[ Info: #############  Testing adjacencymatrix with NREFINE = 6  ##############
[ Info: Matrix M = 1073409 by N = 1073409, nnz = 28533505, sparsity = 2.4764220265962163e-5
(typeof(Ac), typeof(Ar)) = (SparseMatrixCSC{Int64, Int64}, SparseMatrixCSR{1, Int64, Int64})
[ Info: Benchmarking copy+transpose CSC
  130.512 ms (12 allocations: 443.61 MiB)
[ Info: Benchmarking copy+transpose CSR
  130.432 ms (12 allocations: 443.61 MiB)
(typeof(Gc), typeof(Gr)) = (GBMatrix{Int64, Int64}, GBMatrix{Int64, Int64})
[ Info: Benchmarking gbtranspose CSC
  129.183 ms (15 allocations: 451.81 MiB)
[ Info: Benchmarking gbtranspose CSR
  129.114 ms (11 allocations: 451.81 MiB)
[ Info: Benchmarking csr_transpose
  135.163 ms (15 allocations: 460.02 MiB)
[ Info: Benchmarking csr_transpose_2
  131.472 ms (9 allocations: 443.61 MiB)

Observations: no big surprises, no big winners.

Caveat: Creating the GB matrix this way

Gr = GBMatrix(copy(sparsecsr(Ac))); # GraphBLAS sparse CSR copy
# setstorageorder!(Gr, RowMajor())
gbset(Gr, :format, :byrow)

something goes very wrong: the matrix takes forever both to create and to test.

Edit: I think I understand the problem now: GB does not know about SparseMatricesCSR, and so it creates a dense matrix; or maybe it attempts to sparsify it. So, no mystery after all. But thanks for offering to assist, @rayegun!
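If that is the culprit, one possible workaround (a sketch using only the calls that already appear above; I have not timed it) is to let GraphBLAS ingest the CSC matrix, which it understands natively, and only then switch the storage order, instead of going through the SparseMatrixCSR wrapper:

Gr = GBMatrix(Ac)            # fast ingest from the SparseMatrixCSC
gbset(Gr, :format, :byrow)   # then request row-major (CSR-like) storage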