What is the fastest way of realizing transpose of a CSR (or CSC) matrix?

What is the fastest way of realizing a sparse matrix transpose? I don’t mean a lazy transpose; I mean a materialized result such that typeof(CSC') === typeof(CSC) (and analogously for CSR).

There is the algorithm of Gustavson (1978), but it doesn’t seem to have been coded in Julia yet, AFAICT.

Would any of the sparse-algebra experts know?

Can you not just do this:

julia> using SparseArrays

julia> a = sprandn(100, 100, 1/100);

julia> typeof(sparse(a')) === typeof(a)
true
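
To spell out why the sparse(...) (or an equivalent copy) call is needed — assuming current Julia semantics, a' by itself is only a lazy Adjoint wrapper, and materializing it is what gives back a SparseMatrixCSC:

```julia
julia> using SparseArrays

julia> a = sprandn(100, 100, 1/100);

julia> typeof(a')    # lazy wrapper, not a concrete sparse matrix
Adjoint{Float64, SparseMatrixCSC{Float64, Int64}}

julia> typeof(copy(a')) === typeof(a)    # copy materializes the transpose
true
```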

I would look in GraphBLAS/Source/transpose at stable · DrTimothyAldenDavis/GraphBLAS · GitHub. Be aware that the code is Apache-licensed, but the methods in there are almost certainly the fastest for general CSC/CSR matrices.

The code is a little complicated but not terrible; if you need help, I’m fairly familiar with it. I believe one of those codes is Gustavson’s method, but it’s been a while since I looked into it.

Regardless, there is a merge-sort method and a bucket-sort method, with some variants.
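
For reference, the bucket-sort (counting) approach is short enough to sketch directly in Julia. This is my own illustrative version, not a translation of the GraphBLAS code: count the entries in each row of A, prefix-sum those counts into the transpose’s column pointers, then scatter each stored entry into place.

```julia
using SparseArrays

# Illustrative counting-sort transpose of a CSC matrix (the function
# name `csc_transpose` is made up for this sketch).
function csc_transpose(A::SparseMatrixCSC{Tv,Ti}) where {Tv,Ti}
    m, n = size(A)
    rv, nz = rowvals(A), nonzeros(A)
    colptrT = zeros(Ti, m + 1)
    # count the nonzeros in each row of A (= each column of A')
    for k in 1:nnz(A)
        colptrT[rv[k] + 1] += 1
    end
    # prefix sum: colptrT[c] now points at the start of column c of A'
    colptrT[1] = one(Ti)
    for c in 1:m
        colptrT[c + 1] += colptrT[c]
    end
    rowvalT = Vector{Ti}(undef, nnz(A))
    nzvalT  = Vector{Tv}(undef, nnz(A))
    next = copy(colptrT)              # next free slot per column of A'
    for j in 1:n, k in nzrange(A, j)  # column order keeps row indices sorted
        p = next[rv[k]]
        rowvalT[p] = j
        nzvalT[p] = nz[k]
        next[rv[k]] += 1
    end
    return SparseMatrixCSC(n, m, colptrT, rowvalT, nzvalT)
end
```

This is essentially the counting pass that Gustavson-style transposition performs; the total work is linear in the matrix dimensions plus nnz(A).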

That code is wrapped in SuiteSparseGraphBLAS.jl if you want to try it out before translating it to Julia. I haven’t released updates to it in a while, but I intend to do a new release soon and can answer any questions about it.


Is SparseArrays.ftranspose! not fast enough?

julia> using SparseArrays

julia> A = sparse([1, 2, 3, 1, 4], [1, 2, 3, 5, 6], [10.0, 20.0, 30.0, 50.0, 60.0], 4, 6)
4×6 SparseMatrixCSC{Float64, Int64} with 5 stored entries:
 10.0    ⋅     ⋅    ⋅   50.0    ⋅ 
   ⋅   20.0    ⋅    ⋅     ⋅     ⋅ 
   ⋅     ⋅   30.0   ⋅     ⋅     ⋅ 
   ⋅     ⋅     ⋅    ⋅     ⋅   60.0

julia> X = spzeros(Float64, 6, 4)   
6×4 SparseMatrixCSC{Float64, Int64} with 0 stored entries:
  ⋅    ⋅    ⋅    ⋅ 
  ⋅    ⋅    ⋅    ⋅ 
  ⋅    ⋅    ⋅    ⋅ 
  ⋅    ⋅    ⋅    ⋅ 
  ⋅    ⋅    ⋅    ⋅ 
  ⋅    ⋅    ⋅    ⋅ 

julia> SparseArrays.ftranspose!(X, A, identity)
6×4 SparseMatrixCSC{Float64, Int64} with 5 stored entries:
 10.0    ⋅     ⋅     ⋅ 
   ⋅   20.0    ⋅     ⋅ 
   ⋅     ⋅   30.0    ⋅ 
   ⋅     ⋅     ⋅     ⋅ 
 50.0    ⋅     ⋅     ⋅ 
   ⋅     ⋅     ⋅   60.0

(Note that this is what copy(transpose(A)) calls, if you don’t care about pre-allocated output.)

This is, in fact, implemented in the Julia SparseArrays stdlib, and it’s what ftranspose! (and hence copy(transpose(A))) uses, via the (internal/non-public) SparseArrays.halfperm! function, whose docstring says:

This method implements the HALFPERM algorithm described in F. Gustavson, “Two fast algorithms for sparse matrices: multiplication and permuted transposition,” ACM TOMS 4(3), 250-269 (1978). The algorithm runs in O(size(A, 1), size(A, 2), nnz(A)) time and requires no space beyond that passed in.


It seems to be about the same speed as the SparseArrays code for a random sparse matrix:

julia> using SparseArrays, SuiteSparseGraphBLAS, BenchmarkTools

julia> A = sprand(10^4, 10^4, 1e-3);

julia> G = GBMatrix(A); # GraphBLAS sparse CSC copy

julia> gbtranspose(G) == copy(transpose(A)) # check
true

julia> @btime copy(transpose($A));       # native Julia transpose
  210.833 μs (12 allocations: 1.66 MiB)

julia> @btime gbtranspose($G);           # GraphBLAS transpose
  200.708 μs (15 allocations: 1.75 MiB)

(5% difference.)

If I try a more structured matrix, e.g. a tridiagonal matrix, then GraphBLAS is about 30% faster:

julia> A = spdiagm(0 => ones(10^4), 1 => fill(2,10^4-1), -1 => fill(3,10^4-1));

julia> G = GBMatrix(A);

julia> gbtranspose(G) == copy(transpose(A))
true

julia> @btime copy(transpose($A));
  49.125 μs (12 allocations: 608.30 KiB)

julia> @btime gbtranspose($G); 
  38.042 μs (15 allocations: 704.45 KiB)

So it’s a measurable speedup, but not overwhelming.


That’s a somewhat small matrix with random sparsity, both of which are limited ways to benchmark (but still valuable). I would expect a more dramatic difference with matrices big enough that SuiteSparseGraphBLAS.jl uses threads (which is unavailable in SparseArrays.jl). If your matrices aren’t big enough to make that valuable, then indeed I don’t think there are any methods appreciably faster than halfperm.

In the end, I always recommend users benchmark with their own data; there is rarely a uniformly better algorithm for sparse operations.

I’m not seeing a speedup from threads for a much larger (random) matrix:

julia> A = sprand(10^8, 10^8, 1e-8); G = GBMatrix(A);

julia> @btime copy(transpose($A));
  2.362 s (12 allocations: 2.24 GiB)

julia> @btime gbtranspose($G, desc=Descriptor(; nthreads=1));
  2.001 s (15 allocations: 2.98 GiB)

julia> @btime gbtranspose($G, desc=Descriptor(; nthreads=2));
  2.454 s (16 allocations: 3.73 GiB)

(This is on a fairly lightweight Apple M4 with 4 performance cores, YMMV.)