What is the fastest way of realizing a sparse matrix transpose? I don’t mean a lazy transpose; I mean a realization such that typeof(CSC') === typeof(CSC) (and analogously for CSR).
There seems to be the algorithm of Gustavson 1978, but this doesn’t seem to have been coded in Julia yet, AFAICT.
Would any of the sparse-algebra experts know?
Can you not just do this:
julia> a=sprandn(100,100,1/100);
julia> typeof(sparse(a')) === typeof(a)
true
I would look in: GraphBLAS/Source/transpose at stable · DrTimothyAldenDavis/GraphBLAS · GitHub. Be aware that the code is Apache-licensed. But the methods in there are almost certainly the fastest for general CSC/CSR matrices.
The code is a little complicated but not terrible; if you need help, I’m fairly familiar with it. I believe one of those codes is the method from Gustavson, but it’s been a while since I looked into it.
Regardless, there is a merge-sort method and a bucket-sort method, with some variants.
That code is wrapped in SuiteSparseGraphBLAS.jl if you want to try it out before translating it to Julia. I haven’t released updates to it in a while but intend to do a new release soon and can answer any questions on it.
is SparseArrays.ftranspose! not fast enough?
julia> using SparseArrays
julia> A = sparse([1, 2, 3, 1, 4], [1, 2, 3, 5, 6], [10.0, 20.0, 30.0, 50.0, 60.0], 4, 6)
4×6 SparseMatrixCSC{Float64, Int64} with 5 stored entries:
10.0 ⋅ ⋅ ⋅ 50.0 ⋅
⋅ 20.0 ⋅ ⋅ ⋅ ⋅
⋅ ⋅ 30.0 ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅ ⋅ 60.0
julia> X = spzeros(Float64, 6, 4)
6×4 SparseMatrixCSC{Float64, Int64} with 0 stored entries:
⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅
julia> SparseArrays.ftranspose!(X, A, identity)
6×4 SparseMatrixCSC{Float64, Int64} with 5 stored entries:
10.0 ⋅ ⋅ ⋅
⋅ 20.0 ⋅ ⋅
⋅ ⋅ 30.0 ⋅
⋅ ⋅ ⋅ ⋅
50.0 ⋅ ⋅ ⋅
⋅ ⋅ ⋅ 60.0
(Note that this is what copy(transpose(A)) calls, if you don’t care about pre-allocated output.)
This is, in fact, implemented in the Julia SparseArrays stdlib, and it’s what ftranspose! (and hence copy(transpose(A))) uses, via the (internal/non-public) SparseArrays.halfperm! function, whose docstring says:
This method implements the HALFPERM algorithm described in F. Gustavson, “Two fast algorithms for sparse matrices: multiplication and permuted transposition,” ACM TOMS 4(3), 250-269 (1978). The algorithm runs in O(size(A, 1) + size(A, 2) + nnz(A)) time and requires no space beyond that passed in.
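So for the original question, the eager transpose with matching type is already one line in the stdlib; a quick check:

```julia
using SparseArrays

A = sprandn(100, 100, 0.01)       # random sparse CSC matrix
At = copy(transpose(A))           # eager transpose (goes through halfperm!)
@assert typeof(At) === typeof(A)  # same concrete SparseMatrixCSC type
@assert At == transpose(A)        # same values as the lazy transpose
```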
It seems to be about the same speed as the SparseArrays code for a random sparse matrix:
julia> using SparseArrays, SuiteSparseGraphBLAS, BenchmarkTools
julia> A = sprand(10^4, 10^4, 1e-3);
julia> G = GBMatrix(A); # GraphBLAS sparse CSC copy
julia> gbtranspose(G) == copy(transpose(A)) # check
true
julia> @btime copy(transpose($A)); # native Julia transpose
210.833 μs (12 allocations: 1.66 MiB)
julia> @btime gbtranspose($G); # GraphBLAS transpose
200.708 μs (15 allocations: 1.75 MiB)
(5% difference.)
If I try a more structured matrix, e.g. a tridiagonal matrix, then GraphBLAS is about 30% faster:
julia> A = spdiagm(0 => ones(10^4), 1 => fill(2,10^4-1), -1 => fill(3,10^4-1));
julia> G = GBMatrix(A);
julia> gbtranspose(G) == copy(transpose(A))
true
julia> @btime copy(transpose($A));
49.125 μs (12 allocations: 608.30 KiB)
julia> @btime gbtranspose($G);
38.042 μs (15 allocations: 704.45 KiB)
So it’s a measurable speedup, but not overwhelming.
That’s a somewhat small matrix with random sparsity, both of which are somewhat limited ways to benchmark (but still valuable). I would expect a more dramatic difference with matrices big enough to end up using threads in GraphBLAS.jl (which is unavailable in SparseArrays.jl). If your matrices aren’t big enough to make that valuable, then indeed I don’t think there are any methods appreciably faster than halfperm.
In the end I always recommend users benchmark with their own data, there is rarely a uniformly better algorithm for sparse operations.
I’m not seeing a speedup from threads for a much larger (random) matrix:
julia> A = sprand(10^8, 10^8, 1e-8); G = GBMatrix(A);
julia> @btime copy(transpose($A));
2.362 s (12 allocations: 2.24 GiB)
julia> @btime gbtranspose($G, desc=Descriptor(; nthreads=1));
2.001 s (15 allocations: 2.98 GiB)
julia> @btime gbtranspose($G, desc=Descriptor(; nthreads=2));
2.454 s (16 allocations: 3.73 GiB)
(this is on a fairly lightweight Apple M4 with 4 performance cores, YMMV).
For various matrices taken from SuiteSparse I get numbers like:
julia> @btime gbtranspose($A); # 4 threads
652.144 ms (11 allocations: 2.50 GiB)
julia> gbset(:nthreads, 1); @btime gbtranspose($A);
301.699 ms (11 allocations: 2.50 GiB)
julia> gbset(:nthreads, 2); @btime gbtranspose($A);
956.217 ms (12 allocations: 2.64 GiB)
julia> @btime copy(transpose($B));
1.286 s (12 allocations: 2.64 GiB)
Indeed it looks like at least on M-series chips that threading for the transpose has limited use-cases. But there are real-world cases where gbtranspose comes out ~4x faster (at one thread ;)). I suspect transpose is fairly hard to tune and explicit transposes aren’t super necessary in things like mxm as we can typically use them without the intermediate transpose result.
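The point about not needing the intermediate transpose holds on the Julia side too: transpose(A) is a lazy wrapper, and multiplying by it dispatches to a kernel that walks A’s original CSC storage, so the explicit transpose can often be skipped entirely. A small sketch:

```julia
using SparseArrays

A = sprand(1000, 1000, 1e-2)
x = rand(1000)

# transpose(A) is lazy; this multiply uses A's CSC storage directly,
# with no materialized transpose.
y1 = transpose(A) * x
# Same result via the explicitly realized transpose.
y2 = copy(transpose(A)) * x
@assert y1 ≈ y2
```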
@DrTimothyAldenDavis might have some commentary (note Tim I confirmed most of these are running bucket in either atomic or non-atomic mode)
The GraphBLAS implementation is threaded by default. But it seems that, below a certain size at least, the threading is actually harmful to performance due to contention or other overheads. At one thread, though, the SuiteSparse code is still much faster than the Julia code.
The “bucket transpose” method I have is parallel, but it’s difficult to get good parallel performance out of it. It’s not supposed to slow down, though; I try to limit the # of threads with a heuristic so that it doesn’t. I see that for this matrix my heuristic is giving it too many threads. Which matrix is it? Which CPU is it on?
My builder-based transpose is much easier to parallelize so perhaps my heuristic should pick that method.
On the GPU, I’m seeing up to 12x performance gain from a parallel CPU transpose to the GPU transpose, by the way. That’s on a Volta V100.
Wait, why is there a huge difference between 1-thread GB and copy + transpose? Above, we saw numerous examples that illustrated a lack of difference between the two…
I could. I do. But the question was: is that the fastest performance one could get?
You are a master of understatement.
Thanks for looking into this!
There is no mention of “builder” in the user guide. Where does it live?
Does BLAS.set_num_threads(n) control the number of threads in GB?
Here is the complete algorithm of Gustavson 1978 (well, I wrote it from scratch). It is remarkable how concise Julia can be. 
function realize_csr_transpose_2(IA, JA, VA, m, n)
    IAT = zeros(eltype(IA), n + 1)
    for k in eachindex(JA)
        IAT[JA[k]+1] += 1        # count entries destined for each output row
    end
    ps = view(IAT, 2:length(IAT))
    sum = 1
    for i in 1:n                 # exclusive prefix sum: start position of each output row
        c = ps[i]
        ps[i] = sum
        sum += c
    end
    JAT = similar(JA)
    VAT = similar(VA)
    for k in 1:m                 # scatter pass over the input rows
        for j in IA[k]:IA[k+1]-1
            ti = JA[j]
            p = ps[ti]
            JAT[p] = k
            VAT[p] = VA[j]
            ps[ti] += 1          # advance cursor; ps ends up as IAT[2:end]
        end
    end
    IAT[1] = 1
    return IAT, JAT, VAT
end
function csr_transpose_2(A::SparseMatrixCSR)
    IAT, JAT, VAT = realize_csr_transpose_2(A.rowptr, A.colval, A.nzval, A.m, A.n)
    return SparseMatrixCSR{1}(A.n, A.m, IAT, JAT, VAT)
end
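A note on reusing this for CSC (my addition, not from the thread): a SparseMatrixCSC’s (colptr, rowval, nzval) arrays are exactly the CSR arrays of its transpose, so the same routine realizes a CSC-to-CSC transpose with no extra code. A self-contained sketch (the function is repeated with comments so the snippet runs on its own; csc_transpose_2 is a name I made up):

```julia
using SparseArrays

# Counting/scatter transpose of the CSR (or, symmetrically, CSC) arrays
# of an m×n matrix; returns the transposed arrays.
function realize_csr_transpose_2(IA, JA, VA, m, n)
    IAT = zeros(eltype(IA), n + 1)
    for k in eachindex(JA)
        IAT[JA[k]+1] += 1        # count entries per output row
    end
    ps = view(IAT, 2:length(IAT))
    sum = 1
    for i in 1:n                 # exclusive prefix sum -> start positions
        c = ps[i]
        ps[i] = sum
        sum += c
    end
    JAT = similar(JA)
    VAT = similar(VA)
    for k in 1:m                 # scatter pass
        for j in IA[k]:IA[k+1]-1
            ti = JA[j]
            p = ps[ti]
            JAT[p] = k
            VAT[p] = VA[j]
            ps[ti] += 1          # advance insertion cursor
        end
    end
    IAT[1] = 1
    return IAT, JAT, VAT
end

# A SparseMatrixCSC stores the CSR arrays of its transpose, so feeding
# (colptr, rowval, nzval) through the routine yields the CSC transpose.
function csc_transpose_2(A::SparseMatrixCSC)
    m, n = size(A)
    IAT, JAT, VAT = realize_csr_transpose_2(A.colptr, A.rowval, A.nzval, n, m)
    return SparseMatrixCSC(n, m, IAT, JAT, VAT)
end

A = sprandn(200, 150, 0.05)
@assert csc_transpose_2(A) == sparse(transpose(A))
@assert typeof(csc_transpose_2(A)) === typeof(A)
```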
Sorry for the confusion. I meant the internal methods that provide the functionality of GrB_Matrix_build, which converts COO format (lists of row indices, column indices, and values, in any order) into CSR, CSC, hyper-sparse, etc. The user guide says nothing about the bucket vs. builder distinction, so it wouldn’t appear there.
The methods in GrB_Matrix_build use a parallel merge sort of the COO tuples, taking O(e log e) time if there are e entries in the matrix. The bucket sort methods take O(e + n) time, but are harder to parallelize.
I use a heuristic to select the appropriate method, and also another one to select the # of threads to use. The # of threads can be controlled (as an upper bound) by the user application with GrB_set.
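For intuition on the builder approach, Julia’s sparse(I, J, V, m, n) plays an analogous role to GrB_Matrix_build: it assembles unordered COO triples into a CSC matrix, so swapping the two index vectors rebuilds the transpose (a sketch of the idea only, not SuiteSparse’s actual code):

```julia
using SparseArrays

# COO triples in arbitrary order; a "builder" assembles them into CSC.
I = [4, 1, 3, 1, 2]
J = [6, 5, 3, 1, 2]
V = [60.0, 50.0, 30.0, 10.0, 20.0]

A  = sparse(I, J, V, 4, 6)   # COO -> CSC build
At = sparse(J, I, V, 6, 4)   # swap I and J: builds the transpose
@assert At == sparse(transpose(A))
```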
GraphBLAS is faster than SparseArrays.jl, and this might vary dramatically by matrix.
Ok, now I am confused. I believe you pointed to Gustavson as (one of) the methods used. How are the methods chosen? If it isn’t Gustavson’s, how does GB get so fast in the example you presented? It was faster by a factor of four! Can you point to the method used there?