What is the fastest way of realizing a sparse matrix transpose? I don’t mean a lazy transpose; I mean a realization such that typeof(CSC') === typeof(CSC) (and analogously for CSR).
There seems to be the algorithm of Gustavson 1978, but this doesn’t seem to have been coded in Julia yet, AFAICT.
Would any of the sparse-algebra experts know?
Can you not just do this:
julia> a=sprandn(100,100,1/100);
julia> typeof(sparse(a')) === typeof(a)
true
I would look in: GraphBLAS/Source/transpose at stable · DrTimothyAldenDavis/GraphBLAS · GitHub. Be aware that that code is Apache-licensed, but the methods in there are almost certainly the fastest for general CSC/CSR matrices.
The code is a little complicated but not terrible; if you need help, I’m fairly familiar with it. I believe one of those codes is the method from Gustavson, but it’s been a while since I looked into it.
Regardless, there are a merge-sort method and a bucket-sort method, with some variants.
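For reference, the counting (“bucket sort”) variant for CSC is short enough to sketch in plain Julia. This is a hedged illustration in the spirit of Gustavson’s algorithm, not the actual GraphBLAS or SparseArrays implementation; the name csc_transpose is my own, and it uses only the public rowvals/nonzeros/nzrange accessors:

```julia
using SparseArrays

# Counting ("bucket sort") transpose of a CSC matrix: two passes over
# the nonzeros, no sorting needed, O(m + n + nnz) time.
function csc_transpose(A::SparseMatrixCSC{Tv,Ti}) where {Tv,Ti}
    m, n = size(A)
    colptr = zeros(Ti, m + 1)          # column pointers of A'
    rowval = Vector{Ti}(undef, nnz(A))
    nzval  = Vector{Tv}(undef, nnz(A))

    # Pass 1: count nonzeros in each row of A (= each column of A').
    for i in rowvals(A)
        colptr[i + 1] += 1
    end
    colptr[1] = 1
    for i in 1:m                       # prefix sum -> column starts of A'
        colptr[i + 1] += colptr[i]
    end

    # Pass 2: scatter each entry of column j into row-bucket i of A'.
    next = colptr[1:m]                 # next free slot per column of A'
    rows, vals = rowvals(A), nonzeros(A)
    for j in 1:n
        for p in nzrange(A, j)
            i = rows[p]
            q = next[i]
            rowval[q] = j
            nzval[q]  = vals[p]
            next[i]  += 1
        end
    end
    return SparseMatrixCSC(n, m, colptr, rowval, nzval)
end
```

Because the outer loop visits columns of A in order, each output column comes out with sorted row indices, so the result should agree with sparse(A') for any A.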
That code is wrapped in SuiteSparseGraphBLAS.jl if you want to try it out before translating it to Julia. I haven’t released updates to it in a while, but I intend to do a new release soon and can answer any questions about it.
Is SparseArrays.ftranspose! not fast enough?
julia> using SparseArrays
julia> A = sparse([1, 2, 3, 1, 4], [1, 2, 3, 5, 6], [10.0, 20.0, 30.0, 50.0, 60.0], 4, 6)
4×6 SparseMatrixCSC{Float64, Int64} with 5 stored entries:
10.0 ⋅ ⋅ ⋅ 50.0 ⋅
⋅ 20.0 ⋅ ⋅ ⋅ ⋅
⋅ ⋅ 30.0 ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅ ⋅ 60.0
julia> X = spzeros(Float64, 6, 4)
6×4 SparseMatrixCSC{Float64, Int64} with 0 stored entries:
⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅
julia> SparseArrays.ftranspose!(X, A, identity)
6×4 SparseMatrixCSC{Float64, Int64} with 5 stored entries:
10.0 ⋅ ⋅ ⋅
⋅ 20.0 ⋅ ⋅
⋅ ⋅ 30.0 ⋅
⋅ ⋅ ⋅ ⋅
50.0 ⋅ ⋅ ⋅
⋅ ⋅ ⋅ 60.0
(Note that this is what copy(transpose(A)) calls, if you don’t care about pre-allocated output.)
This is, in fact, implemented in the Julia SparseArrays stdlib, and it’s what ftranspose! (and hence copy(transpose(A))) uses, via the (internal/non-public) SparseArrays.halfperm! function, whose docstring says:
This method implements the HALFPERM algorithm described in F. Gustavson, “Two fast algorithms for sparse matrices: multiplication and permuted transposition,” ACM TOMS 4(3), 250-269 (1978). The algorithm runs in O(size(A, 1), size(A, 2), nnz(A)) time and requires no space beyond that passed in.
It seems to be about the same speed as the SparseArrays code for a random sparse matrix:
julia> using SparseArrays, SuiteSparseGraphBLAS, BenchmarkTools
julia> A = sprand(10^4, 10^4, 1e-3);
julia> G = GBMatrix(A); # GraphBLAS sparse CSC copy
julia> gbtranspose(G) == copy(transpose(A)) # check
true
julia> @btime copy(transpose($A)); # native Julia transpose
210.833 μs (12 allocations: 1.66 MiB)
julia> @btime gbtranspose($G); # GraphBLAS transpose
200.708 μs (15 allocations: 1.75 MiB)
(5% difference.)
If I try a more structured matrix, e.g. a tridiagonal matrix, then GraphBLAS is about 30% faster:
julia> A = spdiagm(0 => ones(10^4), 1 => fill(2,10^4-1), -1 => fill(3,10^4-1));
julia> G = GBMatrix(A);
julia> gbtranspose(G) == copy(transpose(A))
true
julia> @btime copy(transpose($A));
49.125 μs (12 allocations: 608.30 KiB)
julia> @btime gbtranspose($G);
38.042 μs (15 allocations: 704.45 KiB)
So it’s a measurable speedup, but not overwhelming.
That’s a somewhat small matrix with random sparsity, both of which are somewhat limited ways to benchmark (but still valuable). I would expect a more dramatic difference with matrices that are big enough to use threads in GraphBLAS.jl (threading is unavailable in SparseArrays.jl). If your matrices aren’t big enough for that to matter, then indeed I don’t think there are any methods appreciably faster than halfperm.
In the end, I always recommend that users benchmark with their own data; there is rarely a uniformly better algorithm for sparse operations.
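For anyone wanting to do that, here is a minimal harness along the lines of the benchmarks above, using the same calls (GBMatrix, gbtranspose) shown earlier in this thread; the sprand line is just a stand-in for your own matrix:

```julia
using SparseArrays, SuiteSparseGraphBLAS, BenchmarkTools

A = sprand(10^5, 10^5, 1e-4)   # stand-in: substitute your own matrix here
G = GBMatrix(A)                # GraphBLAS copy of the same matrix

@assert gbtranspose(G) == copy(transpose(A))   # sanity check: results agree

@btime copy(transpose($A));    # SparseArrays (halfperm!-based)
@btime gbtranspose($G);        # SuiteSparse:GraphBLAS
```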
I’m not seeing a speedup from threads for a much larger (random) matrix:
julia> A = sprand(10^8, 10^8, 1e-8); G = GBMatrix(A);
julia> @btime copy(transpose($A));
2.362 s (12 allocations: 2.24 GiB)
julia> @btime gbtranspose($G, desc=Descriptor(; nthreads=1));
2.001 s (15 allocations: 2.98 GiB)
julia> @btime gbtranspose($G, desc=Descriptor(; nthreads=2));
2.454 s (16 allocations: 3.73 GiB)
(This is on a fairly lightweight Apple M4 with 4 performance cores; YMMV.)