What is the fastest way of realizing a sparse matrix transpose? I don’t mean a lazy transpose; I mean a realization such that typeof(CSC') === typeof(CSC) (and analogously for CSR).
There seems to be the algorithm of Gustavson 1978, but this doesn’t seem to have been coded in Julia yet, AFAICT.
Would any of the sparse-algebra experts know?
Can you not just do this:
julia> a=sprandn(100,100,1/100);
julia> typeof(sparse(a')) === typeof(a)
true
I would look in: GraphBLAS/Source/transpose at stable · DrTimothyAldenDavis/GraphBLAS · GitHub. Be aware that the code is Apache-licensed. But the methods in there are almost certainly the fastest for general CSC/CSR matrices.
The code is a little complicated but not terrible; if you need help, I’m fairly familiar with it. I believe one of those codes is the method from Gustavson, but it’s been a while since I looked into it.
Regardless, there is a merge-sort method and a bucket-sort method, with some variants.
That code is wrapped in SuiteSparseGraphBLAS.jl if you want to try it out before translating it to Julia. I haven’t released updates to it in a while but intend to do a new release soon and can answer any questions on it.
is SparseArrays.ftranspose! not fast enough?
julia> using SparseArrays
julia> A = sparse([1, 2, 3, 1, 4], [1, 2, 3, 5, 6], [10.0, 20.0, 30.0, 50.0, 60.0], 4, 6)
4×6 SparseMatrixCSC{Float64, Int64} with 5 stored entries:
10.0 ⋅ ⋅ ⋅ 50.0 ⋅
⋅ 20.0 ⋅ ⋅ ⋅ ⋅
⋅ ⋅ 30.0 ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅ ⋅ 60.0
julia> X = spzeros(Float64, 6, 4)
6×4 SparseMatrixCSC{Float64, Int64} with 0 stored entries:
⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅
julia> SparseArrays.ftranspose!(X, A, identity)
6×4 SparseMatrixCSC{Float64, Int64} with 5 stored entries:
10.0 ⋅ ⋅ ⋅
⋅ 20.0 ⋅ ⋅
⋅ ⋅ 30.0 ⋅
⋅ ⋅ ⋅ ⋅
50.0 ⋅ ⋅ ⋅
⋅ ⋅ ⋅ 60.0
(Note that this is what copy(transpose(A)) calls, if you don’t care about pre-allocated output.)
This is, in fact, implemented in the Julia SparseArrays stdlib, and it’s what ftranspose! (and hence copy(transpose(A))) uses, via the (internal/non-public) SparseArrays.halfperm! function, whose docstring says:
This method implements the HALFPERM algorithm described in F. Gustavson, “Two fast algorithms for sparse matrices: multiplication and permuted transposition,” ACM TOMS 4(3), 250-269 (1978). The algorithm runs in O(size(A, 1) + size(A, 2) + nnz(A)) time and requires no space beyond that passed in.
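So for the original question, the eager transpose with matching type is already one line in the stdlib; a quick check:

```julia
using SparseArrays

A = sprandn(100, 100, 0.01)       # random sparse CSC matrix
At = copy(transpose(A))           # eager transpose (goes through halfperm!)
@assert typeof(At) === typeof(A)  # same concrete SparseMatrixCSC type
@assert At == transpose(A)        # same values as the lazy transpose
```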
It seems to be about the same speed as the SparseArrays code for a random sparse matrix:
julia> using SparseArrays, SuiteSparseGraphBLAS, BenchmarkTools
julia> A = sprand(10^4, 10^4, 1e-3);
julia> G = GBMatrix(A); # GraphBLAS sparse CSC copy
julia> gbtranspose(G) == copy(transpose(A)) # check
true
julia> @btime copy(transpose($A)); # native Julia transpose
210.833 μs (12 allocations: 1.66 MiB)
julia> @btime gbtranspose($G); # GraphBLAS transpose
200.708 μs (15 allocations: 1.75 MiB)
(5% difference.)
If I try a more structured matrix, e.g. a tridiagonal matrix, then GraphBLAS is about 30% faster:
julia> A = spdiagm(0 => ones(10^4), 1 => fill(2,10^4-1), -1 => fill(3,10^4-1));
julia> G = GBMatrix(A);
julia> gbtranspose(G) == copy(transpose(A))
true
julia> @btime copy(transpose($A));
49.125 μs (12 allocations: 608.30 KiB)
julia> @btime gbtranspose($G);
38.042 μs (15 allocations: 704.45 KiB)
So it’s a measurable speedup, but not overwhelming.
That’s a somewhat small matrix with random sparsity, both of which are somewhat limited ways to benchmark (but still valuable). I would expect a more dramatic difference with matrices big enough to end up using threads in GraphBLAS.jl (which is unavailable in SparseArrays.jl). If your matrices aren’t big enough to make that valuable, then indeed I don’t think there are any methods appreciably faster than halfperm.
In the end I always recommend users benchmark with their own data, there is rarely a uniformly better algorithm for sparse operations.
I’m not seeing a speedup from threads for a much larger (random) matrix:
julia> A = sprand(10^8, 10^8, 1e-8); G = GBMatrix(A);
julia> @btime copy(transpose($A));
2.362 s (12 allocations: 2.24 GiB)
julia> @btime gbtranspose($G, desc=Descriptor(; nthreads=1));
2.001 s (15 allocations: 2.98 GiB)
julia> @btime gbtranspose($G, desc=Descriptor(; nthreads=2));
2.454 s (16 allocations: 3.73 GiB)
(this is on a fairly lightweight Apple M4 with 4 performance cores, YMMV).
For various matrices taken from SuiteSparse I get numbers like:
julia> @btime gbtranspose($A); # 4 threads
652.144 ms (11 allocations: 2.50 GiB)
julia> gbset(:nthreads, 1); @btime gbtranspose($A);
301.699 ms (11 allocations: 2.50 GiB)
julia> gbset(:nthreads, 2); @btime gbtranspose($A);
956.217 ms (12 allocations: 2.64 GiB)
julia> @btime copy(transpose($B));
1.286 s (12 allocations: 2.64 GiB)
Indeed it looks like at least on M-series chips that threading for the transpose has limited use-cases. But there are real-world cases where gbtranspose comes out ~4x faster (at one thread ;)). I suspect transpose is fairly hard to tune and explicit transposes aren’t super necessary in things like mxm as we can typically use them without the intermediate transpose result.
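The point about not needing the intermediate transpose holds on the Julia side too: transpose(A) is a lazy wrapper, and multiplying by it dispatches to a kernel that walks A’s original CSC storage, so the explicit transpose can often be skipped entirely. A small sketch:

```julia
using SparseArrays

A = sprand(1000, 1000, 1e-2)
x = rand(1000)

# transpose(A) is lazy; this multiply uses A's CSC storage directly,
# with no materialized transpose.
y1 = transpose(A) * x
# Same result via the explicitly realized transpose.
y2 = copy(transpose(A)) * x
@assert y1 ≈ y2
```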
@DrTimothyAldenDavis might have some commentary (note Tim I confirmed most of these are running bucket in either atomic or non-atomic mode)
The GraphBLAS implementation is threaded by default. But it seems that, below a certain size at least, the threading is actually harmful to performance due to contention or other overheads. At one thread, though, the SuiteSparse code is still much faster than the Julia code.
The “bucket transpose” method I have is parallel, but it’s difficult to get good parallel performance out of it. It’s not supposed to slow down, though; I try to limit the # of threads with a heuristic so that it doesn’t. I see that for this matrix my heuristic is giving it too many threads. Which matrix is it? Which CPU is it on?
My builder-based transpose is much easier to parallelize so perhaps my heuristic should pick that method.
On the GPU, I’m seeing up to 12x performance gain from a parallel CPU transpose to the GPU transpose, by the way. That’s on a Volta V100.
Wait, why is there a huge difference between 1-thread GB and copy + transpose? Above, we saw numerous examples that illustrated a lack of difference between the two…
I could. I do. But the question was: is that the fastest performance one could get?
You are a master of understatement.
Thanks for looking into this!
There is no mention of “builder” in the user guide. Where does it live?
Does BLAS.set_num_threads(n) control the number of threads in GB?
Here is the complete algorithm of Gustavson 1978 (well, I wrote it from scratch). It is remarkable how concise Julia can be. 
function realize_csr_transpose_2(IA, JA, VA, m, n)
    IAT = zeros(eltype(IA), n + 1)
    for k in eachindex(JA)
        IAT[JA[k]+1] += 1        # count entries destined for each output row
    end
    ps = view(IAT, 2:length(IAT))
    sum = 1
    for i in 1:n                 # exclusive prefix sum: start position of each output row
        c = ps[i]
        ps[i] = sum
        sum += c
    end
    JAT = similar(JA)
    VAT = similar(VA)
    for k in 1:m                 # scatter pass over the input rows
        for j in IA[k]:IA[k+1]-1
            ti = JA[j]
            p = ps[ti]
            JAT[p] = k
            VAT[p] = VA[j]
            ps[ti] += 1          # advance cursor; ps ends up as IAT[2:end]
        end
    end
    IAT[1] = 1
    return IAT, JAT, VAT
end
function csr_transpose_2(A::SparseMatrixCSR)
    IAT, JAT, VAT = realize_csr_transpose_2(A.rowptr, A.colval, A.nzval, A.m, A.n)
    return SparseMatrixCSR{1}(A.n, A.m, IAT, JAT, VAT)
end
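A note on reusing this for CSC (my addition, not from the thread): a SparseMatrixCSC’s (colptr, rowval, nzval) arrays are exactly the CSR arrays of its transpose, so the same routine realizes a CSC-to-CSC transpose with no extra code. A self-contained sketch (the function is repeated with comments so the snippet runs on its own; csc_transpose_2 is a name I made up):

```julia
using SparseArrays

# Counting/scatter transpose of the CSR (or, symmetrically, CSC) arrays
# of an m×n matrix; returns the transposed arrays.
function realize_csr_transpose_2(IA, JA, VA, m, n)
    IAT = zeros(eltype(IA), n + 1)
    for k in eachindex(JA)
        IAT[JA[k]+1] += 1        # count entries per output row
    end
    ps = view(IAT, 2:length(IAT))
    sum = 1
    for i in 1:n                 # exclusive prefix sum -> start positions
        c = ps[i]
        ps[i] = sum
        sum += c
    end
    JAT = similar(JA)
    VAT = similar(VA)
    for k in 1:m                 # scatter pass
        for j in IA[k]:IA[k+1]-1
            ti = JA[j]
            p = ps[ti]
            JAT[p] = k
            VAT[p] = VA[j]
            ps[ti] += 1          # advance insertion cursor
        end
    end
    IAT[1] = 1
    return IAT, JAT, VAT
end

# A SparseMatrixCSC stores the CSR arrays of its transpose, so feeding
# (colptr, rowval, nzval) through the routine yields the CSC transpose.
function csc_transpose_2(A::SparseMatrixCSC)
    m, n = size(A)
    IAT, JAT, VAT = realize_csr_transpose_2(A.colptr, A.rowval, A.nzval, n, m)
    return SparseMatrixCSC(n, m, IAT, JAT, VAT)
end

A = sprandn(200, 150, 0.05)
@assert csc_transpose_2(A) == sparse(transpose(A))
@assert typeof(csc_transpose_2(A)) === typeof(A)
```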
Sorry for the confusion. I meant the internal methods that provide the functionality of GrB_Matrix_build, which converts COO format (lists of row indices, column indices, and values, in any order) into CSR, CSC, hyper-sparse, etc. The user guide says nothing about the bucket vs. builder distinction, so it wouldn’t appear there.
The methods in GrB_Matrix_build use a parallel merge sort of the COO tuples, taking O(e log e) time if there are e entries in the matrix. The bucket sort methods take O(e + n) time, but are harder to parallelize.
I use a heuristic to select the appropriate method, and also another one to select the # of threads to use. The # of threads can be controlled (as an upper bound) by the user application with GrB_set.
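For intuition on the builder approach, Julia’s sparse(I, J, V, m, n) plays an analogous role to GrB_Matrix_build: it assembles unordered COO triples into a CSC matrix, so swapping the two index vectors rebuilds the transpose (a sketch of the idea only, not SuiteSparse’s actual code):

```julia
using SparseArrays

# COO triples in arbitrary order; a "builder" assembles them into CSC.
I = [4, 1, 3, 1, 2]
J = [6, 5, 3, 1, 2]
V = [60.0, 50.0, 30.0, 10.0, 20.0]

A  = sparse(I, J, V, 4, 6)   # COO -> CSC build
At = sparse(J, I, V, 6, 4)   # swap I and J: builds the transpose
@assert At == sparse(transpose(A))
```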
GraphBLAS is faster than SparseArrays.jl, and this might vary dramatically by matrix.
Ok, now I am confused. I believe you pointed to Gustavson as (one of) the methods used. How are the methods chosen? If it isn’t Gustavson’s, how does GB get so fast in the example you presented? It was faster by a factor of four! Can you point to the method used there?