Sparse matrix multiplication: SparseMatrixCSC can be ~100x slower than Base.SparseArrays.CHOLMOD.Sparse

performance

#1

I want to multiply a sparse matrix and its transpose. The sparse matrix is generated by the code sparse_diff.jl (available at http://modb.oce.ulg.ac.be/mediawiki/upload/Alex/Julia/sparse_diff.jl).

The following sparse matrix multiplication takes about 8.124084 seconds (2.99 M allocations: 312.158 MB, 1.59% gc time).

include("/path/to/sparse_diff.jl")
S = sparse_diff((1000,1000),1);
A = S*S';
@time A = S*S';

Here I am using julia 0.5.0. As a comparison, the equivalent code takes 0.119242 seconds in octave 3.8.1 and 0.122009 seconds in matlab (R2013a).

However, if the type Base.SparseArrays.CHOLMOD.Sparse is used, the sparse matrix multiplication takes only 0.086979 seconds (93 times faster than the plain sparse type).

S2 = Base.SparseArrays.CHOLMOD.Sparse(S);
A2 = S2*S2';
@time A2 = S2*S2';

The output of this snipped is: 0.086979 seconds (24 allocations: 99.061 MB, 1.28% gc time).
Even when the conversion is done as part of the matrix multiplication, it is still significantly faster:

@time A3 = sparse(Base.SparseArrays.CHOLMOD.Sparse(S)*Base.SparseArrays.CHOLMOD.Sparse(S)');
0.275268 seconds (49 allocations: 274.310 MB, 27.35% gc time)

In all cases the results are identical: sparse(A2) - A returns a sparse array with 0 nonzero entry.

Is there a reason for not using Cholmod ssmult per default?


Speed comparison: Sparse matrix multiplication vs usual matrix multiplication
#2

The CSC matrices have sorted storage, so they are faster in matrix-vector multiplication, which is the most frequent use of sparse matrices in many algorithms. Profiling shows that the great majority of the time in your S*S' is taken up by the sorting function which puts the result back into this form. The sorting scheme which is actually used by the CSC matrix code looks suboptimal, but improving it would require significant effort. (Hint to someone with more free time than I have: use a dedicated sortperm! based on RadixSort.)

Many algorithm formulations which appear to need S*S' can be rearranged to work with just S and S' separately. If you can do that, you get the benefits of CSC multiplication without the up-front delay.


#3

Thank you for your insight. Indeed, on my machine, 96% of the time is done for sorting the indices. Every column (or row) has just 3 non-zero values. Maybe the used sorting algorithm is not efficient for short arrays.


#4

I believe the CHOLMOD multiplication does not give sorted order for column indices, which is a must for the SparseMatrixCSC data structure.


#5

As far as I can tell, the index sorting is activated for the CHOLMOD multiplication, at least for the case A*B’ (where A != B):

https://github.com/JuliaLang/julia/blob/master/base/sparse/cholmod.jl#L1274
https://github.com/JuliaLang/julia/blob/master/base/sparse/cholmod.jl#L646

When I make a matrix multiplication for A != B, I get the following times

@benchmark A2 = Base.SparseArrays.CHOLMOD.Sparse(S)*Base.SparseArrays.CHOLMOD.Sparse(2*S)'
BenchmarkTools.Trial: 
  memory estimate:  259.10 mb
  allocs estimate:  39
  --------------
  minimum time:     196.704 ms (1.74% GC)
  median time:      238.645 ms (16.94% GC)
  mean time:        237.304 ms (17.05% GC)
  maximum time:     275.710 ms (28.01% GC)
  --------------
  samples:          22
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

If A = B, the a different cholmod routines (aat) is called, which is only slightly faster:

@benchmark A2 = Base.SparseArrays.CHOLMOD.Sparse(S)*Base.SparseArrays.CHOLMOD.Sparse(S)'
BenchmarkTools.Trial: 
  memory estimate:  220.99 mb
  allocs estimate:  32
  --------------
  minimum time:     177.867 ms (1.62% GC)
  median time:      194.884 ms (1.47% GC)
  mean time:        195.765 ms (1.51% GC)
  maximum time:     235.854 ms (1.35% GC)
  --------------
  samples:          26
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Also note that the full round trip (SparseMatricCSC -> CHOLMOD.Sparse -> SparseMatricCSC) is still 30x faster than doing the matrix multiplication in SparseMatricCSC directly:

@time A3 = sparse(Base.SparseArrays.CHOLMOD.Sparse(S)*Base.SparseArrays.CHOLMOD.Sparse(S)');
0.275268 seconds (49 allocations: 274.310 MB, 27.35% gc time)

Scratching a little bit deeper, CHOLMOD sorts the matrix product using the double transpose:

And in fact, this option (double transpose) is also implemented inside julia:

If I adapt Julia’s spmatmul function (which I called myspmatmul in the following) to force the double transpose, then I get a speed-up of ~25x:

julia> @time A = S*S'; @time A2 = myspmatmul(S,S');
  8.137340 seconds (2.99 M allocations: 312.158 MB, 2.98% gc time)
"now doing double transpose" = "now doing double transpose"
  0.308712 seconds (44 allocations: 175.272 MB, 35.40% gc time)

In essence, I just replaced the line (in spmatmul):

    C = SparseArrays.sortSparseMatrixCSC!(Cunsorted, sortindices=sortindices) # sortindices defaults to `:sortcols`

by this

    C = SparseArrays.sortSparseMatrixCSC!(Cunsorted, sortindices=:doubletranspose)

The speed-up depends actually on the size of the matrices. For small matrices (smaller than 170x170 in my case), the current default method (:sortcols) is faster. But beyond 170x170, the double transpose method is more efficient. As far as I can tell, CHODMOD uses only the double transpose which is a quite reasonable approach imho, since sparse matrices tends to be quite large for several applications.