The following sparse matrix multiplication takes 8.124084 seconds (2.99 M allocations: 312.158 MB, 1.59% gc time):

include("/path/to/sparse_diff.jl")
S = sparse_diff((1000,1000),1);
A = S*S';
@time A = S*S';

Here I am using Julia 0.5.0. For comparison, the equivalent code takes 0.119242 seconds in Octave 3.8.1 and 0.122009 seconds in MATLAB (R2013a).

However, if the type Base.SparseArrays.CHOLMOD.Sparse is used instead, the sparse matrix multiplication takes only 0.086979 seconds, about 93 times faster than with the plain SparseMatrixCSC type.
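For reference, the timed snippet looks roughly like this (a sketch against Julia 0.5, where the CHOLMOD wrapper lives under Base.SparseArrays.CHOLMOD; S is the matrix built above):

```julia
# Wrap the CSC matrix in CHOLMOD's own sparse type; CHOLMOD's multiply
# routine does not re-sort the result's row indices afterwards.
Sc = Base.SparseArrays.CHOLMOD.Sparse(S)
A2 = Sc * Sc'        # first call triggers compilation
@time A2 = Sc * Sc'
```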

The output of this snippet is: 0.086979 seconds (24 allocations: 99.061 MB, 1.28% gc time).
Even when the conversion is done as part of the matrix multiplication, it is still significantly faster:
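For instance (again a sketch on Julia 0.5, with both conversions inside the timed expression):

```julia
# The CSC -> CHOLMOD.Sparse conversions are counted in the timing,
# and the total is still far below the 8 s of the plain CSC product.
@time A2 = Base.SparseArrays.CHOLMOD.Sparse(S) * Base.SparseArrays.CHOLMOD.Sparse(S)'
```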

The CSC matrices have sorted storage, so they are faster in matrix-vector multiplication, which is the most frequent use of sparse matrices in many algorithms. Profiling shows that the great majority of the time in your S*S' is taken up by the sorting function which puts the result back into this form. The sorting scheme which is actually used by the CSC matrix code looks suboptimal, but improving it would require significant effort. (Hint to someone with more free time than I have: use a dedicated sortperm! based on RadixSort.)

Many algorithm formulations which appear to need S*S' can be rearranged to work with just S and S' separately. If you can do that, you get the benefits of CSC multiplication without the up-front delay.
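As a sketch of that rearrangement (hypothetical example, with a random matrix standing in for S): when an algorithm only ever needs the action of S*S' on a vector, the two factors can be applied one after the other.

```julia
# (On Julia >= 0.7 this additionally needs `using SparseArrays`.)
S = sprand(1000, 1000, 3/1000)   # stand-in: roughly 3 nonzeros per column
x = rand(1000)

# Forming the product materializes (and sorts) a new sparse matrix:
y1 = (S * S') * x

# Rearranged: two cheap CSC mat-vecs, no product matrix is ever built.
y2 = S * (S' * x)

maximum(abs.(y1 - y2))   # agrees up to floating-point rounding
```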

Thank you for your insight. Indeed, on my machine, 96% of the time is spent sorting the indices. Every column (or row) has just 3 non-zero values. Perhaps the sorting algorithm used is not efficient for such short arrays.

When I perform a matrix multiplication with A != B, I get the following timings:

@benchmark A2 = Base.SparseArrays.CHOLMOD.Sparse(S)*Base.SparseArrays.CHOLMOD.Sparse(2*S)'
BenchmarkTools.Trial:
memory estimate: 259.10 mb
allocs estimate: 39
--------------
minimum time: 196.704 ms (1.74% GC)
median time: 238.645 ms (16.94% GC)
mean time: 237.304 ms (17.05% GC)
maximum time: 275.710 ms (28.01% GC)
--------------
samples: 22
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%

If A = B, a different CHOLMOD routine (aat) is called, which is only slightly faster:

@benchmark A2 = Base.SparseArrays.CHOLMOD.Sparse(S)*Base.SparseArrays.CHOLMOD.Sparse(S)'
BenchmarkTools.Trial:
memory estimate: 220.99 mb
allocs estimate: 32
--------------
minimum time: 177.867 ms (1.62% GC)
median time: 194.884 ms (1.47% GC)
mean time: 195.765 ms (1.51% GC)
maximum time: 235.854 ms (1.35% GC)
--------------
samples: 26
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%

Also note that the full round trip (SparseMatrixCSC -> CHOLMOD.Sparse -> SparseMatrixCSC) is still 30x faster than doing the matrix multiplication in SparseMatrixCSC directly:
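Something like this (a sketch; in Julia 0.5, sparse(...) converts a CHOLMOD.Sparse back into a SparseMatrixCSC):

```julia
# Full round trip: convert to CHOLMOD, multiply there, convert back,
# all inside the timing.
@time A2 = sparse(Base.SparseArrays.CHOLMOD.Sparse(S) *
                  Base.SparseArrays.CHOLMOD.Sparse(S)')
```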

In essence, I just replaced the line (in spmatmul):

C = SparseArrays.sortSparseMatrixCSC!(Cunsorted, sortindices=sortindices) # sortindices defaults to `:sortcols`

by this

C = SparseArrays.sortSparseMatrixCSC!(Cunsorted, sortindices=:doubletranspose)

The speed-up actually depends on the size of the matrices. For small matrices (smaller than 170x170 in my case), the current default method (:sortcols) is faster, but beyond 170x170 the double-transpose method is more efficient. As far as I can tell, CHOLMOD only uses the double transpose, which is a quite reasonable approach imho, since sparse matrices tend to be quite large in many applications.
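The trick itself is simple: transposing a CSC matrix is a counting/bucketing pass that emits sorted row indices, so transposing twice canonicalizes the matrix in O(nnz + n) without any comparison sort. A minimal illustration with a hand-built unsorted CSC matrix (`.'` is the Julia 0.5 spelling of the materialized transpose; on current versions it is `copy(transpose(C))` and needs `using SparseArrays`):

```julia
# A 3x2 CSC matrix whose first column stores its rows out of order.
colptr = [1, 3, 4]
rowval = [3, 1, 2]          # column 1 holds rows (3, 1) -- unsorted
nzval  = [30.0, 10.0, 20.0]
C = SparseMatrixCSC(3, 2, colptr, rowval, nzval)

# Each transpose is a bucketing pass over the nonzeros, so two of
# them restore canonical (sorted) CSC storage.
Csorted = (C.').'
issorted(Csorted.rowval[1:2])   # row indices of column 1 are now sorted
```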