CSC matrices have sorted storage, so they are fast at matrix-vector multiplication, which is the most common use of sparse matrices in many algorithms. Profiling shows that the great majority of the time in your `S*S'` is taken up by the sorting function that puts the result back into this form. The sorting scheme actually used by the CSC matrix code looks suboptimal, but improving it would require significant effort. (Hint to someone with more free time than I have: use a dedicated `sortperm!` based on RadixSort.)
Many algorithm formulations which appear to need S*S' can be rearranged to work with just S and S' separately. If you can do that, you get the benefits of CSC multiplication without the up-front delay.
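For example, if a downstream computation only ever needs products of the form `(S*S')*x`, the Gram matrix never has to be formed explicitly. A minimal sketch of the rearrangement (the matrix sizes and density here are illustrative, not from the thread):

```julia
using SparseArrays

# Sparse matrix with roughly 3 nonzeros per column, as in the thread.
S = sprand(1000, 1000, 0.003)
x = rand(1000)

# Instead of forming G = S * S' (which triggers the expensive sorting
# pass) and then computing G * x, apply the two factors separately:
y = S * (S' * x)

# Same result as the explicit product, up to floating-point roundoff:
G = S * S'
@assert y ≈ G * x
```

The rearranged form costs two sparse matrix-vector products per application instead of one, but avoids the up-front `S*S'` construction entirely, which usually wins when the product is applied only a few times.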
Thank you for your insight. Indeed, on my machine, 96% of the time is spent sorting the indices. Every column (or row) has just 3 non-zero values. Perhaps the sorting algorithm used is not efficient for such short arrays.
In essence, I just replaced the line (in `spmatmul`):

```julia
C = SparseArrays.sortSparseMatrixCSC!(Cunsorted, sortindices=sortindices) # sortindices defaults to `:sortcols`
```

with:

```julia
C = SparseArrays.sortSparseMatrixCSC!(Cunsorted, sortindices=:doubletranspose)
```
The speed-up actually depends on the size of the matrices. For small matrices (smaller than 170x170 in my case), the current default method (`:sortcols`) is faster. But beyond 170x170, the double-transpose method is more efficient. As far as I can tell, CHOLMOD uses only the double transpose, which is a quite reasonable approach imho, since sparse matrices tend to be quite large in many applications.
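For anyone curious why the double transpose sorts at all: materializing the transpose of a CSC matrix is a counting pass that emits entries column by column, so its output always has sorted row indices, and doing it twice returns the original matrix, now sorted. A sketch using only the public SparseArrays API (the internal `sortSparseMatrixCSC!` above is version-dependent); the hand-built matrix is illustrative:

```julia
using SparseArrays

# Build a small CSC matrix whose row indices are deliberately unsorted
# within a column (the kind of output spmatmul produces internally).
# Note: this constructor does not verify that row indices are sorted.
colptr = [1, 3, 4]
rowval = [3, 1, 2]          # column 1 lists rows (3, 1): out of order
nzval  = [30.0, 10.0, 20.0]
A = SparseMatrixCSC(3, 2, colptr, rowval, nzval)

# Double transpose: each materialized transpose is a counting pass that
# emits sorted row indices, so two of them restore the original matrix
# with every column sorted.
B = copy(transpose(copy(transpose(A))))

@assert rowvals(B) == [1, 3, 2]   # column 1 is now sorted
```

No comparison-based sort is involved, which is why the cost is linear in the number of nonzeros and why it pulls ahead of `:sortcols` on larger matrices.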