Speed comparison: Sparse matrix multiplication vs usual matrix multiplication

question

#1

Why is sparse * dense slower than dense * dense?

x = sprand(1000,1000,.1);
fx = full(x);
y = rand(1000,1000);

@time x*y;
  0.925036 seconds (31.94 k allocations: 9.071 MB)

@time x*y;
  0.885278 seconds (7 allocations: 7.630 MB)

@time fx*y;
  0.548416 seconds (398.70 k allocations: 21.551 MB, 11.83% gc time)

@time fx*y;
  0.086072 seconds (7 allocations: 7.630 MB, 59.55% gc time)

Reading #1831, I also tried a different sparse type

xa = Base.SparseArrays.CHOLMOD.Sparse(x);

@time xa*y;
  0.147444 seconds (51.25 k allocations: 17.378 MB)

@time xa*y;
  0.097434 seconds (5.47 k allocations: 15.518 MB, 5.12% gc time)

which is still not faster than dense * dense above (only for first call) but has vastly more allocations.

So, why is sparse * dense slower than dense * dense? Although I would also like to see an explanation, my primary interest is a convient way to exploit O(N^2) scaling of sparse multiplications over O(N^3) scaling for dense matrix multiplications.

To me this seems to be a highly important issue for many applications.


#2

Because your sparse matrix isn’t very sparse at all.


#3

Ok, I thought that 90% zeros is sparse. It seems like one needs ~99% zeros. Thank you for your answer.


#4

here is a comparision with octave:

>> x  = sprand(1000,1000,0.1); fx = full(x); y = rand(1000,1000);
>> tic; x*y; toc()
Elapsed time is 0.448854 seconds.
>> tic; fx*y; toc()
Elapsed time is 0.077558 seconds.

while julia on my system (only the second invocation):

@time x*y;
  0.965171 seconds (7 allocations: 7.630 MB, 7.46% gc time)
@time fx*y;
  0.079769 seconds (7 allocations: 7.630 MB, 7.14% gc time)

So for full matrix the speed of julia is equivalent with octave, for sparse matrices octave is about 2x faster. For Matlab,
both products are about the same speed:

>> tic; x*y; toc()
Elapsed time is 0.092190 seconds.
>> tic; fx*y; toc()
Elapsed time is 0.081640 seconds.

Maybe, matlab has some heuristics which turn the matrix into a full matrix before multiplication. In my opinion, there is room for improvement for julia.


#5

Never benchmark in global scope ; see the performance tips in the manual.


#6

It is a good advice in general but it shouldn’t matter for this case since the time is large enough and only spent inside another function.


#7

Indeed, the result are quite similar with a local scope (and also with the @benchmark macro).

julia> function myfun()
       x = sprand(1000,1000,.1);
       fx = full(x);
       y = rand(1000,1000);
       @time x*y;
       @time fx*y;
       return nothing
       end

julia> myfun()
  0.897603 seconds (3 allocations: 7.630 MB)
  0.076642 seconds (3 allocations: 7.630 MB)

julia> myfun()
  0.894705 seconds (3 allocations: 7.630 MB)
  0.075856 seconds (3 allocations: 7.630 MB, 1.32% gc time)

#8

Dense * dense products are handled by BLAS which (a) rearranges the computation to work on blocks so most memory accesses are in local cache, (b) works on sequential elements of memory so SIMD instructions are available, © has simple indexing structure so other run-time processor optimizations like prefetch work, and (d) is done by multi-threaded libraries on most Julia installations. Multi-threading would help many sparse * dense products, but is not yet standard in Julia itself. The other tools would only apply to sparse matrices with special structure.