Sparse solve vs BandedMatrix; time and allocation surprise

I’m trying to understand why the SparseSuite \ is so much worse than what I get with BandedMatrices.
My test case, which reflects the results of my application, uses two matrices with the same elements, but stored differently.

Running v 1.5 on a Mac.

n=20000;
D=rand(n,);
D1=rand(n-1,);
Dm1=rand(n-1,);
D2=rand(n-2,);
Dm2=rand(n-2,);
d0=Pair(0,D); d1=Pair(1,D1); d2=Pair(2,D2);
dm1=Pair(-1,Dm1); dm2=Pair(-2,Dm2);
FVS=spdiagm(dm2,dm1,d0,d1,d2);
FVB=BandedMatrix(dm2,dm1,d0,d1,d2);
b=rand(n,);

So I run the solvers FVS\b and FVB\b and I get this

julia> @btime $FVB\$b;
2.226 ms (11 allocations: 1.37 MiB)
julia> @btime $FVS\$b;
22.556 ms (73 allocations: 25.04 MiB)

I figured the generic spase solve would be worse, but not that much worse. Is there anything I shoud be doing that I’ve missed? Is there a direct call to a function inside SparseSuite that could help?

SparseMatrices are inherently slower because they don’t really work well with things like simd, and they also cause branch prediction issues. Both of those are really big deals.

1 Like

Nope, you didn’t miss anything. It’s that much of a difference.

This is a surprise to me. The C code for SuiteSparse was written by the best in the field

http://faculty.cse.tamu.edu/davis/suitesparse.html

and seems to work very will inside Matlab, but my Matlab experience with it was on pretty small problems and I was only comparing it to the previous sparse solvers in Matlab.

It doesn’t matter how good you are if you have no information. If you can specialize on properties of the system, you can do better. SuiteSparse is as good as a completely generic sparse algorithm can get, but completely sparse generic is too generic. Even good sparse algorithms work by trying to add structure.

4 Likes

Fair enough. SuiteSparse has always done what I’ve asked it to do and been fast enough. This is the first time I’ve done a comparison like this however.

It sounds like you think that SuiteSparse is disappointingly slow? Isn’t it rather the case that BandedMatrices is impressively fast?

(Is the glass 1% empty, or 99% full?)

5 Likes

I’ve never compared the LAPACK band solver vs SuiteSparse on the same problem. In hindsight, I should have expected this. Not disappointed in SparseSuite and I knew band solvers are fast. I did not know that the difference would be so large on what seems to be an easy problem for both.

Learn something every day…

1 Like

I tried this in Matlab (same computer and OS). Matlab does some analysis before a solve (looking for spd, bands, … ) and is smart enough to use the band solver and hence is as fast as Julia. The timings (assuming you believe tic and toc) are almost identical.

>> n=20000; em2=rand(n,1); em1=rand(n,1); em0=rand(n,1);
>> e1=rand(n,1); e2=rand(n,1);
>> A=spdiags([em1 em2 em0 e1 e2], -2:2, n,n);
>> b=rand(n,1);
>> tic; c=A\b; toc
Elapsed time is 0.002861 seconds.
>> tic; [l,u]=lu(A); toc
Elapsed time is 0.005762 seconds.

I do not know how to force Matlab to use SuiteSparse. Mathworks hired Tim Davis as a consultant to get it in there, so it would be interesting to see how/if they to better than Julia.

1 Like

Open an issue. The Julia one could probably detect this case and then run a fast path.

Where should I open that issue? Base? SuiteSparse?

There’s some subtlety happening. If you want to do qr! or lu!, you need to allocate a few extra bands. If all you’re doing is backslash, then that extra room needs some management.

Julia Base.

If the matrix is really simple and has a narrow band, of course the banded solver will win. Is there surprise there?

I have tested previously SuiteSparse with both Matlab and Julia on the same matrices. They were of equivalent speed.

Winner = no surprise
Margin of victory = some surprise

I’m moving a lot of my work from Matlab → Julia and finding new things all the time. Most of what I see is well explained if I RTFM. This one was bit different.

2 Likes

I wonder how much it might cost to detect the “bandedness” and reorganize the matrix. In other words, is the detection and reorganization done after a solve is called, or before. If the latter, this cost is not part of the timing.

Not completely sure. I call tic before anything happens and toc afterwards, so if tic/toc works the way I think it does, it times everything.

What I mean is: do you suppose the storage is selected when the matrix is created, or is it changed appropriately after a solve is called?

The bandedness cost would be O(n) (check each col’s greatest and least row element). The conversion is O(nd+nnz) since you need to allocate a zero banded matrix and fill in the non-zero entries.

Good news: to create a banded matrix from a sparse array can be reasonably amortized.
With Julia 1.6, BandedMatrices v0.15.17 :

@btime $FVB=BandedMatrix($FVS)
@btime $FVB\$b;
@btime $FVS\$b;
  1.739 ms (3 allocations: 781.38 KiB)                                            
  4.180 ms (11 allocations: 1.37 MiB)                                             
  48.322 ms (73 allocations: 25.04 MiB)     
2 Likes

While the Julia BandedMatrices.jl package was written by the worst in the field :rofl:

EDIT: \ with BLAS types lowers to LAPACK’s banded LU, whose developers are also pretty good.

5 Likes