Performance of lazy wrappers applied to sparse matrices

For motivation, consider the following suite of benchmarks:

julia> m, n, nz = 10000, 10000, 1000000
(10000, 10000, 1000000)

julia> A = spzeros(m, n)
10000×10000 SparseMatrixCSC{Float64,Int64} with 0 stored entries

julia> @btime A' == A'
  2.415 s (4 allocations: 64 bytes)
true

julia> A = sprand(ComplexF64, m, n, nz / m / n);
julia> @btime A' == A'
  7.932 s (4 allocations: 64 bytes)
true

julia> x1 = A';
julia> x2 = (copy(A)');
julia> @btime x1 == x2
  12.005 s (2 allocations: 32 bytes)
true

julia> @btime x1.parent == x2.parent
  3.141 ms (0 allocations: 0 bytes)
true

Perhaps you share my feeling that only the last figure is acceptable (0.003 seconds vs. 12 seconds).
Since A' === Adjoint(A), there seems to be an issue with Adjoint and ==. Nevertheless, I think the issue affects all of the wrappers that were introduced to improve performance, namely:
Adjoint, Transpose, UpperTriangular, LowerTriangular, Symmetric, Hermitian.
And == is not the only problematic operation.

julia> @btime x = abs.(UpperTriangular(spzeros(m, n))');
  776.993 ms (8 allocations: 763.02 MiB)

is an example of a combination of those wrappers that does not take advantage of the sparsity of the underlying matrix.
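
A workaround that avoids the dense intermediate is to materialize the wrapped matrix back into a `SparseMatrixCSC` before broadcasting. This is only a sketch of the idea: `triu(A)` has the same values as `UpperTriangular(A)` but keeps the CSC representation.

```julia
using SparseArrays, LinearAlgebra

m, n = 10000, 10000
A = spzeros(m, n)

# Slow: broadcasting over the lazy wrapper combination falls back to a
# generic method that visits (and allocates) all m*n entries.
# x = abs.(UpperTriangular(A)')

# Workaround sketch: triu(A) has the same values as UpperTriangular(A)
# but stays a SparseMatrixCSC, and copying a sparse adjoint uses the
# fast specialized transpose, so the broadcast only visits stored entries.
x = abs.(copy(triu(A)'))
```

This trades laziness for an explicit copy, which is usually a good deal whenever the wrapper combination has no fast path.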

My question is: are there efforts underway to improve the performance of lazily wrapped sparse matrices in general, and of combined wrappers in particular? I would like to start a project on this, but first want to see whether somebody else is already working in this area.


This seems like it’s just a missing method. Could you open an issue about this?
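
For `==` in particular, the missing specialization could be as small as the following sketch (illustrative only; defining methods on `Base.:(==)` for types you don't own is type piracy and is only appropriate inside Base or a stdlib):

```julia
using SparseArrays, LinearAlgebra

# Sketch of a missing method: two Adjoint (or Transpose) wrappers are
# equal exactly when their parents are equal, because adjoint is an
# involution that preserves == for the standard number types. The parent
# comparison then dispatches to the fast sparse == when the parents are
# SparseMatrixCSC.
Base.:(==)(A::Adjoint, B::Adjoint) = parent(A) == parent(B)
Base.:(==)(A::Transpose, B::Transpose) = parent(A) == parent(B)
```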


I already opened an issue on this. I would say that the sparse matrix library needs a lot of work.

https://github.com/JuliaLang/julia/issues/28432

There are some other related issues:

https://github.com/JuliaLang/julia/issues/29353

https://github.com/JuliaLang/julia/issues/28948


I fear there is not only one method missing.

I agree. The == example was just a motivation. Actually it is a pain to work with sparse arithmetic; my project stalled because, after a QR decomposition, multiplying R with a dense vector x did not finish in a reasonable time. I made a PR about that issue some time ago (https://github.com/JuliaLang/julia/issues/28451), which I hope will be reviewed soon.

I want to try to consolidate those related issues like yours. A similar effort is this one: https://github.com/JuliaLang/julia/pull/28883


An unfinished effort to provide a relatively general fix can be found in https://github.com/timholy/ArrayIteration.jl. The strategy there is to create iterator types that allow you to synchronize your access of relevant entries in a matrix pair, and thereby write generic algorithms that only visit entries that need it. It’s unknown whether algorithms based on such iterators can be written without any generalization penalty (indeed, that seems a bit unlikely to me), but I expect that it should be possible to provide something that’s at least a pretty good fallback definition for almost any combination of types.

The reason this makes sense is that if there are N specialized matrix types, even binary methods (e.g., ==, *, +) result in N choose 2 method specializations you need to write. Given that N is not small (and you’ve left out Diagonal, Bidiagonal, Tridiagonal, SymTridiagonal, UnitLowerTriangular, UnitUpperTriangular, and packages like BandedMatrices), this means a lot of specializations. In contrast, writing an iterator implementation for each matrix is an O(N) problem rather than an O(N^2) problem, and thus makes the problem far more manageable.
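
To make the O(N)-versus-O(N^2) point concrete, here is a toy version of such a protocol (the name `storedpairs` is invented for this sketch; ArrayIteration.jl's actual abstractions are more general): each array type implements one iterator over its stored entries, and wrappers derive theirs from the parent's.

```julia
using SparseArrays, LinearAlgebra

# Hypothetical protocol: storedpairs(A) iterates (i, j) => value over
# the entries of A that are actually stored (everything else is zero).
storedpairs(A::SparseMatrixCSC) =
    ((rowvals(A)[k], j) => nonzeros(A)[k]
     for j in 1:size(A, 2) for k in nzrange(A, j))

# Wrappers are one-liners on top of the parent's iterator: implementing
# the protocol is O(N) work in the number of array types, while writing
# every pairwise binary method is O(N^2).
storedpairs(A::Transpose) =
    ((j, i) => transpose(v) for ((i, j), v) in storedpairs(parent(A)))
storedpairs(A::Adjoint) =
    ((j, i) => adjoint(v) for ((i, j), v) in storedpairs(parent(A)))
```

A generic binary method such as `==` or `+` would then be written once against `storedpairs` and automatically cover any wrapper combination that implements it.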

That package is not currently at a high point in my own priority list (which is quite long), so if folks are interested I’d be happy to see someone take ownership of it. Two notes:

  • if memory serves you want to start from the teh/stored2 branch rather than master (but I could be wrong)
  • it’s possible this will only make sense once we can stack-allocate objects that contain a reference to the heap (see the Stack Overflow question “Unexpected memory allocation when using array views (julia)”), because this package creates lots of wrappers and you don’t want that to allocate. However, I am hopeful that the improved optimizer we now have would elide the wrapper creations in most cases, so perhaps with careful work it won’t be necessary to wait.

copy seems to help a lot for transpose

@time x' == x'
4.18 sec
@time copy(x') == copy(x')
0.00026 sec
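
That works because `copy` materializes the lazy wrapper: SparseArrays has a fast specialized sparse transpose, so `copy(x')` returns a plain `SparseMatrixCSC`, and the comparison then dispatches to the optimized sparse `==` instead of the generic elementwise fallback. A small sketch:

```julia
using SparseArrays

A = sprand(1000, 1000, 0.001)

x1 = A'        # lazy Adjoint wrapper around the sparse matrix
x2 = copy(A')  # eagerly materialized back into a SparseMatrixCSC

# x1 == x1 hits the generic AbstractMatrix fallback (all 10^6 entries);
# x2 == x2 hits the sparse fast path (only the ~1000 stored entries).
```

The price is one extra transposed copy of the matrix, which is usually negligible next to the cost of the generic fallback.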

Is there anybody “in charge” of the sparse matrix library? Fixing all these missing methods with a new iteration protocol is not a small patch; it requires a coordinated effort of several people.


This pull request represents a very impressive effort! But it’s not clear to me: are you attempting to fill in all triples (left operand, right operand, operator) of missing methods, or are you adopting the approach suggested by Tim Holy and others of instituting a new iteration protocol in order to reduce the combinatorial explosion at the cost of some performance?

Is there anybody “in charge” of the sparse matrix library?

Anyone who does the work. git shows that 29 people contributed to the SparseArrays stdlib in 2018 so far.

Fixing all these missing methods with a new iteration protocol is not a small patch-- this requires a coordinated effort of several people.

Agreed, it’s not a small job, which is why it’s not done already. But it does easily break into discrete tasks. The biggest task is getting the infrastructure in ArrayIteration working properly, probably using a couple of array types as test beds. The second biggest task is to write iterators for all the (remaining) array types. The final task, writing a half-dozen generic methods, will be pretty simple by comparison: == might be less than a dozen lines, for example.
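
As a rough illustration of how compact such a generic method can be, here is a self-contained `sparse_equal` (a hypothetical name, written directly against `SparseMatrixCSC` rather than an iterator protocol) that only visits stored entries, merging the two column ranges the way an iterator-based `==` would:

```julia
using SparseArrays

# Equality that only visits stored entries; entries stored in only one
# of the two matrices are compared against the other's implicit zeros.
function sparse_equal(A::SparseMatrixCSC, B::SparseMatrixCSC)
    size(A) == size(B) || return false
    rvA, nzA, rvB, nzB = rowvals(A), nonzeros(A), rowvals(B), nonzeros(B)
    for j in 1:size(A, 2)
        ka, kb = nzrange(A, j), nzrange(B, j)
        ia, ib = first(ka), first(kb)
        while ia <= last(ka) && ib <= last(kb)
            if rvA[ia] == rvB[ib]            # stored in both
                nzA[ia] == nzB[ib] || return false
                ia += 1; ib += 1
            elseif rvA[ia] < rvB[ib]         # stored only in A
                iszero(nzA[ia]) || return false
                ia += 1
            else                             # stored only in B
                iszero(nzB[ib]) || return false
                ib += 1
            end
        end
        while ia <= last(ka)                 # trailing entries in A
            iszero(nzA[ia]) || return false
            ia += 1
        end
        while ib <= last(kb)                 # trailing entries in B
            iszero(nzB[ib]) || return false
            ib += 1
        end
    end
    return true
end
```

This runs in O(nnz(A) + nnz(B)) instead of O(m*n), which is exactly the gap between the 3 ms and 12 s timings at the top of the thread.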

Could you explain in more detail why it’s necessary to solve the “small-item-on-the-stack” issue before you can implement array iteration? In particular, the new iteration protocol in 0.7 seems to work fine even though the small-item-on-the-stack issue is still open.

Algorithms like matrix multiplication can often be compactly represented in terms of column & row iterators, e.g., https://github.com/timholy/ArrayIteration.jl/blob/6736ec22025e0b0c09d948a6ec1cecec7691e7bc/src/linalg_couple.jl#L4-L12 (wow, I got farther implementing this than I had remembered…). Here’s an example of a column iterator: https://github.com/timholy/ArrayIteration.jl/blob/6736ec22025e0b0c09d948a6ec1cecec7691e7bc/src/sparse.jl#L50-L53. It holds a reference to the original array, and hence if the compiler doesn’t inline & elide everything then it will be allocated on the heap.

It’s not at all crazy to hope that the compiler will inline and elide everything, so this might be doable today. But if you hit a roadblock, my prediction is that’s where it will happen.


Allow me to summarize this situation:

  • You (Tim) are proposing a system of coupled iterators (simultaneous iteration over both operands). The system is elegant and powerful, a bit tricky to understand, and may require some additional core-compiler optimizations. However, it has the potential to solve almost every current performance issue in the sparse matrix library. It is not at the top of your priority list.

  • Others (including me) have proposed straightforward one-operand iterators that could solve a subset of the issues, mainly the low-hanging fruit, but would fail to address more complex matrix-matrix operations.

  • The PR 28883 mentioned earlier in this thread proposes and partly implements individually written, fully optimized code for binary operators (+, *) for many specific pairs of matrix operand types.

Each of these approaches has its pros and cons, but to some extent they are competing, which makes it difficult for someone who wants to jump in.

This situation cries out for a leadership decision! Someone with more clout than me in the Julia community needs to post an overarching framework for fixing the sparse matrix library so that the rest of us can make PRs to implement the framework.


That’s a great summary. From a logical standpoint these might appear to be competing strategies, but honestly I think in real-world development that’s unlikely. Suppose I or someone implemented my ambitious strategy: wouldn’t you want to benchmark it against something? Wouldn’t that something be a more straightforward implementation? If that more straightforward implementation already exists, doesn’t that save the “ambitious developer” a ton of time, not to mention the fact that meanwhile other users/developers have been enjoying the benefits of something that already works well in a subset of cases?

Consequently as usual I think leadership on this issue should come “from the bottom,” meaning the people who do the work. It seems that a number of good solutions are just around the corner, and I’d urge folks to push ahead with those and get them merged. Over the longer timescale one or more people might tackle the more ambitious approach, and then benchmark against what we have now. In a few places it’s possible the general strategy will be just as good as the handwritten one and can replace it, but when that’s not true then one can keep the hyper-optimized version and use the more generic fallback to handle any missing cases.


If you are speaking about 28451, that PR exclusively covers multiplication and division of dense right-hand sides by triangular sparse matrices. Here “triangular” means one of UpperTriangular, LowerTriangular, UnitUpperTriangular, UnitLowerTriangular literally, plus the Transpose and Adjoint of those. Since both multiplication and division are implemented for each matrix type, there are 24 cases to cover, which were implemented in 8 functions. In the pre-PR master branch only 6 of the 24 work at speed (only divisions, and not for Unit...Triangular).

Inspired by your issue @Stephen_Vavasis
https://github.com/JuliaLang/julia/issues/28432#issuecomment-416672359
I just started the nziterator approach.


https://github.com/KlausC/SparseWrappers.jl