Potential performance regression wrt 0.6: Products involving diagonal matrices MUCH slower than with general sparse matrices

FHulsemann · September 20, 2019, 4:02pm

Hello. It so happens that I need to normalize sparse (real) matrices from time to time. Moving to v0.7 (and later), I have come across some surprising behaviour involving Diagonal (sparse) matrices.

This example illustrates the point (of course, in real life, I do not multiply small, sparse identity matrices, but the performance drop is the same):

using SparseArrays
using LinearAlgebra
using Profile

function diag_times_sparse(nrows, pswitch)
    spmat = sparse(1.0*I,nrows, nrows)
    diagmat = Diagonal(diag(spmat))
    if pswitch > 1
       @profile resmat = diagmat*spmat*diagmat
       Profile.print(C=true, sortedby=:count)
    else
       @time resmat = diagmat*spmat*diagmat
    end
    return resmat
end

function sparse_times_sparse(nrows)
    spmat1 = sparse(1.0*I,nrows, nrows)
    spmat2 = sparse(1.0*I,nrows, nrows)
    @time resmat = spmat2*spmat1*spmat2
    return resmat
end

diag_times_sparse(5,1)
sparse_times_sparse(5)

diag_times_sparse(1000,1)
#diag_times_sparse(1000,2) if one wants profile data
sparse_times_sparse(1000)

On my machine (julia 0.6.4, Linux x86_64, binary distribution), I get as output
0.000016 seconds (8 allocations: 864 bytes)
0.187978 seconds (35.96 k allocations: 1.816 MiB)
0.000032 seconds (8 allocations: 47.844 KiB)
0.000103 seconds (20 allocations: 158.656 KiB)
the first two lines being the calls to get the functions compiled.

With v0.7 (again, plain vanilla binary download), I get:
0.000019 seconds (12 allocations: 1.063 KiB)
0.146651 seconds (175.73 k allocations: 8.672 MiB)
1.675994 seconds (14 allocations: 48.172 KiB)
0.000080 seconds (18 allocations: 158.469 KiB),

and finally, 1.3rc2 (plain vanilla binary distribution):
0.000022 seconds (18 allocations: 1.313 KiB)
0.000598 seconds (16 allocations: 1.438 KiB)
3.462305 seconds (22 allocations: 48.531 KiB)
0.001336 seconds (16 allocations: 122.250 KiB)

In v0.7, the profiler shows that one spends one’s time in the interpreter, even during the second call.

A hand-rolled normalization function works just fine.
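For readers wondering what such a hand-rolled function might look like, here is a minimal sketch. It is not the code I actually use; the name `scale_rows_cols` is made up for illustration, and it assumes that the matrix and the scaling vector share the same element type:

```julia
using SparseArrays

# Hypothetical helper (illustration only): computes Diagonal(d) * A * Diagonal(d)
# for a square SparseMatrixCSC by visiting only the stored entries.
function scale_rows_cols(A::SparseMatrixCSC, d::AbstractVector)
    m, n = size(A)
    (m == length(d) && n == length(d)) ||
        throw(DimensionMismatch("d must match both dimensions of A"))
    B = copy(A)               # same sparsity pattern as A
    rows = rowvals(B)         # row index of each stored entry
    vals = nonzeros(B)        # stored values, modified in place
    for col in 1:n
        dcol = d[col]
        for p in nzrange(B, col)
            # entry (rows[p], col) is scaled by d[row] * d[col]
            vals[p] *= d[rows[p]] * dcol
        end
    end
    return B
end
```

The cost is proportional to the number of stored entries, which is what one would expect from the library product as well.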

Hope this helps to put someone on the right track.

Keep the impressive work going!

ffevotte · September 20, 2019, 7:32pm

(Hello Frank and welcome back!)

Thanks for this report. After some investigation, it looks like a specialization of the Matrix*Matrix multiplication is defined here for the specific case of a SparseMatrix*Diagonal product. However, this specialization expects the Diagonal operand to be defined by a Vector subtype:

function mul!(C::AbstractSparseMatrixCSC, A::AbstractSparseMatrixCSC, D::Diagonal{T, <:Vector}) where T

whereas in your specific example the type of the Diagonal matrix is:

julia> typeof(diagmat)
Diagonal{Float64,SparseVector{Float64,Int64}}

julia> typeof(diagmat) <: Diagonal{Float64, <:Vector}
false

julia> typeof(diagmat) <: Diagonal{Float64, <:AbstractVector}
true

I would think that this is an overspecialization of this implementation of mul!, which could safely be fixed to accept Diagonal{T, <:AbstractVector}.
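To make the dispatch behaviour concrete, here is a small toy sketch. The functions `accepts_vector` and `accepts_abstract` are hypothetical stand-ins that only mimic the two signatures; they are not part of SparseArrays:

```julia
using LinearAlgebra, SparseArrays

# Mimics the current, restrictive signature.
accepts_vector(::Diagonal{T, <:Vector}) where {T} = true
# Stands in for the generic (slow) fallback.
accepts_vector(::Diagonal) = false

# Mimics the relaxed signature.
accepts_abstract(::Diagonal{T, <:AbstractVector}) where {T} = true

dense_diag  = Diagonal([1.0, 1.0, 1.0])            # backed by a Vector
sparse_diag = Diagonal(sparsevec([1.0, 1.0, 1.0])) # backed by a SparseVector

accepts_vector(dense_diag)     # true:  the specialized method applies
accepts_vector(sparse_diag)    # false: falls through to the fallback
accepts_abstract(sparse_diag)  # true:  the relaxed signature catches it
```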

Below is a complete example monkey-patching SparseArrays in order to demonstrate that this would greatly improve performance (I’m on v1.1.0, hence the slightly different code w.r.t. the version linked above):

using SparseArrays

# multiply by diagonal matrix as vector
@eval SparseArrays function mul!(C::SparseMatrixCSC, A::SparseMatrixCSC, D::Diagonal{T, <:AbstractVector}) where T
    m, n = size(A)
    b    = D.diag
    (n==length(b) && size(A)==size(C)) || throw(DimensionMismatch())
    copyinds!(C, A)
    Cnzval = C.nzval
    Anzval = A.nzval
    resize!(Cnzval, length(Anzval))
    for col = 1:n, p = A.colptr[col]:(A.colptr[col+1]-1)
        @inbounds Cnzval[p] = Anzval[p] * b[col]
    end
    C
end


using LinearAlgebra
using BenchmarkTools

nrows = 1000
spmat1 = sparse(1.0*I,nrows, nrows);
spmat2 = sparse(1.0*I,nrows, nrows);
diagmat = Diagonal(diag(spmat2));
@btime $spmat1*$diagmat;
@btime $spmat1*$spmat2;

yielding:

julia> @btime $spmat1*$diagmat;
  48.232 μs (4 allocations: 23.92 KiB)

julia> @btime $spmat1*$spmat2;
  19.026 μs (7 allocations: 61.11 KiB)

There is still some loss of performance w.r.t. the SparseMatrix*SparseMatrix product, but at least the order of magnitude/complexity is now right.

Unless someone more knowledgeable chimes in to point out a flaw in the proposal above, I’ll try to submit a PR to fix this.
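In the meantime, a user-side workaround that does not touch SparseArrays might be to make sure the Diagonal wraps a dense Vector, so that the existing specialized method applies. A minimal sketch, assuming the extra dense copy of the diagonal is acceptable (the variable names are only illustrative):

```julia
using LinearAlgebra, SparseArrays

nrows = 1000
spmat = sparse(1.0I, nrows, nrows)

# diag(spmat) is a SparseVector; Vector(...) copies it into a dense Vector,
# so the wrapper has type Diagonal{Float64, Vector{Float64}}.
diagmat_dense = Diagonal(Vector(diag(spmat)))

resmat = diagmat_dense * spmat * diagmat_dense
```

I have not benchmarked this variant across Julia versions, so it is worth timing on your own setup.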

Pbellive · September 20, 2019, 8:07pm

Your solution looks reasonable to me @ffevotte. Thanks for offering to submit a PR! Although I think there should continue to be a specialized method for the case of multiplying sparse and diagonal matrices, the fact that very small oversights like this can lead to falling back to generic abstract array methods and give these absolutely massive performance hits is yet another argument for pursuing something like what @klacru proposed in https://github.com/JuliaLang/julia/pull/31563. It would be really great if, when a method like this is missing, the call fell back to a (sparse, AbstractArray) method rather than an unusably slow, fully generic (AbstractArray, AbstractArray) method.

@klacru, any chance that 31563 or something like it will get resurrected anytime soon?

klacru · September 20, 2019, 9:30pm

I just resolved the merge conflicts and hope that somebody will review this PR.

tkf · September 20, 2019, 10:09pm

I think so too.

I think the point here is to avoid over specializations. Here is another low-hanging fruit:

github.com/JuliaSparse/SparseArrays.jl

Drop StridedArray restriction for some sparse mul!s?

opened 02:17AM - 11 Sep 19 UTC

tkf

Currently, `mul!(C, A, B, α, β)` for the case `A` is a sparse matrix is defined …as

https://github.com/JuliaLang/julia/blob/2d4f4d26a0c5a74717d826dc7e1e62f7650a2e9d/stdlib/SparseArrays/src/linalg.jl#L34-L52

This method is not used in case `B` is `Symmetric`, sparse, etc.

https://github.com/JuliaLang/julia/issues/33214#issuecomment-530181496

Would it be crazy to remove restriction `B::Union{StridedVector,AdjOrTransStridedMatrix}`?

As indexing on `B` does not appear in the inner-most loop, I wonder if it is not such a bad default. Of course, it depends on how dense `A` is (`for col = 1:size(A, 2)` could be considered "inner-most" if `A` is super sparse). However, it's not immediately apparent that there is a case where the generic fallback can be much better than the code assuming that `A` is sparse.

(Of course, solving the array wrapper problem is an issue that needs more dedicated solution.)
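To illustrate the loop structure I have in mind, here is a simplified sketch of a kernel that only assumes `A` is sparse. It is not the actual SparseArrays code: the name `sparse_times_any!` is made up, and the `α`/`β` scaling is left out to keep it short.

```julia
using SparseArrays

# Hypothetical kernel (illustration only): C = A * B for a sparse A and
# an arbitrary AbstractMatrix B.
function sparse_times_any!(C::AbstractMatrix, A::SparseMatrixCSC, B::AbstractMatrix)
    size(A, 2) == size(B, 1) || throw(DimensionMismatch())
    size(C) == (size(A, 1), size(B, 2)) || throw(DimensionMismatch())
    fill!(C, zero(eltype(C)))
    rows = rowvals(A)
    vals = nonzeros(A)
    for k in 1:size(B, 2), col in 1:size(A, 2)
        b = B[col, k]                 # B is indexed once per (col, k) pair
        for p in nzrange(A, col)      # innermost loop only walks stored entries of A
            C[rows[p], k] += vals[p] * b
        end
    end
    return C
end
```

Nothing in the innermost loop depends on how `B` is stored, which is why a `Symmetric` or otherwise wrapped `B` should not hurt much here.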

Pbellive · September 20, 2019, 10:46pm

Loosening some of these over-specialized method signatures sounds like a good idea to me and would fix a good number of the bugs where a call falls back to a generic abstract array method. Even after taking that approach, though, there could still be cases that are missed and fall back to dense array routines. My main point in the paragraph that you quoted was just that it would be good to have a mechanism (maybe something like 31563, maybe something else) that catches the cases that fall through the cracks, so that they fall back to (op)(Sparse, AbstractArray) methods instead of (op)(AbstractArray, AbstractArray) methods.

Perhaps that’s what you meant by

but I thought I’d clarify my point.

Edit: I should add that I really appreciate that folks like you @tkf, @klacru, @andreasnoack and others (apologies to the many I’ve missed) have been thinking hard about sparse arrays and sparse linear algebra in julia. I’ve been an occasional contributor to and frequent observer of the code base. Haven’t been able to think much about it lately but I’m glad that there are people taking this code seriously and keeping it moving forward.

tkf · September 20, 2019, 11:28pm

I totally agree that we need a better approach. Maybe a better type hierarchy and/or maybe some trait-based solution. Let me also mention that for highly overloaded functions like *, another problem is “ambiguity resolution hell”; you have to repeat redundant definitions to make julia happy. I hope that we can come up with a new solution that addresses all these problems.
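Just to sketch what a trait-based approach could look like (purely hypothetical: none of the names below exist in Julia or SparseArrays, and a real design would need much more thought):

```julia
using LinearAlgebra, SparseArrays

# Hypothetical storage traits.
abstract type StorageStyle end
struct SparseStorage  <: StorageStyle end
struct GenericStorage <: StorageStyle end

storage_style(::AbstractArray)   = GenericStorage()
storage_style(::SparseMatrixCSC) = SparseStorage()
# Wrappers forward the trait of the array they wrap.
storage_style(A::Union{Symmetric, Adjoint, Transpose}) = storage_style(parent(A))

# Single entry point; the pair of traits selects the kernel.
describe_product(A::AbstractMatrix, B::AbstractMatrix) =
    describe_product(storage_style(A), storage_style(B))

describe_product(::SparseStorage, ::StorageStyle)    = "sparse-aware kernel"
describe_product(::StorageStyle, ::SparseStorage)    = "sparse-aware kernel"
# Without this extra method the (sparse, sparse) case would be ambiguous,
# which is a small instance of the ambiguity problem mentioned above.
describe_product(::SparseStorage, ::SparseStorage)   = "sparse-aware kernel"
describe_product(::GenericStorage, ::GenericStorage) = "generic kernel"
```

With these definitions, `describe_product(Symmetric(S), ones(3, 3))` for a sparse `S` picks the sparse-aware branch, because the wrapper forwards the trait of its parent.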

ffevotte · September 21, 2019, 5:09pm

github.com/JuliaLang/julia

Avoid overspecialization of the SparseMatrixCSC * Diagonal product

JuliaLang:master ← ffevotte:SparseMatrix_Diagonal_prod

opened 05:05PM - 21 Sep 19 UTC

ffevotte

+2 -2

A recent [thread on discourse](https://discourse.julialang.org/t/potential-perfo…rmance-regression-wrt-0-6-products-involving-diagonal-matrices-much-slower-than-with-general-sparse-matrices/29005?u=ffevotte) describes a use case where the product between a `SparseMatrixCSC` and a `Diagonal` matrix falls back to a generic, dense algorithm. This of course entails very significant losses in performance.

It looks like in this case, the problem comes from the [relevant `mul!` methods](https://github.com/JuliaLang/julia/blob/34d2b87b65b1643b1055b10aa5ea7d2bdbcf6cd2/stdlib/SparseArrays/src/linalg.jl#L1362-L1389) being overly specialized because they accept only `Diagonal{T, <:Vector}` operands. I think this could safely be generalized to `Diagonal{T, <:AbstractVector}` operands, and this is what this PR proposes.

Below is a test case illustrating the performance issues:

```julia
using SparseArrays
using LinearAlgebra
using BenchmarkTools

nrows = 1000
spmat1 = sparse(1.0*I,nrows, nrows);
spmat2 = sparse(1.0*I,nrows, nrows);
diagmat = Diagonal(diag(spmat2));
@btime $spmat1*$diagmat;
@btime $spmat1*$spmat2;
```

yielding (on my machine and for the current `master` branch):

```julia
julia> @btime $spmat1*$diagmat;
  1.145 s (11 allocations: 24.27 KiB)

julia> @btime $spmat1*$spmat2;
  21.097 μs (8 allocations: 61.13 KiB)
```

and with the current PR:

```julia
julia> @btime $spmat1*$diagmat;
  46.593 μs (5 allocations: 23.94 KiB)

julia> @btime $spmat1*$spmat2;
  20.650 μs (8 allocations: 61.13 KiB)
```

I'm not sure where to go from here. This is "only" a performance issue, so I'm not sure how to formalize its resolution with a test case in order to avoid potential, future regressions. I also don't think it needs any further documentation. But this is my first PR and I may very well be missing something here; please do not hesitate to tell me if there is anything I could do to improve this PR.

Thanks!
