Slow sparse matrix-vector product with symmetric matrices

DNF · May 29, 2017, 2:20pm

You’re right, I forgot about that. It works if you only call Symmetric(A) when A is already symmetric in the first place, but that would cripple the functionality.

dpo · May 29, 2017, 6:44pm

Here’s an implementation: https://gist.github.com/481b0c03dd08d26af342573df98ddc21

Timings:

julia> include("symmetric_matvec.jl")
A_mul_B! (generic function with 92 methods)

julia> using MatrixMarket

julia> A = MatrixMarket.mmread("af_shell8.mtx");

julia> size(A)
(504855,504855)

julia> nnz(A)
17579155

julia> B = Symmetric(A);

julia> b = ones(A.n);

julia> using BenchmarkTools

julia> @benchmark B * b
BenchmarkTools.Trial: 
  memory estimate:  3.85 MiB
  allocs estimate:  2
  --------------
  minimum time:     22.198 ms (0.00% GC)
  median time:      26.073 ms (0.00% GC)
  mean time:        26.232 ms (0.87% GC)
  maximum time:     40.818 ms (0.00% GC)
  --------------
  samples:          191
  evals/sample:     1

julia> @benchmark A * b
BenchmarkTools.Trial: 
  memory estimate:  3.85 MiB
  allocs estimate:  2
  --------------
  minimum time:     21.040 ms (0.00% GC)
  median time:      23.167 ms (0.00% GC)
  mean time:        23.644 ms (1.27% GC)
  maximum time:     29.607 ms (6.07% GC)
  --------------
  samples:          211
  evals/sample:     1

jebej · May 29, 2017, 6:52pm

huh, I wonder why that’s the case…

DNF · May 29, 2017, 6:58pm

I guess because you can more naturally traverse the matrix in column order instead of in row order when performing the multiplication. Just write out the product as a sum over the indices to see why.

I presume that BLAS may have some tricks up its sleeve to optimize memory access in any case, but at least the naive implementation of At_mul_B! should be faster than A_mul_B!.

jebej · May 29, 2017, 7:03pm

You can just as well traverse the matrix in column order, multiplying the first element of the vector with the first column, and then the second element with the second column.

DNF · May 29, 2017, 7:18pm

Yeah, but then the results of those multiplications go into separate entries in the output vector, instead of each column-vector product accumulating into a single element in the output vector.

But no matter, I don’t really know how BLAS optimizes this stuff. I have just observed that At_mul_B! is faster by some significant amount, and that it seems easier to optimize it in a naive way.

kristoffer.carlsson · May 29, 2017, 8:11pm

Yes, At_mul_B is faster for CSC, while the opposite is true for CSR.
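
To make the scatter-versus-accumulate distinction concrete, here is a minimal sketch in current Julia (1.x) syntax. The function names csc_matvec and csc_matvec_transpose are hypothetical, and this is not the code in Base; it only illustrates the two access patterns for CSC storage.

```julia
using SparseArrays

# y = A * x with CSC storage: column j scatters x[j] * A[:, j] into y,
# so every inner iteration writes to a different entry of y.
function csc_matvec(A::SparseMatrixCSC, x::AbstractVector)
    y = zeros(promote_type(eltype(A), eltype(x)), size(A, 1))
    rows, vals = rowvals(A), nonzeros(A)
    for j in 1:size(A, 2)
        xj = x[j]
        for k in nzrange(A, j)
            y[rows[k]] += vals[k] * xj
        end
    end
    return y
end

# y = transpose(A) * x with CSC storage: column j of A is row j of transpose(A),
# so the inner loop gathers from x and accumulates into one local scalar.
function csc_matvec_transpose(A::SparseMatrixCSC, x::AbstractVector)
    y = zeros(promote_type(eltype(A), eltype(x)), size(A, 2))
    rows, vals = rowvals(A), nonzeros(A)
    for j in 1:size(A, 2)
        acc = zero(eltype(y))
        for k in nzrange(A, j)
            acc += vals[k] * x[rows[k]]
        end
        y[j] = acc
    end
    return y
end
```

The second loop does a single store per column, which is one plausible reason the transposed product is easier to make fast in a naive implementation.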

dpo · May 29, 2017, 10:06pm

@kristoffer.carlsson Can the above implementation be improved in your opinion? What would be missing for a PR? All this stuff?

kristoffer.carlsson · May 30, 2017, 7:07am

It is tricky. I’m not sure it is a good idea to extract the triangular part using triu and tril directly when calling Symmetric, because I think the expectation is that Symmetric is just a wrapper and doesn’t modify the data in any way. On the other hand, having to extract the triangular part every time a matvec is done will be expensive, so that is not really good either.

dpo · May 30, 2017, 12:24pm

I would think Symmetric is handy in that you can assemble only one triangle and claim it contains all the information of a symmetric operator. It’s often the case for matrices from the UFL collection, or in applications (e.g., Hessians in optimization). I’ll see if I can scan only one triangle efficiently without calling tril() or triu().

dpo · June 1, 2017, 3:28am

I posted two other variants at the gist above along with benchmark results. The variants are:

1. Symmetric() calls tril() or triu() and A_mul_B!() is straightforward;
2. Symmetric() does not modify A, and A_mul_B!() skips irrelevant indices;
3. Symmetric() does not modify A but builds an index of relevant indices (corresponding to either the upper or lower triangle), and A_mul_B!() relies on that index.

Variant 1 has the disadvantage of modifying A. However, it seems to me that one of the reasons for using Symmetric is to save storage. This allows users to only assemble one triangle (e.g., in FEM or optimization). In addition, the sparse symmetric factorization packages I know only access one triangle. One could imagine requiring the input A to be triangular when calling Symmetric().

Variant 2 has the advantage of not modifying A and not requiring extra storage. Strangely, multiplying with the lower triangle by scanning row indices backward is slower than multiplying with the upper triangle (and scanning forward). Perhaps this can be improved?! The performance of the product with the upper triangle is basically the same as that of Variant 1.
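
As a rough illustration of the idea behind Variant 2 (not the code from the gist), the sketch below multiplies with the symmetric matrix defined by the upper triangle of A and skips everything stored below the diagonal. It is written in current Julia (1.x) syntax; symv_upper is a hypothetical name, and the early break assumes the row indices within each column are sorted, as they are for matrices built with sparse().

```julia
using SparseArrays

# Hypothetical helper: y = S * x, where S is the symmetric matrix whose upper
# triangle is the upper triangle of A. Entries of A below the diagonal are ignored.
function symv_upper(A::SparseMatrixCSC, x::AbstractVector)
    n = size(A, 2)
    size(A, 1) == n || throw(DimensionMismatch("A must be square"))
    y = zeros(promote_type(eltype(A), eltype(x)), n)
    rows, vals = rowvals(A), nonzeros(A)
    for j in 1:n
        xj = x[j]
        acc = zero(eltype(y))
        for k in nzrange(A, j)
            i = rows[k]
            i > j && break          # sorted rows: the rest of the column is below the diagonal
            a = vals[k]
            if i == j
                acc += a * xj       # diagonal entry is counted once
            else
                y[i] += a * xj      # contribution of the stored entry A[i, j]
                acc += a * x[i]     # contribution of the mirrored entry S[j, i]
            end
        end
        y[j] += acc
    end
    return y
end
```

No extra storage is needed, and A is left untouched; the cost is the comparison i > j on every visited entry.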

Variant 3 costs an extra array of length n (the number of columns of A) that gives the initial/final index for each column. The performance is basically the same as that of Variant 1.
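
The indexing idea of Variant 3 could look roughly like the following sketch for the upper triangle, again in current Julia (1.x) syntax and not taken from the gist. Both upper_end_index and symv_upper_indexed are hypothetical names, and sorted row indices within each column are assumed.

```julia
using SparseArrays

# Hypothetical helper: for each column j, the position in nonzeros(A) of the last
# stored entry with row index <= j (one before the column start if there is none).
function upper_end_index(A::SparseMatrixCSC)
    rows = rowvals(A)
    n = size(A, 2)
    last_upper = Vector{Int}(undef, n)
    for j in 1:n
        r = nzrange(A, j)
        last_upper[j] = first(r) - 1 + searchsortedlast(view(rows, r), j)
    end
    return last_upper
end

# y = S * x using the precomputed index, so the inner loop needs no i > j test.
function symv_upper_indexed(A::SparseMatrixCSC, last_upper::Vector{Int}, x::AbstractVector)
    n = size(A, 2)
    y = zeros(promote_type(eltype(A), eltype(x)), n)
    rows, vals = rowvals(A), nonzeros(A)
    for j in 1:n
        xj = x[j]
        acc = zero(eltype(y))
        for k in first(nzrange(A, j)):last_upper[j]
            i = rows[k]
            a = vals[k]
            if i == j
                acc += a * xj       # diagonal entry is counted once
            else
                y[i] += a * xj      # stored entry A[i, j]
                acc += a * x[i]     # mirrored entry S[j, i]
            end
        end
        y[j] += acc
    end
    return y
end
```

The index is built once (for example when the wrapper is created) and reused for every product, which is where the extra array of length n comes from.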

Comments?

andreasnoack · June 1, 2017, 4:59pm

Option 2 sounds like the right approach. A PR would be great.

mohamed82008 · January 15, 2018, 3:35pm

Any update on this? I can still confirm that Symmetric slows down the matrix-vector multiplication by a factor of 100 on my machine.

klacru · July 28, 2019, 5:04am

This issue should have been solved for a while now, as of Julia version 1.2. Please look at #30018 and #32689.
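
For anyone who wants to check this on their own installation, here is a small sketch in Julia 1.x syntax, where A_mul_B! and At_mul_B! no longer exist and ones(A.n) would be written ones(size(A, 2)). The matrix U is a made-up 3×3 example holding only the upper triangle; which kernel Symmetric(U, :U) * b dispatches to depends on the Julia version.

```julia
using LinearAlgebra, SparseArrays

# Made-up example: only the upper triangle is stored.
U = sparse([1, 1, 2, 1, 3], [1, 2, 2, 3, 3], [2.0, -1.0, 3.0, 4.0, 5.0], 3, 3)

S = Symmetric(U, :U)          # wrapper only; U itself is not modified
b = ones(size(U, 2))

y = S * b

# Reference result from the explicitly assembled full matrix.
F = U + transpose(U) - Diagonal(diag(U))
y_ref = F * b
```

Timing S * b against F * b with BenchmarkTools on a large matrix, as in the earlier posts, is the quickest way to see whether the slowdown is gone on a given version.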

mohamed82008 · July 28, 2019, 7:19am

Cool, that’s awesome! Thanks for your efforts.
