Slow sparse matrix-vector product with symmetric matrices


@DNF: I get the same timing whether I @btime $A*$b or @time it.

As to symmetric matrix multiplication, symmetric matrices are just stored as full square matrices, with a flag telling BLAS to only look at the upper or lower triangle (as you can check by accessing and B.uplo). As a result, BLAS only needs to loop over half the entries of the matrix. Since the kernel is memory bound (most of the time is spent loading values of A from memory), this should be a speedup. It’s very strange that you get worse results with symmetric matrices, and should probably be reported as a bug in julia or OpenBLAS.


The underlying Sparse matrix could have zeros on one of the triangular parts.


Yes, that could be a solution. Are there existing libraries for products of symmetric or hermitian sparse matrices?


Thanks @kristoffer.carlsson! Is there an open issue or pull request to implement sparse symmetric matrix-vector product?


I don’t think so.


One simple implementation that confers a speedup is (warning–not really tested):

LinAlg.A_mul_B!(y::AbstractVector, A::Symmetric, x::AbstractVector) = At_mul_B!(y,, x)
This seems to be faster than `A * x`, since `At_mul_B!` is faster than `A_mul_B!`, and valid, since `A' == A`.

**Edit:** I mean to say it is faster than `A * x` for `A` sparse but not symmetric, and thus the symmetry speeds up the product.


I don’t think you can assume that is symmetric.


You’re right, I forgot about that. If you only call Symmetric(A) if A is already symmetric in the first place, it works, but that would cripple the functionality.


Here’s an implementation:


julia> include("symmetric_matvec.jl")
A_mul_B! (generic function with 92 methods)

julia> using MatrixMarket

julia> A = MatrixMarket.mmread("af_shell8.mtx");

julia> size(A)

julia> nnz(A)

julia> B = Symmetric(A);

julia> b = ones(A.n);

julia> using BenchmarkTools

julia> @benchmark B * b
  memory estimate:  3.85 MiB
  allocs estimate:  2
  minimum time:     22.198 ms (0.00% GC)
  median time:      26.073 ms (0.00% GC)
  mean time:        26.232 ms (0.87% GC)
  maximum time:     40.818 ms (0.00% GC)
  samples:          191
  evals/sample:     1

julia> @benchmark A * b
  memory estimate:  3.85 MiB
  allocs estimate:  2
  minimum time:     21.040 ms (0.00% GC)
  median time:      23.167 ms (0.00% GC)
  mean time:        23.644 ms (1.27% GC)
  maximum time:     29.607 ms (6.07% GC)
  samples:          211
  evals/sample:     1


huh, I wonder why that’s the case…


I guess because you can more naturally traverse the matrix in column order instead of in row order when performing the multiplication. Just write out the product as a sum over the indices to see why.

I presume that BLAS may have some tricks up its sleeve to optimize memory access in any case, but at least the naive implementation of At_mul_B! should be faster than A_mul_B!.


You can just as well traverse the matrix in column order, multiplying the first element of the vector with the first column, and then the second element with the second column.


Yeah, but then the result of each of those multiplications go into separate entries in the output vector, instead of each column-vector product accumulating into a single element in the output vector.

But no matter, I don’t really know how BLAS optimizes this stuff. I have just observed that At_mul_B! is faster by some significant amount, and that it seems easier to optimize it in a naive way.


Yes At_mul_B is faster for CSC while it is the opposite for CSR.


@kristoffer.carlsson Can the above implementation be improved in your opinion? What would be missing for a PR? All this stuff?


It is tricky. I’m not sure it is so good to extract the triangular part using triu and tril directly when calling Symmetric because I think the expectation is that Symmetric is just a wrapper and doesn’t modify the data in anyway. On the other hand, having to extract the triangular part every time a matvec is done will be expensive so that is not really good either.


I would think Symmetric is handy in that you can assemble only one triangle and claim it contains all the information of a symmetric operator. It’s often the case for matrices from the UFL collection, or in applications (e.g., Hessians in optimization). I’ll see if I can scan only one triangle efficiently without calling tril() or triu().


I posted two other variants at the gist above along with benchmark results. The variants are:

  1. Symmetric() calls tril() or triu() and A_mul_B!() is straightforward;
  2. Symmetric() does not modify A, and A_mul_B!() skip irrelevant indices;
  3. Symmetric() does not modify A but builds an index of relevant indices (corresponding to either the upper or lower triangle), and A_mul_B!() relies on that index.

Variant 1 has the disadvantage of modifying A. However, it seems to me that one of the reasons for using Symmetric is to save storage. This allows users to only assemble one triangle (e.g., in FEM or optimization). In addition, the sparse symmetric factorization packages I know only access one triangle. One could imagine requiring the input A to be triangular when calling Symmetric().

Variant 2 has the advantage of not modifying A and not requiring extra storage. Strangely, multiplying with the lower triangle by scanning row indices backward is slower than multiplying with the upper triangle (and scanning forward). Perhaps this can be improved?! The performance of the product with the upper triangle is basically the same as that of Variant 1.

Variant 3 costs an extra array of length n (the number of columns of A) that gives the initial/final index for each column. The performance is basically the same as that of Variant 1.



Option 2. sounds like the right approach. It would be great with a PR.


Any update on this? I can still confirm that Symmetric slows down the matrix-vector multiplication by a factor of 100 on my machine.

Inverse of a symmetric matrix is not symmetric?