Slow sparse matrix-vector product with symmetric matrices

julia> A = rand(1000,1000); A = A + A'; A = sparse(A);
julia> e = ones(1000);
julia> @time A * e
  0.054203 seconds (17.26 k allocations: 990.259 KiB)
julia> @time A * e;
  0.001510 seconds (5 allocations: 8.094 KiB)
julia> @time A * e;
  0.001486 seconds (5 allocations: 8.094 KiB)
julia> B = Symmetric(A);
julia> @time B * e;
  0.220572 seconds (70.48 k allocations: 3.445 MiB)
julia> @time B * e;
  0.062014 seconds (5 allocations: 8.094 KiB)
julia> @time B * e;
  0.063677 seconds (5 allocations: 8.094 KiB)

Why is the product with B so much slower?

With the fairly large symmetric sparse matrix https://www.cise.ufl.edu/research/sparse/MM/Schenk_AFE/af_shell8.tar.gz, MatrixDepot downloads a Symmetric{...} matrix and a matrix-vector product never returns (I’m using Julia 0.5.2). By contrast, downloading the file directly and using MatrixMarket, the matrix-vector product is almost instantaneous (the matrix isn’t cast as Symmetric{...}).

I’m not able to test Julia 0.6 because loading the matrix using MatrixMarket never returns (see https://github.com/JuliaSparse/MatrixMarket.jl/issues/27). The same happens with MatrixDepot, which presumably relies on MatrixMarket.jl.

Thanks,

I can reproduce.

I got confused by the title of the post but this is only with sparse matrices, the dense case is two times faster with Symmetric()

Firstly, you should not do benchmarking in the global workspace (this is tip number 1 in the performance section of the manual: https://docs.julialang.org/en/stable/manual/performance-tips/)

Secondly, I don’t understand why you are wrapping a dense matrix in a sparse wrapper, that will be extremely inefficient. Only use sparse if your matrix is really, very sparse.

Thirdly, I wouldn’t expect any gains from declaring A as symmetric like you are doing, since there is no way to exploit the symmetry in a simple matrix-vector product, and the special indexing will have some extra cost. It might also hinder calls to BLAS (not certain about this.)

(Oh, and btw, e is bound to the mathematical constant: e (mathematical constant) - Wikipedia, so it’s better not to overwrite it, though you can if you really want to.)

You can do your benchmarks like this:

using BenchmarkTools

A = rand(1000, 1000)
A = A + A'
Asp = sparse(A)
Asym = Symmetric(A)
x = rand(1000)

julia> @benchmark $A * $x
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     125.456 μs (0.00% GC)
  median time:      177.903 μs (0.00% GC)
  mean time:        181.923 μs (0.00% GC)
  maximum time:     2.567 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark $Asp * $x
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     855.416 μs (0.00% GC)
  median time:      944.306 μs (0.00% GC)
  mean time:        1.001 ms (0.00% GC)
  maximum time:     2.199 ms (0.00% GC)
  --------------
  samples:          4962
  evals/sample:     1

julia> @benchmark $Asym * $x
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     206.220 μs (0.00% GC)
  median time:      209.880 μs (0.00% GC)
  mean time:        227.435 μs (0.00% GC)
  maximum time:     2.879 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

The purpose of the $ signs: https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/manual.md#interpolating-values-into-benchmark-expressions

As you can see, using an ordinary dense matrix is the fastest, a symmetric matrix carries some overhead, while a sparse product is much slower, because you matrix isn’t sparse at all.

2 Likes

Global scope does not matter here, the overhead is negligible. Symmetry does help MV products (I would assume because this is memory-bound, and symmetry reduces memory access by a factor of two). It’s completely true that sparse is a bad idea here, but it is indicative of performance problems for more reasonable setups, e.g. there’s a factor of more than 1e4 here:

using BenchmarkTools
A = sprandn(2000,2000,2e-3); A = A + A'; 
e = ones(2000);
@btime $A*$e;
B = Symmetric(A);
@btime $B * $e;

That seems issue-worthy!

Oddly, I find the opposite. My results:

A=rand(1000,1000); A=A+A'; B=Symmetric(A); x=rand(1000);
for nthreads in [1,2,4,8]
    BLAS.set_num_threads(nthreads)
    println("nthreads=$nthreads")
    @btime $A*$x; @btime $B*$x;
end

nthreads=1
244.932 μs (1 allocation: 7.94 KiB)
96.271 μs (1 allocation: 7.94 KiB)
nthreads=2
141.135 μs (1 allocation: 7.94 KiB)
50.940 μs (1 allocation: 7.94 KiB)
nthreads=4
155.052 μs (1 allocation: 7.94 KiB)
53.440 μs (1 allocation: 7.94 KiB)
nthreads=8
127.618 μs (1 allocation: 7.94 KiB)
30.572 μs (1 allocation: 7.94 KiB)

This is on a machine with four CPU cores, plus hyperthreading, and two memory channels. Since matrix*vector is mostly memory-bound (each matrix elements is read just once), I can understand if the symmetric case is faster, as less memory needs to be accessed. @DNF, what is your machine’s architecture?

This has nothing to do with global scope or by dense being faster etc etc. The explanation is that:

julia> a = sprand(5,5,0.5); b = rand(5); c = similar(b);

julia> @edit A_mul_B!(c, Symmetric(a), b)

falls back to the generic matrix multiplication which will of course be extremely slow for sparse matrices.

3 Likes

@antoine-levitt, @kristoffer.carlsson what is up with this, then? (this is after warm-up, and is consistent on my computer.)

julia> @btime $A * $x;
  106.906 μs (1 allocation: 7.94 KiB)

julia> @time A * x;
  0.000723 seconds (5 allocations: 8.094 KiB)

In all my tests, timings were significantly slower in global scope.

Do you mean memory use? I don’t see how it helps with memory access, it’s just a remapping of the indices, right?

Interesting. Here are mine:

julia> for nthreads in [1,2,4,8]
           BLAS.set_num_threads(nthreads)
           println("nthreads=$nthreads")
           @btime $A*$x; @btime $B*$x;
       end
nthreads=1
  259.607 μs (1 allocation: 7.94 KiB)
  451.926 μs (1 allocation: 7.94 KiB)
nthreads=2
  145.693 μs (1 allocation: 7.94 KiB)
  299.363 μs (1 allocation: 7.94 KiB)
nthreads=4
  110.245 μs (1 allocation: 7.94 KiB)
  206.052 μs (1 allocation: 7.94 KiB)
nthreads=8
  166.924 μs (1 allocation: 7.94 KiB)
  246.219 μs (1 allocation: 7.94 KiB)

I do agree, though, that this is the main cause of the slow-down, taking a Sparse matrix and making it Symmetric.

I just don’t see how it’s correct that overhead in global scope is negligible. If so, I am doing something very wrong…

Yes, benchmarking in global scope will be slower but that is not the issue here. If the function you are calling is big enough, the overhead is negligible.

Timings are slower, but the actual issue is that there is no specialized method for symmetric sparse matrix - dense vector multiplication.

Also, the Symmetric type does not currently save memory, since it wraps a full matrix (whether sparse or dense). Ideally, there could be a specialized SparseSymmetric type that would only store the required elements.

I agree with this. But my experience is that timing in global scope does not slow down everything by a fixed or proportional amount, but rather that some calls are dramatically affected, and others less so; leading to an unreliable impression not just of absolute performance, but also of relative performance.

Am I alone in experiencing this?

No I agree, it is generally more reliable to use Benchmark tools, or do a bunch of loops in a function.

@DNF: I get the same timing whether I @btime $A*$b or @time it.

As to symmetric matrix multiplication, symmetric matrices are just stored as full square matrices, with a flag telling BLAS to only look at the upper or lower triangle (as you can check by accessing B.data and B.uplo). As a result, BLAS only needs to loop over half the entries of the matrix. Since the kernel is memory bound (most of the time is spent loading values of A from memory), this should be a speedup. It’s very strange that you get worse results with symmetric matrices, and should probably be reported as a bug in julia or OpenBLAS.

The underlying Sparse matrix could have zeros on one of the triangular parts.

Yes, that could be a solution. Are there existing libraries for products of symmetric or hermitian sparse matrices?

Thanks @kristoffer.carlsson! Is there an open issue or pull request to implement sparse symmetric matrix-vector product?

I don’t think so.

One simple implementation that confers a speedup is (warning–not really tested):

LinAlg.A_mul_B!(y::AbstractVector, A::Symmetric, x::AbstractVector) = At_mul_B!(y, A.data, x)
```
This seems to be faster than `A * x`, since `At_mul_B!` is faster than `A_mul_B!`, and valid, since `A' == A`.

**Edit:** I mean to say it is faster than `A * x` for `A` sparse but not symmetric, and thus the symmetry speeds up the product.

I don’t think you can assume that A.data is symmetric.