Slow sparse matrix-vector product with symmetric matrices


#1
julia> A = rand(1000,1000); A = A + A'; A = sparse(A);
julia> e = ones(1000);
julia> @time A * e
  0.054203 seconds (17.26 k allocations: 990.259 KiB)
julia> @time A * e;
  0.001510 seconds (5 allocations: 8.094 KiB)
julia> @time A * e;
  0.001486 seconds (5 allocations: 8.094 KiB)
julia> B = Symmetric(A);
julia> @time B * e;
  0.220572 seconds (70.48 k allocations: 3.445 MiB)
julia> @time B * e;
  0.062014 seconds (5 allocations: 8.094 KiB)
julia> @time B * e;
  0.063677 seconds (5 allocations: 8.094 KiB)

Why is the product with B so much slower?

With the fairly large symmetric sparse matrix https://www.cise.ufl.edu/research/sparse/MM/Schenk_AFE/af_shell8.tar.gz, MatrixDepot downloads a Symmetric{...} matrix and a matrix-vector product never returns (I’m using Julia 0.5.2). By contrast, downloading the file directly and using MatrixMarket, the matrix-vector product is almost instantaneous (the matrix isn’t cast as Symmetric{...}).

I’m not able to test Julia 0.6 because loading the matrix using MatrixMarket never returns (see https://github.com/JuliaSparse/MatrixMarket.jl/issues/27). The same happens with MatrixDepot, which presumably relies on MatrixMarket.jl.

Thanks,


#2

I can reproduce.

The title of the post confused me at first: this happens only with sparse matrices. In the dense case, Symmetric() is about twice as fast.


#3

Firstly, you should not benchmark in global scope (this is tip number one in the performance section of the manual: https://docs.julialang.org/en/stable/manual/performance-tips/).

Secondly, I don’t understand why you are converting a dense matrix to a sparse one; that will be extremely inefficient. Only use sparse if your matrix really is very sparse.

Thirdly, I wouldn’t expect any gains from declaring A as symmetric like you are doing, since there is no way to exploit the symmetry in a simple matrix-vector product, and the special indexing will have some extra cost. It might also hinder calls to BLAS (not certain about this).

(Oh, and btw, e is bound to the mathematical constant: https://en.wikipedia.org/wiki/E_(mathematical_constant), so it’s better not to overwrite it, though you can if you really want to.)

You can do your benchmarks like this:

using BenchmarkTools

A = rand(1000, 1000)
A = A + A'
Asp = sparse(A)
Asym = Symmetric(A)
x = rand(1000)

julia> @benchmark $A * $x
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     125.456 μs (0.00% GC)
  median time:      177.903 μs (0.00% GC)
  mean time:        181.923 μs (0.00% GC)
  maximum time:     2.567 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark $Asp * $x
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     855.416 μs (0.00% GC)
  median time:      944.306 μs (0.00% GC)
  mean time:        1.001 ms (0.00% GC)
  maximum time:     2.199 ms (0.00% GC)
  --------------
  samples:          4962
  evals/sample:     1

julia> @benchmark $Asym * $x
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     206.220 μs (0.00% GC)
  median time:      209.880 μs (0.00% GC)
  mean time:        227.435 μs (0.00% GC)
  maximum time:     2.879 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

The purpose of the $ signs: https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/manual.md#interpolating-values-into-benchmark-expressions

As you can see, the ordinary dense matrix is fastest, the symmetric wrapper carries some overhead, and the sparse product is much slower, because your matrix isn’t sparse at all.


#4

Global scope does not matter here; the overhead is negligible. Symmetry does help matrix-vector products (I would assume because this is memory-bound, and symmetry cuts memory access in half). It’s completely true that sparse is a bad idea here, but it is indicative of performance problems for more reasonable setups, e.g. there’s a factor of more than 1e4 here:

using BenchmarkTools
A = sprandn(2000,2000,2e-3); A = A + A'; 
e = ones(2000);
@btime $A*$e;
B = Symmetric(A);
@btime $B * $e;

That seems issue-worthy!


#5

Oddly, I find the opposite. My results:

using BenchmarkTools
A = rand(1000,1000); A = A + A'; B = Symmetric(A); x = rand(1000);
for nthreads in [1,2,4,8]
    BLAS.set_num_threads(nthreads)
    println("nthreads=$nthreads")
    @btime $A*$x; @btime $B*$x;
end

nthreads=1
244.932 μs (1 allocation: 7.94 KiB)
96.271 μs (1 allocation: 7.94 KiB)
nthreads=2
141.135 μs (1 allocation: 7.94 KiB)
50.940 μs (1 allocation: 7.94 KiB)
nthreads=4
155.052 μs (1 allocation: 7.94 KiB)
53.440 μs (1 allocation: 7.94 KiB)
nthreads=8
127.618 μs (1 allocation: 7.94 KiB)
30.572 μs (1 allocation: 7.94 KiB)

This is on a machine with four CPU cores (plus hyperthreading) and two memory channels. Since matrix-vector multiplication is mostly memory-bound (each matrix element is read just once), I can understand the symmetric case being faster, as less memory needs to be accessed. @DNF, what is your machine’s architecture?
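As a rough back-of-the-envelope check, the nthreads=1 timings above imply effective bandwidths of:

bytes = 1000 * 1000 * 8          # matrix data: 8 MB of Float64
bytes / 244.932e-6 / 1e9         # full product:      ≈ 32.7 GB/s
(bytes / 2) / 96.271e-6 / 1e9    # symmetric product: ≈ 41.5 GB/s over half the data

Both figures are in the range of main-memory bandwidth rather than compute throughput, which fits the memory-bound picture.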


#6

This has nothing to do with global scope or with dense being faster, etc. The explanation is that:

julia> a = sprand(5,5,0.5); b = rand(5); c = similar(b);

julia> @edit A_mul_B!(c, Symmetric(a), b)

falls back to the generic matrix multiplication, which will of course be extremely slow for sparse matrices.


#7

@antoine-levitt, @kristoffer.carlsson what is up with this, then? (this is after warm-up, and is consistent on my computer.)

julia> @btime $A * $x;
  106.906 μs (1 allocation: 7.94 KiB)

julia> @time A * x;
  0.000723 seconds (5 allocations: 8.094 KiB)

In all my tests, timings were significantly slower in global scope.

Do you mean memory use? I don’t see how it helps with memory access, it’s just a remapping of the indices, right?


#8

Interesting. Here are mine:

julia> for nthreads in [1,2,4,8]
           BLAS.set_num_threads(nthreads)
           println("nthreads=$nthreads")
           @btime $A*$x; @btime $B*$x;
       end
nthreads=1
  259.607 μs (1 allocation: 7.94 KiB)
  451.926 μs (1 allocation: 7.94 KiB)
nthreads=2
  145.693 μs (1 allocation: 7.94 KiB)
  299.363 μs (1 allocation: 7.94 KiB)
nthreads=4
  110.245 μs (1 allocation: 7.94 KiB)
  206.052 μs (1 allocation: 7.94 KiB)
nthreads=8
  166.924 μs (1 allocation: 7.94 KiB)
  246.219 μs (1 allocation: 7.94 KiB)

#9

I do agree, though, that this is the main cause of the slowdown: taking a sparse matrix and wrapping it in Symmetric.

I just don’t see how it’s correct that overhead in global scope is negligible. If so, I am doing something very wrong…


#10

Yes, benchmarking in global scope will be slower, but that is not the issue here. If the function you are calling is big enough, the overhead is negligible.


#11

Timings are slower, but the actual issue is that there is no specialized method for multiplying a symmetric sparse matrix with a dense vector.

Also, the Symmetric type does not currently save memory, since it wraps a full matrix (whether sparse or dense). Ideally, there could be a specialized SparseSymmetric type that would only store the required elements.
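For illustration, here is a minimal sketch of what such a specialized kernel could look like, assuming (my own convention, not an existing API) that the sparse matrix stores exactly the upper triangle, including the diagonal, in CSC format:

function sym_upper_matvec(A::SparseMatrixCSC, x::AbstractVector)
    n = size(A, 1)
    y = zeros(promote_type(eltype(A), eltype(x)), n)
    rows = rowvals(A)
    vals = nonzeros(A)
    for j in 1:n
        for k in nzrange(A, j)
            i = rows[k]
            v = vals[k]
            y[i] += v * x[j]        # stored entry acts as A[i,j]
            if i != j
                y[j] += v * x[i]    # mirrored entry acts as A[j,i]
            end
        end
    end
    return y
end

# For a symmetric sparse Asp, sym_upper_matvec(triu(Asp), x) ≈ Asp * x,
# while storing and touching only about half the entries.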


#12

I agree with this. But my experience is that timing in global scope does not slow everything down by a fixed or proportional amount; rather, some calls are dramatically affected and others less so, leading to an unreliable impression not just of absolute performance, but also of relative performance.

Am I alone in experiencing this?


#13

No, I agree: it is generally more reliable to use BenchmarkTools, or to run a bunch of loops inside a function.


#14

@DNF: I get the same timing whether I @btime $A*$x or @time it.

As for symmetric matrix multiplication: symmetric matrices are just stored as full square matrices, with a flag telling BLAS to look only at the upper or lower triangle (as you can check by accessing B.data and B.uplo). As a result, BLAS only needs to loop over half the entries of the matrix. Since the kernel is memory-bound (most of the time is spent loading values of A from memory), this should be a speedup. It’s very strange that you get worse results with symmetric matrices; this should probably be reported as a bug in Julia or OpenBLAS.
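For example (these are the actual fields of the wrapper mentioned above):

A = rand(3, 3); B = Symmetric(A);
B.uplo        # 'U': the triangle BLAS is told to read
B.data === A  # true: the wrapper stores the full matrix, no copy is made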


#15

The underlying sparse matrix could have (structural) zeros in one of the triangular parts.
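For example, a sketch of what that would look like (the wrapper then holds about half the stored entries, though the product would still need a specialized method to be fast):

A = sprandn(1000, 1000, 1e-2); A = A + A';
Atri = triu(A)             # keep only the upper triangle (incl. the diagonal)
B = Symmetric(Atri, :U)    # view Atri as a symmetric matrix
B == Symmetric(A)          # true: same matrix, roughly half the stored entries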


#16

Yes, that could be a solution. Are there existing libraries for products with symmetric or Hermitian sparse matrices?


#17

Thanks @kristoffer.carlsson! Is there an open issue or pull request to implement sparse symmetric matrix-vector product?


#18

I don’t think so.


#19

One simple implementation that confers a speedup is (warning: not really tested):

LinAlg.A_mul_B!(y::AbstractVector, A::Symmetric, x::AbstractVector) = At_mul_B!(y, A.data, x)
This seems to be faster than `A * x`, since `At_mul_B!` is faster than `A_mul_B!`, and valid, since `A' == A`.

**Edit:** I mean to say it is faster than `A * x` for `A` sparse but not symmetric, and thus the symmetry speeds up the product.
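A quick sanity check of this workaround (Julia 0.6-era names; note it relies on `B.data` holding the full symmetric matrix):

A = sprandn(2000, 2000, 2e-3); A = A + A';   # full symmetric matrix is stored
B = Symmetric(A); x = rand(2000); y = similar(x);
At_mul_B!(y, B.data, x)   # the proposed fast path
y ≈ B * x                 # true: agrees with the slow generic fallback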

#20

I don’t think you can assume that A.data is symmetric.