```julia
julia> A = rand(1000,1000); A = A + A'; A = sparse(A);

julia> e = ones(1000);

julia> @time A * e;
  0.054203 seconds (17.26 k allocations: 990.259 KiB)

julia> @time A * e;
  0.001510 seconds (5 allocations: 8.094 KiB)

julia> @time A * e;
  0.001486 seconds (5 allocations: 8.094 KiB)

julia> B = Symmetric(A);

julia> @time B * e;
  0.220572 seconds (70.48 k allocations: 3.445 MiB)

julia> @time B * e;
  0.062014 seconds (5 allocations: 8.094 KiB)

julia> @time B * e;
  0.063677 seconds (5 allocations: 8.094 KiB)
```
Why is the product with B so much slower?
With the fairly large symmetric sparse matrix https://www.cise.ufl.edu/research/sparse/MM/Schenk_AFE/af_shell8.tar.gz, MatrixDepot downloads a Symmetric{...} matrix, and a matrix-vector product never returns (I’m using Julia 0.5.2). By contrast, if I download the file directly and read it with MatrixMarket, the matrix-vector product is almost instantaneous (the matrix isn’t wrapped as Symmetric{...}).
I’m not able to test Julia 0.6 because loading the matrix using MatrixMarket never returns (see https://github.com/JuliaSparse/MatrixMarket.jl/issues/27). The same happens with MatrixDepot, which presumably relies on MatrixMarket.jl.
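For reference, here is a minimal way to reproduce the contrast without MatrixDepot, assuming the af_shell8.mtx file from the archive above has been extracted into the working directory (`mmread` is MatrixMarket.jl's reader):

```julia
using MatrixMarket
S = MatrixMarket.mmread("af_shell8.mtx")   # plain SparseMatrixCSC
v = ones(size(S, 1))
@time S * v              # almost instantaneous
@time Symmetric(S) * v   # the slow path the MatrixDepot-loaded matrix hits
```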
Secondly, I don’t understand why you are wrapping a dense matrix in a sparse container; that will be extremely inefficient. Only use `sparse` if your matrix really is very sparse.
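To see just how non-sparse this one is (`nnz` counts stored entries):

```julia
A = rand(1000, 1000); A = A + A';
Asp = sparse(A)
nnz(Asp) / length(Asp)   # 1.0: every single entry is stored
# CSC also keeps an 8-byte row index next to every 8-byte value, so this
# uses roughly 1.5x the memory of the dense array, with slower access.
```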
Thirdly, I wouldn’t expect any gains from declaring A as symmetric the way you are doing it, since there is no way to exploit the symmetry in a simple matrix-vector product, and the special indexing will have some extra cost. It might also hinder calls to BLAS (I’m not certain about this).
(Oh, and by the way, `e` is bound to the mathematical constant e ([e (mathematical constant) - Wikipedia](https://en.wikipedia.org/wiki/E_%28mathematical_constant%29)), so it’s better not to overwrite it, though you can if you really want to.)
You can do your benchmarks like this:
```julia
using BenchmarkTools
A = rand(1000, 1000)
A = A + A'
Asp = sparse(A)
Asym = Symmetric(A)
x = rand(1000)
```
```julia
julia> @benchmark $A * $x
BenchmarkTools.Trial:
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     125.456 μs (0.00% GC)
  median time:      177.903 μs (0.00% GC)
  mean time:        181.923 μs (0.00% GC)
  maximum time:     2.567 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark $Asp * $x
BenchmarkTools.Trial:
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     855.416 μs (0.00% GC)
  median time:      944.306 μs (0.00% GC)
  mean time:        1.001 ms (0.00% GC)
  maximum time:     2.199 ms (0.00% GC)
  --------------
  samples:          4962
  evals/sample:     1

julia> @benchmark $Asym * $x
BenchmarkTools.Trial:
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     206.220 μs (0.00% GC)
  median time:      209.880 μs (0.00% GC)
  mean time:        227.435 μs (0.00% GC)
  maximum time:     2.879 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```
As you can see, the ordinary dense matrix is fastest, the symmetric wrapper carries some overhead, and the sparse product is much slower, because your matrix isn’t sparse at all.
Global scope does not matter here; the overhead is negligible. Symmetry does help matrix-vector products (I would assume because the operation is memory-bound, and symmetry halves the memory that must be read). It’s completely true that sparse is a bad idea here, but the example is indicative of performance problems for more reasonable setups; e.g., there’s a factor of more than 1e4 here:
```julia
using BenchmarkTools
A = sprandn(2000, 2000, 2e-3); A = A + A';
e = ones(2000);
@btime $A * $e;
B = Symmetric(A);
@btime $B * $e;
```
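You can confirm where the time goes by asking which method each product dispatches to (method names vary by Julia version): the plain sparse matrix hits the specialized CSC kernel, while the Symmetric wrapper falls back to a generic matvec that indexes entries one at a time.

```julia
@which A * e   # specialized SparseMatrixCSC * Vector method
@which B * e   # generic AbstractMatrix fallback: no symmetric-sparse kernel
```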
```julia
A = rand(1000, 1000); A = A + A'; B = Symmetric(A); x = rand(1000);
for nthreads in [1, 2, 4, 8]
    BLAS.set_num_threads(nthreads)
    println("nthreads = $nthreads")
    @btime $A * $x
    @btime $B * $x
end
```
This is on a machine with four CPU cores, plus hyperthreading, and two memory channels. Since a matrix-vector product is mostly memory-bound (each matrix element is read just once), I can understand if the symmetric case is faster, as less memory needs to be accessed. @DNF, what is your machine’s architecture?
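A back-of-envelope version of that argument (the bandwidth figure is an assumption; plug in your machine’s):

```julia
n = 1000
bytes_full = n^2 * sizeof(Float64)   # 8 MB: every entry read once
bytes_tri  = bytes_full ÷ 2          # 4 MB: one triangle suffices
bw = 20e9                            # assumed ~20 GB/s effective bandwidth
println(1e6 * bytes_full / bw, " μs vs ", 1e6 * bytes_tri / bw, " μs")
# ≈ 400 μs vs 200 μs: the same order of magnitude as the dense timings above
```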
Yes, benchmarking in global scope will be slower, but that is not the issue here. If the function you are calling is big enough, the overhead is negligible.
Timings are slower, but the actual issue is that there is no specialized method for multiplying a symmetric sparse matrix by a dense vector.
Also, the Symmetric type does not currently save memory, since it wraps a full matrix (whether sparse or dense). Ideally, there could be a specialized SparseSymmetric type that would only store the required elements.
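To illustrate what such a type could buy, here is a minimal sketch (hypothetical, not a library API, and untested against edge cases) of a symmetric mat-vec that reads only the upper triangle of a CSC matrix, visiting each stored entry once and applying it to both halves of the matrix:

```julia
# On Julia >= 0.7 this needs `using SparseArrays`; in 0.6 it's in Base.
function symv_upper(U::SparseMatrixCSC, x::AbstractVector)
    n = size(U, 1)
    y = zeros(promote_type(eltype(U), eltype(x)), n)
    rows, vals = rowvals(U), nonzeros(U)
    for j in 1:n
        for k in nzrange(U, j)      # stored entries of column j
            i, v = rows[k], vals[k]
            if i == j
                y[i] += v * x[j]    # diagonal entry: counted once
            elseif i < j
                y[i] += v * x[j]    # upper-triangle entry A[i,j]
                y[j] += v * x[i]    # its mirrored twin A[j,i]
            end
        end
    end
    return y
end
```

For a symmetric `A`, `symv_upper(triu(A), x)` should match `A * x` while storing roughly half the nonzeros.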
I agree with this. But my experience is that timing in global scope does not slow everything down by a fixed or proportional amount; rather, some calls are dramatically affected and others less so, leading to an unreliable impression not just of absolute performance, but also of relative performance.
@DNF: I get the same timing whether I `@btime $A*$b` or `@time` it.
As to symmetric matrix multiplication, symmetric matrices are just stored as full square matrices, with a flag telling BLAS to only look at the upper or lower triangle (as you can check by accessing `B.data` and `B.uplo`). As a result, BLAS only needs to loop over half the entries of the matrix. Since the kernel is memory-bound (most of the time is spent loading values of A from memory), this should be a speedup. It’s very strange that you get worse results with symmetric matrices, and this should probably be reported as a bug in Julia or OpenBLAS.
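For example, for the dense `A` above (these are fields of the wrapper, not a public API):

```julia
julia> B = Symmetric(A);

julia> B.uplo          # which triangle the kernels are told to read
'U'

julia> B.data === A    # the wrapper aliases the full matrix; nothing is copied
true
```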
One simple implementation that confers a speedup is (warning: not really tested):
```julia
# Forward to the transpose kernel; assumes the wrapped A.data is itself symmetric.
LinAlg.A_mul_B!(y::AbstractVector, A::Symmetric, x::AbstractVector) = At_mul_B!(y, A.data, x)
```
This seems to be faster than `A * x`, since `At_mul_B!` is faster than `A_mul_B!`, and valid, since `A' == A`.
**Edit:** I mean to say it is faster than `A * x` for `A` sparse but not symmetric, and thus the symmetry speeds up the product.
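For completeness, a usage sketch (Julia 0.6 names, same untested caveat): with the method above defined, plain `*` should now route Symmetric-wrapped sparse products through it:

```julia
Asp = sprandn(2000, 2000, 2e-3); Asp = Asp + Asp';
B = Symmetric(Asp)
x = randn(2000)
y = B * x              # should dispatch to the A_mul_B! defined above
@assert y ≈ Asp * x    # same result, since Asp' == Asp
```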