Fastest way to compute `A^n * x` for large matrices `A` and vector `x`

jlbosse · April 20, 2020, 1:26pm

I have the following two (functionally equivalent) functions

function mul1(A, xin)
    x1 = copy(xin)
    x2 = A*x1
    for i in 1:20
        x2 = A*x1
        x1, x2 = x2, x1
    end
    return x1
end

and

function mul2(A, xin)
    x1 = copy(xin)
    x2 = similar(xin)
    mul!(x2, A, x1)
    for  i in 1:20
        mul!(x2, A, x1)
        x1, x2 = x1, x2
    end
    return x1
end

where in the second one I use the more lower level mul!() function and take care of memory allocation myself while the first I let julia take care of that. I expected the second to be faster, because it needs less (only 2) memory allocations and garbage collection than the first (which needs one for each loop).

But instead I got

julia> @benchmark mul1(A, xin)

BenchmarkTools.Trial: 
  memory estimate:  206.87 MiB
  allocs estimate:  44
  --------------
  minimum time:     609.226 ms (0.65% GC)
  median time:      631.570 ms (0.29% GC)
  mean time:        631.937 ms (0.37% GC)
  maximum time:     662.402 ms (0.58% GC)
  --------------
  samples:          8
  evals/sample:     1

and

julia> @benchmark mul2(A, xin)

BenchmarkTools.Trial: 
  memory estimate:  18.81 MiB
  allocs estimate:  4
  --------------
  minimum time:     612.841 ms (0.00% GC)
  median time:      648.925 ms (0.00% GC)
  mean time:        662.252 ms (0.03% GC)
  maximum time:     745.093 ms (0.00% GC)
  --------------
  samples:          8
  evals/sample:     1

with a 600.000 x 600.000 sparse matrix A and correspondingly large xin.

Why are both functions equally fast, even though the second needs only a fraction of memory allocations and garbage collection calls?

Per · April 20, 2020, 1:43pm

Your matrices and vectors are so large that allocation and garbage collection take very little time compared to a single matrix-vector multiplication.

jlbosse · April 20, 2020, 2:03pm

Makes sense. Then I will stick with the first version because it is better legible and the performance penalty barely existent.

StefanKarpinski · April 20, 2020, 2:19pm

Also, this code keeps allocating and releasing vectors of the same size, so it’s a relatively straightforward memory pattern: keep bump allocating temporary vectors until a GC happens, then start over with probably the very same memory.

StefanKarpinski · April 20, 2020, 2:29pm

If you’re going to be doing the same computation with different vectors but the same A^n, it may be much faster to precompute that matrix and do a single matvec. If n is sufficiently large, it may be faster even if you use A^n only a single time since A^n can be computed with log₂(n) matmuls, which could be cheaper than n matvecs. There are accuracy considerations too, although it’s not obvious to me which would be more accurate.

Mason · April 21, 2020, 9:08pm

I was surprised by your result until I noticed this line. In-place sparse linear algebra has relatively few advantages over just allocating new sparse arrays, which is in sharp contrast to the situation with dense linear algebra.

Mason · April 21, 2020, 10:02pm

I don’t know if this is beneficial for you use case, but if your sparse matrices have few relevant singular values, you can take advantage of the fact that

A = U ~ S ~V^\dagger

where S is a diagonal matrix of positive definite singular values. Hence,

\begin{aligned} A^n &= U ~S ~V^\dagger ~U ~ S~V^\dagger ~...~U ~ S ~V^\dagger\\ & = U ~ S^n ~ V^\dagger \end{aligned}

so

A^n x = U ~(S^n~(V^\dagger x)))

which can be computer quite efficiently.

RoyiAvital · April 21, 2020, 10:38pm

I think your trick is great (And well known) for dense Linear Algebra.
As for most Sparse matrices U and V will be dense.
Now think of a huge Sparse Matrix (Square if you want). Those matrices will have the size of the number of Rows squared and the number of columns squared.
Hence you may even not be able to allocate them not speaking of computing them.

Mason · April 21, 2020, 10:44pm

I’m totally open to the idea that this is almost never worth it for sparse matrices, but as I said, if few singular values are needed to approximate A, then one can use something like Arpack.svds to get only a few singular values. Suppose A is 600_000 x 600_000 (as with the original problem) and you only need 5 singular values (this is just a random assumption I’m making), then Arpack can give you a U which is 600_000 x 5 and a Vt which is 5 x 600_000. This factorization will in fact fit into memory fairly well.

RoyiAvital · April 21, 2020, 10:48pm

I agree it is totally depends on the sparsity of the Singular Values.
I wonder if there are known and tight bounds on the amount of non zeros singular values given some statistics about the sparsity of the matrix.

I am not sure if it makes sense even since Identity matrix is highly sparse yet it has no vanishing singular values.

Mason · April 21, 2020, 11:02pm

Many important, gigantic sparse matrices in physics have very few singular values and I’m sure other fields have examples too. If you want an easy way to generate test matrices with n singular values, you can just take n outer products between sparse vectors.

julia> using Arpack, SparseArrays

julia> let N = 600_000, nsv = 5, ρ=0.01
           A = sum(1:nsv) do _
               sprandn(N, ρ)*sprandn(N, ρ)'
           end
           density(A::SparseMatrixCSC) = nnz(A)/length(A) 
           @show density(A)
       
           s = svds(A, nsv=2nsv)[1]
           s.S #singular values
       end
density(A) = 0.0005105040027777777
10-element Array{Float64,1}:
 6118.190119132009
 6104.728605083264
 6081.746002759836
 5991.5193616731
 5940.291756523407
    0.0
    0.0
    0.0
    0.0
    0.0

It seems that with some tests, the method I proposed is slower than the method that the OP is using, at least for the examples I can construct but it does in fact work even for ridiculously large matrices. In a similar vein to what @StefanKarpinski said, the extra work may done to do this factorization may pay off if you’re doing other operations that can be accelerated by having the factorization already computed.

jlbosse · April 22, 2020, 7:31am

I needed a fast way to compute A^n * x as part of implementing the Lanczos Algorithm, so in a way finding the singular values of the matrix is precisely what I needed for… In fact, my matrix is even hermitian, so we get

A = U S U^\dagger \quad \Rightarrow \quad A^n = U S^nU^\dagger.

But finding S is precisely what I need to compute A^n for.

jlbosse · April 22, 2020, 7:33am

Ah, I should have expressed it more clearly: While A is a sparse matrix, xin is a dense vector. Even if I started with a sparse vector xin I would get that A^n * xin is dense for sufficiently large n. Is your statement still valid in this case?

Mason · April 22, 2020, 3:54pm

Ah, I see well if you’re looking for a pure Julia implementation of the Lanczos algorithm, IterativeSolvers.jl uses it.

Yeah, that would be my expectation.

Topic		Replies	Views
Effect of Column Major Storage on Multiplication Performance	3	438	November 7, 2020
Why mul! is so fast? General Usage question , linearalgebra	7	6827	November 26, 2019
Comparing performance of 2 simple averaging functions - why is one faster? Performance	5	489	August 31, 2020
Optimizing Linear Algebra Code? Performance linearalgebra	12	1383	April 9, 2021
Simple Mat-Vec multiply (understanding performance, without the bugs) Performance tullio	16	3267	August 12, 2020

Fastest way to compute `A^n * x` for large matrices `A` and vector `x`

Related topics