Fastest way to compute `A^n * x` for large matrices `A` and vector `x`

I have the following two (functionally equivalent) functions

function mul1(A, xin)
    x1 = copy(xin)
    x2 = A*x1
    for i in 1:20
        x2 = A*x1
        x1, x2 = x2, x1
    end
    return x1
end

and

function mul2(A, xin)
    x1 = copy(xin)
    x2 = similar(xin)
    mul!(x2, A, x1)
    for  i in 1:20
        mul!(x2, A, x1)
        x1, x2 = x1, x2
    end
    return x1
end

where in the second one I use the more lower level mul!() function and take care of memory allocation myself while the first I let julia take care of that. I expected the second to be faster, because it needs less (only 2) memory allocations and garbage collection than the first (which needs one for each loop).

But instead I got

julia> @benchmark mul1(A, xin)

BenchmarkTools.Trial: 
  memory estimate:  206.87 MiB
  allocs estimate:  44
  --------------
  minimum time:     609.226 ms (0.65% GC)
  median time:      631.570 ms (0.29% GC)
  mean time:        631.937 ms (0.37% GC)
  maximum time:     662.402 ms (0.58% GC)
  --------------
  samples:          8
  evals/sample:     1

and

julia> @benchmark mul2(A, xin)

BenchmarkTools.Trial: 
  memory estimate:  18.81 MiB
  allocs estimate:  4
  --------------
  minimum time:     612.841 ms (0.00% GC)
  median time:      648.925 ms (0.00% GC)
  mean time:        662.252 ms (0.03% GC)
  maximum time:     745.093 ms (0.00% GC)
  --------------
  samples:          8
  evals/sample:     1

with a 600.000 x 600.000 sparse matrix A and correspondingly large xin.

Why are both functions equally fast, even though the second needs only a fraction of memory allocations and garbage collection calls?

Your matrices and vectors are so large that allocation and garbage collection take very little time compared to a single matrix-vector multiplication.

2 Likes

Makes sense. Then I will stick with the first version because it is better legible and the performance penalty barely existent.

Also, this code keeps allocating and releasing vectors of the same size, so it’s a relatively straightforward memory pattern: keep bump allocating temporary vectors until a GC happens, then start over with probably the very same memory.

If you’re going to be doing the same computation with different vectors but the same A^n, it may be much faster to precompute that matrix and do a single matvec. If n is sufficiently large, it may be faster even if you use A^n only a single time since A^n can be computed with log₂(n) matmuls, which could be cheaper than n matvecs. There are accuracy considerations too, although it’s not obvious to me which would be more accurate.

2 Likes

I was surprised by your result until I noticed this line. In-place sparse linear algebra has relatively few advantages over just allocating new sparse arrays, which is in sharp contrast to the situation with dense linear algebra.

I don’t know if this is beneficial for you use case, but if your sparse matrices have few relevant singular values, you can take advantage of the fact that

A = U ~ S ~V^\dagger

where S is a diagonal matrix of positive definite singular values. Hence,

\begin{aligned} A^n &= U ~S ~V^\dagger ~U ~ S~V^\dagger ~...~U ~ S ~V^\dagger\\ & = U ~ S^n ~ V^\dagger \end{aligned}

so

A^n x = U ~(S^n~(V^\dagger x)))

which can be computer quite efficiently.

3 Likes

I think your trick is great (And well known) for dense Linear Algebra.
As for most Sparse matrices U and V will be dense.
Now think of a huge Sparse Matrix (Square if you want). Those matrices will have the size of the number of Rows squared and the number of columns squared.
Hence you may even not be able to allocate them not speaking of computing them.

I’m totally open to the idea that this is almost never worth it for sparse matrices, but as I said, if few singular values are needed to approximate A, then one can use something like Arpack.svds to get only a few singular values. Suppose A is 600_000 x 600_000 (as with the original problem) and you only need 5 singular values (this is just a random assumption I’m making), then Arpack can give you a U which is 600_000 x 5 and a Vt which is 5 x 600_000. This factorization will in fact fit into memory fairly well.

I agree it is totally depends on the sparsity of the Singular Values.
I wonder if there are known and tight bounds on the amount of non zeros singular values given some statistics about the sparsity of the matrix.

I am not sure if it makes sense even since Identity matrix is highly sparse yet it has no vanishing singular values.

Many important, gigantic sparse matrices in physics have very few singular values and I’m sure other fields have examples too. If you want an easy way to generate test matrices with n singular values, you can just take n outer products between sparse vectors.

julia> using Arpack, SparseArrays

julia> let N = 600_000, nsv = 5, ρ=0.01
           A = sum(1:nsv) do _
               sprandn(N, ρ)*sprandn(N, ρ)'
           end
           density(A::SparseMatrixCSC) = nnz(A)/length(A) 
           @show density(A)
       
           s = svds(A, nsv=2nsv)[1]
           s.S #singular values
       end
density(A) = 0.0005105040027777777
10-element Array{Float64,1}:
 6118.190119132009
 6104.728605083264
 6081.746002759836
 5991.5193616731
 5940.291756523407
    0.0
    0.0
    0.0
    0.0
    0.0

It seems that with some tests, the method I proposed is slower than the method that the OP is using, at least for the examples I can construct but it does in fact work even for ridiculously large matrices. In a similar vein to what @StefanKarpinski said, the extra work may done to do this factorization may pay off if you’re doing other operations that can be accelerated by having the factorization already computed.

I needed a fast way to compute A^n * x as part of implementing the Lanczos Algorithm, so in a way finding the singular values of the matrix is precisely what I needed for… In fact, my matrix is even hermitian, so we get

A = U S U^\dagger \quad \Rightarrow \quad A^n = U S^nU^\dagger.

But finding S is precisely what I need to compute A^n for.

Ah, I should have expressed it more clearly: While A is a sparse matrix, xin is a dense vector. Even if I started with a sparse vector xin I would get that A^n * xin is dense for sufficiently large n. Is your statement still valid in this case?

Ah, I see well if you’re looking for a pure Julia implementation of the Lanczos algorithm, IterativeSolvers.jl uses it.

Yeah, that would be my expectation.