Accelerating repeated eigenvalue searches

I need to compute the eigenvalues of 2N\times2N (N ~ 1000) complex, Hermitian, dense matrices, the number of matrices being ~100. They have the following properties:

  1. All matrices M_i take the form M_i = A + D_i, where A is common to all, and D_i are different tridiagonal matrices.
  2. The matrices are positive-semidefinite, and I am interested only in eigenvalues in the range, say, [0, 2]. I know that the number of such eigenvalues is of order 20.

Is there a way to exploit the common structure of the matrices (e.g. an advantageous factorization of A), or the desired interval, or both, to speed up the calculation of these multiple eigenspectra?


You can call eigvals(Hermitian(Mᵢ), 0, 2) to only compute eigenvalues in [0,2], for example.
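For reference, a minimal sketch of the interval-based call next to the index-based call discussed below, using a random positive-semidefinite placeholder matrix:

using LinearAlgebra

N = 1000
B = randn(ComplexF64, 2N, 2N)
M = Hermitian(B * B' / (2N))      # placeholder positive-semidefinite matrix

λ_interval = eigvals(M, 0, 2)     # only the eigenvalues lying in [0, 2]
λ_indexed  = eigvals(M, 1:20)     # the 20 smallest eigenvalues by index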

I don’t know how to exploit an A + D_i structure, however.

Does this work substantially differently from eigvals!(Hermitian(M), 1:20), which is what I have been doing so far? Asking for the ~20 lowest eigenvalues, I then just filter for those \le 2. However, my impression was that requesting a small number of eigenvalues had the same runtime irrespective of that number (so 1:5, etc., would take about as long), and comparable to requesting the full ~1000-eigenvalue spectrum.

Unfortunately, whether you ask for a few eigenvalues or all the eigenvalues, the first step in the dense eigensolver is to compute the tridiagonal/Hessenberg reduction of the full matrix, which is what takes the majority of the time.
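To see this concretely, one can time the reduction step on its own against the partial and full eigenvalue calls; a rough sketch with a random placeholder matrix (ignore the first-call compilation time):

using LinearAlgebra

N = 1000
B = randn(ComplexF64, 2N, 2N)
M = Hermitian(B * B' / (2N))

@time hessenberg(M)       # reduction of the Hermitian matrix to (sym)tridiagonal form
@time eigvals(M, 1:20)    # a partial eigensolve still pays for that reduction
@time eigvals(M)          # full spectrum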

This is the piece of information I was looking for. You can try iterative methods:

  1. Shift-invert should give you the smallest eigenvalues, and you can even ask for the first 20. See for example eigsolve in KrylovKit or ArnoldiMethod.
  2. You can also try JacobiDavidson.jl, which is kind of a Newton algorithm for finding the eigenelements; I have not tried it much in the past.

For this to work, you have to solve M_i sol = rhs fast. Now, I would say that solving (I + A^{-1}D_i) sol = A^{-1} rhs with an iterative method could be fast depending on the properties of A^{-1}D_i (do not form this matrix explicitly!). Even better, I would consider the giant block-diagonal matrix diag(M_1, ..., M_{100}) of size 200N ~ 200000. Indeed, (I_{100}\otimes A)^{-1} is simple to find :wink: and so you could try shift-invert on this big matrix.
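As a rough illustration of the shift-invert idea for a single M_i (with zero shift, using a dense factorization for the inner solve, and assuming M_i is nonsingular; the matrix is a random placeholder):

using LinearAlgebra, KrylovKit

N = 1000
B = randn(ComplexF64, 2N, 2N)
Mi = Hermitian(B * B' / (2N))             # placeholder for one M_i

F  = factorize(Mi)                        # factor once, reuse for every solve
x0 = rand(ComplexF64, 2N)                 # starting vector for the Krylov iteration

# The largest eigenvalues of Mi^{-1} correspond to the smallest eigenvalues of Mi.
vals_inv, vecs, info = eigsolve(x -> F \ x, x0, 20, :LR; ishermitian = true, krylovdim = 60)
λ = 1 ./ vals_inv                         # smallest eigenvalues of Mi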


Now, I would say that solving (I + A^{-1}D_i) sol = A^{-1} rhs with an iterative method could be fast depending on the properties of A^{-1}D_i

By “solving with iterative methods”, you mean you would expect that e.g. eigsolve(M) would converge rapidly? Otherwise, I’m not sure what you are suggesting in practice (if not the formation of A^{-1}D_i, which you warn against).

What do I benefit from considering this large matrix? I suppose it can be stored sparsely, but otherwise it seems like extra compute work. I also want to know which eigenvalue to attribute to which M_i. The inverse is as hard as inverting A, which I am not even sure is nonsingular.

you mean you would expect that e.g. eigsolve(M) would converge rapidly?

Yes, but I think eigsolve(x -> M\x) would converge faster. Also, for computing x -> M\x I would use linsolve with A serving as a preconditioner. A basic example is in this tutorial.

What do I benefit from considering this large matrix?

That is the title of your post! To try to use threading. Being block-diagonal, the eigenvalues are easy to attribute to the relevant block.

Ah! Though I imagine better threading performance would ideally come from just parallelizing over the different M_i, kept separate?

yeah I would expect that to work better.

Do you think a shift and invert (without the shift, so just invert) strategy would lead to faster convergence here? If it is guaranteed that the matrix is positive definite, then all its eigenvalues are on the positive real axis, and the range [0,2] contains extremal eigenvalues (use :SR in KrylovKit.eigsolve).
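For reference, the direct (non-inverted) call being suggested would look roughly like this, with a random positive-semidefinite placeholder matrix:

using LinearAlgebra, KrylovKit

n  = 200
B  = randn(ComplexF64, n, n)
Mi = Hermitian(B * B' / n)                # placeholder positive-semidefinite matrix

# :SR targets the smallest (real) eigenvalues directly; the Hermitian wrapper
# lets eigsolve pick the Lanczos algorithm automatically.
vals, vecs, info = eigsolve(Mi, 20, :SR; krylovdim = 60)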

it is worth a try

After some tinkering, a few comments:

  1. Perhaps I’m misapplying the preconditioner, but from what I see, cg(Mi, v; Pl = lu(A), log = true) reports isconverged == false (for a random vector v); see the sketch after this list for the exact call. The same happens when changing the LU to a QR factorization, and cg to gmres. Here I used IterativeSolvers.jl. Perhaps I should have mentioned this before, but I expect the spectrum of M_i to be dominated by D_i, whereas A is a “small perturbation” - does this make it a bad preconditioner? I am not too familiar with how to assess/pick preconditioners. (Note also that A is NOT positive definite, only the sums A + D_i.)
  2. One alternative is to use, say, M_1 as the preconditioner. This at least converges. However, it is not much faster than Mi \ v.
  3. Using KrylovKit.eigsolve(Mi, ...) is much, much faster than KrylovKit.eigsolve(v -> Mi\v, ...), and in light of the only modest gains of the linear solvers in the preceding point, I’m not sure this approach has much to gain. What was the original motivation for thinking this might be quicker?
  4. KrylovKit.eigsolve does beat LinearAlgebra.eigvals! for the ~20 eigenvalues that I need. However, KrylovKit allocates a lot of memory as my matrix size increases (I might need to go to larger N than expected). Any way to reduce that?
  5. Perhaps related to the previous item: using this example to multithread over the set M_i with nthreads = 4, (i) LinearAlgebra.eigvals! speeds up roughly 3x, whereas (ii) KrylovKit.eigsolve is 2x slower than single-threaded, coincidentally wiping out its advantage and giving comparable runtimes (but with the extra memory usage). Perhaps a memory-bandwidth issue? So from best to worst: single-threaded Krylov → (multithreaded Krylov, multithreaded eigvals) → single-threaded eigvals. Thoughts?
  6. I should have been more precise: my D_i are not just tridiagonal, they are in fact 2\times2 block-diagonal. Is that of any use?
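Regarding point 1, a minimal self-contained version of the convergence check (random placeholder matrices; with log = true, IterativeSolvers returns the solution together with a ConvergenceHistory):

using LinearAlgebra, IterativeSolvers

n  = 200
B  = randn(ComplexF64, n, n)
Mi = Hermitian(B * B' / n + I)      # placeholder for one (positive-definite) M_i
A  = (B + B') / 2                   # placeholder for the common (indefinite) part
v  = rand(ComplexF64, n)

x, history = cg(Mi, v; Pl = lu(A), log = true)
history.isconverged                 # convergence flag of the preconditioned solve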

Basically, you want P^{-1}M_i = I + K_i, where K_i is a compact operator.

Just to be sure, in your notations, M = diag(M_1,\cdots, M_n)?

No, I meant any individual M_i; now corrected. Compact K_i just in terms of “bandwidth”/sparsity for fast multiplication? Any conditions on its spectrum, etc.?

It’s certainly worth trying, but I’m not too optimistic that a nested iterative method like this will be faster than the dense eigensolver for a 1000x1000 dense matrix, unless the linsolve (preconditioned CG) converges extremely quickly (which requires your tridiagonal matrices D_i to not shift the spectrum of A + D_i much compared to A, i.e. it could work well if \Vert D_i \Vert \ll \Vert A \Vert).

Yes, in this case you want to do something completely different.

I guess you could try an iterative linear solve for M_i \backslash x preconditioned by D_i \backslash x, since a tridiagonal linear solve is linear time, and if that converges sufficiently quickly, try plugging it into eigsolve.
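A rough sketch of that suggestion, with small placeholder matrices chosen so that M_i is positive definite and D_i is a Hermitian tridiagonal that dominates A; the inner solver and its tolerances are assumptions, not a prescription:

using LinearAlgebra, IterativeSolvers, KrylovKit

n  = 200
B  = randn(ComplexF64, n, n)
A  = (B + B') / (4 * sqrt(n))                           # small Hermitian perturbation
dl = 0.1 * rand(ComplexF64, n - 1)
Dt = Tridiagonal(dl, rand(n) .+ 2 .+ 0im, conj.(dl))    # Hermitian tridiagonal D_i
Mi = Hermitian(A + Dt)

Pd = lu(Dt)                              # O(n) factorization of the tridiagonal part
innersolve(x) = cg(Mi, x; Pl = Pd)       # inner solve x -> Mi \ x, preconditioned by D_i

# Note: an inexact inner solve makes the map only approximately Hermitian/linear.
x0 = rand(ComplexF64, n)
vals_inv, vecs, info = eigsolve(innersolve, x0, 20, :LR; ishermitian = true, krylovdim = 60)
λ = 1 ./ vals_inv                        # approximate smallest eigenvalues of Mi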

Not sure why this happens, but you can disable the multithreading within KrylovKit using KrylovKit.disable_threads(). I assume there is some conflict between the threading of BLAS in the matrix-vector multiplication and the threading in KrylovKit, although the latter is purely restricted to the “native KrylovKit” parts of the code, where there are no BLAS calls or calls to the linear operator going on.
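For example, to leave all parallelism to an outer loop over the matrices, one would disable both levels (a sketch; the outer loop itself appears later in this thread):

using LinearAlgebra    # provides the BLAS submodule
using KrylovKit

KrylovKit.disable_threads()    # turn off KrylovKit's native threading
BLAS.set_num_threads(1)        # and BLAS threading inside the matrix-vector products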


It seems you suggested I can use (the LU of) D_i to precondition CG. However, it turns out this is not as rapidly convergent as preconditioning GMRES with M_1 (surprising?) [GMRES fails to converge without preconditioning, or with D_i or A as preconditioners]. This is the fastest iterative realization I found, at about half the time of Mi \ v. (Digression: for some reason Hermitian(Mi) \ v takes longer.)

However, I think this direction might be moot. I have noticed that eigsolve(inv(Mi), 20, :LR, ...) reports 6-7 iterations and ~70 “linear map applications” of the matrix multiplication for my typical use cases. Now, in terms of runtime, inv(Mi) itself takes a while - but it turns out to be less than 70 runs of preconditioned GMRES. So I might as well just compute M_i^{-1} explicitly. The subsequent runtime of eigsolve is about half that of the inversion. The total is approximately 33% of the time of the dense eigensolver LinearAlgebra.eigvals(Hermitian(Mi), 1:20).

Finally: is the inversion trick worth it? Yes! eigsolve(Mi, 20, :SR, ...) takes ~50 iterations and 430 linear map applications. Since this is more than 3\times 70, it costs more than the combined explicit inversion plus running eigsolve on the inverse. Thanks for all the comments, I sure learned a lot!
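For concreteness, the inversion route described above looks roughly like this (placeholder matrix; re-wrapping inv(Mi) in Hermitian is just a precaution against roundoff):

using LinearAlgebra, KrylovKit

n  = 200
B  = randn(ComplexF64, n, n)
Mi = Hermitian(B * B' / n + 0.01I)        # placeholder positive-definite matrix

Minv = Hermitian(inv(Mi))                 # explicit inverse
vals_inv, vecs, info = eigsolve(Minv, 20, :LR; krylovdim = 60)

λ = 1 ./ vals_inv                         # smallest eigenvalues of Mi
λ_kept = filter(x -> 0 <= x <= 2, λ)      # keep only the window of interest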

One final point: another thing I forgot to mention is that the ~100 different D_i are essentially a sampling of a continuous matrix family D(x). Therefore, I expect the eigenspectrum and eigenvectors to change only a little between M_i and M_{i+1}. Is there a way to feed all 20 eigenvectors as initial guesses to some iterative solver? It seems to me that at least KrylovKit.eigsolve only takes one initial guess vector. Perhaps looping over 20 calls to ArnoldiMethod.partialschur!? If this dramatically reduces the number of iterations/linear maps for M_{2,3,...}, perhaps the cost of the inversion can be avoided.
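To illustrate the single-guess interface: eigsolve accepts one starting vector, so the most direct thing to try is seeding the next problem with a vector built from the previous eigenvectors; whether this actually reduces the iteration count is exactly the open question here (placeholder matrices, and the choice of seed is an assumption):

using LinearAlgebra, KrylovKit

n  = 200
B  = randn(ComplexF64, n, n)
M1 = Hermitian(B * B' / n + 0.01I)                   # placeholder M_i
M2 = Hermitian(M1 + 1e-3 * Diagonal(rand(n)))        # placeholder "nearby" M_{i+1}

# First problem, solved from a random start.
vals1, vecs1, _ = eigsolve(Hermitian(inv(M1)), rand(ComplexF64, n), 20, :LR; krylovdim = 60)

# Seed the next problem with a combination of the previous eigenvectors
# (a single vector is all that eigsolve accepts as an initial guess).
x0 = normalize(sum(vecs1))
vals2, vecs2, _ = eigsolve(Hermitian(inv(M2)), x0, 20, :LR; krylovdim = 60)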

Thanks for the tip. Even when I disable KrylovKit’s threads, single-threaded is sometimes faster. However, it seems I regain the performance advantage of multithreading if the matrices are small enough, say below 400x400 (dense) instead of 1000x1000. As the matrix size increases this benefit vanishes, and sometimes there is a slight regression. Once I also set BLAS.set_num_threads(1), I finally see the performance separation between my own higher-level multithreading and the rest (it’s mostly my “single-thread” version that slows down). So perhaps the issue is how BLAS decides to multithread over matrices of size ~1000. Curiously, the benefit I get from multithreading (still with single-threaded BLAS) is closer to 2.0x-2.5x, instead of the 4x I’d expect with Threads.nthreads() == 4.

What is “your own multithreading” in this case? You mean an outer threaded for loop over the ~100 matrices that you need to find the eigenvalues of? In that case I would indeed expect a speedup proportional to the number of threads, as the overhead of the threading should be small compared to the cost of the computation. However, it is indeed important then to disable threading at all lower levels, i.e. both in KrylovKit and in BLAS.

My solution follows the Julia manual. Here system is a struct that defines the physical system I am solving and contains the precalculated matrix A. xvals are the values of the parameter with which I sample the block-diagonals D_i \in \lbrace D(x) \mid x \in \text{xvals}\rbrace (see my previous post). These block-diagonals are computed (very cheaply) on the fly inside _computeSpectrum_singlecore. args passes on my desired number of eigenvalues per value of x, as well as a flag for which solver to use: LinearAlgebra (direct), KrylovKit, or ArnoldiMethod (shift-invert and :LR).

Right now I have disabled KrylovKit’s threading globally to simplify comparisons, so treat it as off.

After manually inverting, it seems KrylovKit and ArnoldiMethod give similar runtime performance, with the latter using fewer allocations, so I am leaning towards making that the default.

using LinearAlgebra: BLAS

function computeSpectrum(system, xvals; multicore = true, args...)
    # TODO: consider a heuristic for when to multithread and when not to
    if multicore
        blas_threads = BLAS.get_num_threads()
        BLAS.set_num_threads(1)  # each spawned task should be single-threaded
        try
            # split xvals into at most nthreads() chunks, one task per chunk
            chunklen = cld(length(xvals), Threads.nthreads())
            kchunks = Iterators.partition(xvals, chunklen)
            tasks = map(kchunks) do chunk
                Threads.@spawn _computeSpectrum_singlecore(system, chunk; args...)
            end
            spectra = fetch.(tasks)
            return vcat(spectra...)
        finally
            # restore the BLAS thread count even if a task throws
            BLAS.set_num_threads(blas_threads)
        end
    else
        return _computeSpectrum_singlecore(system, xvals; args...)
    end
end
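Hypothetical usage, just to show the calling pattern (the keyword names neigs and solver are placeholders for whatever args actually forwards; system is assumed to have been constructed elsewhere):

xvals = range(0.0, 1.0; length = 100)
spectra = computeSpectrum(system, xvals; multicore = true, neigs = 20, solver = :arnoldi)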

@juthohaegeman Any thoughts on a way to pass the set of eigenvectors of M_i as the initial guesses to eigsolve for M_{i+1}?