Performance of `exp(A)` for 9x9 anti-Hermitian matrix: Julia vs. PyTorch vs. MATLAB (CPU & GPU)

Note that this package (SkewLinearAlgebra.jl) is mainly for exp(A) with real anti-Hermitian matrices. For the complex anti-Hermitian case, it doesn’t have any advantage over cis(Hermitian(-im*A)), IIRC.
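A quick way to check that equivalence for a complex anti-Hermitian A (a minimal sketch added here for illustration, not from the original post):

using LinearAlgebra

B = randn(ComplexF64, 9, 9)
A = B - B'                        # complex anti-Hermitian: A' == -A
exp(A) ≈ cis(Hermitian(-im * A))  # true: cis(H) == exp(im*H), and im*(-im*A) == A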

1 Like

Have a look at QuantumPropagators.Cheby, and probably use it in combination with StaticArrays. If your “batch” has a common spectral envelope (e.g., they are the same time-dependent Hamiltonian evaluated at different points in time), the propagation |Ψ(t)⟩ = exp(-i H t) |Ψ(0)⟩ boils down to a handful of matrix-vector multiplications, which should be extremely fast.

In general, calculating the application of U = exp(-i H t) to a state is more efficient than calculating the matrix U (matrix-vector operations vs. matrix-matrix operations). For very small matrices your mileage may vary, and likewise if you need to apply the exact same U to a lot of different states. But even then, an expansion into Chebychev polynomials would be an efficient way to calculate U, and it would be fairly straightforward to adapt the code to that.
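As an illustration of that last point, here is a rough sketch of building the full U = exp(-i H dt) column by column with the Chebychev propagator. It assumes cheby! overwrites its first argument with exp(-i H dt) applied to it (the same call signature as in the benchmarks below); the helper name cheby_expm is invented here:

using QuantumPropagators.Cheby

function cheby_expm(H, dt, wrk)
    N = size(H, 1)
    U = zeros(ComplexF64, N, N)
    e = zeros(ComplexF64, N)
    for k in 1:N
        fill!(e, 0)
        e[k] = 1                  # k-th canonical basis vector
        cheby!(e, H, dt, wrk)     # e ← exp(-i H dt) e  (assumed in-place semantics)
        U[:, k] .= e
    end
    return U
end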

Some benchmarks:

julia> using BenchmarkTools

julia> using QuantumPropagators.Cheby

julia> using QuantumControlTestUtils.RandomObjects: random_matrix, random_state_vector

julia> Ψ = random_state_vector(9; );

julia> H = random_matrix(9; hermitian=true); dt = 1.0;  # spectral radius ≈ 1.0

julia> @btime exp(-1im * $H * $dt) * $Ψ;
  3.724 μs (28 allocations: 14.42 KiB)

julia> A = -1im * H * dt;

julia> @btime exp($A);
  3.412 μs (22 allocations: 11.47 KiB)

julia> @btime ChebyWrk($Ψ, 2.0, -1.0, $dt);  # one-time calculations
  371.966 ns (15 allocations: 1.48 KiB)

julia> wrk = ChebyWrk(Ψ, 2.0, -1.0, dt);

julia> @btime cheby!($Ψ, $H, $dt, $wrk);
  938.360 ns (0 allocations: 0 bytes)

With StaticArrays:

julia> using StaticArrays

julia> Ψ = SVector{9}(random_state_vector(9; ));

julia> H = SMatrix{9,9}(random_matrix(9; hermitian=true));

julia> @btime exp(-1im * $H * dt) * $Ψ;
  4.613 μs (5 allocations: 4.30 KiB)

julia> @btime ChebyWrk($Ψ, 2.0, -1.0, $dt);
  358.057 ns (9 allocations: 1.27 KiB)

julia> wrk = ChebyWrk(Ψ, 2.0, -1.0, dt);

julia> @btime cheby($Ψ, $H, $dt, $wrk);
  367.823 ns (0 allocations: 0 bytes)

But these timings depend on the spectral range of your H, so you should run your own benchmarks.
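If you don’t know the spectral range up front, you can estimate it once and feed it to ChebyWrk. A hedged sketch, reusing H, Ψ, and dt from the session above and interpreting the 2.0 / -1.0 arguments as Δ = E_max - E_min and E_min:

using LinearAlgebra

E = eigvals(Hermitian(H))                    # one-time cost; a cheaper bound also works
wrk = ChebyWrk(Ψ, E[end] - E[1], E[1], dt)   # a small safety margin on the range may be advisable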

2 Likes

exp from StaticArrays.jl appears to perform extremely (and unnecessarily) poorly above tiny sizes. It could really use some love.

I got a 2x speedup on SMatrix{9, 9, Float32, 81} just by replacing the line (V - U) \ (V + U) with lu(V - U) \ (V + U) (which is to say that \ might also benefit from some changes or less inlining, since these should be essentially identical).
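For reference, the standalone comparison (a sketch that times the two expressions directly on random static matrices, not a patch to StaticArrays itself):

using StaticArrays, LinearAlgebra, BenchmarkTools

U = @SMatrix rand(Float32, 9, 9)
V = @SMatrix rand(Float32, 9, 9)

@btime ($V - $U) \ ($V + $U)      # plain backslash, as in the current exp code
@btime lu($V - $U) \ ($V + $U)    # factorize first, then solve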

With some no-op changes to the code – mostly shuffling the control flow, moving the @evalpolys to separate functions, and replacing them with evalpoly (@evalpoly’s inability to outline seems horrible with SMatrix) – I was able to further boost this to 3x while massively reducing the TTFX (30s down to 2.5s – bad but better). I didn’t assess which parts of my changes were actually responsible for improvement, though a proper evaluation should.

Outlining the Padé approximants seems particularly useful because it makes actual algorithmic changes much cleaner to implement. For example, neither StaticArrays nor LinearAlgebra coarsens the approximation for Float32, although they probably could a little. And there’s no reason for exp (StaticArrays or generic LinearAlgebra) to have a custom inline Paterson-Stockmeyer polynomial evaluation when that could be built into evalpoly for all to use. (EDIT: wrong: there is efficiency in recycling parts of the PS evaluation, but I still think having all the Padé approximants defined inline in the function is excessive.)

3 Likes

Thank you for this incredibly insightful explanation. This is a huge help and clarifies many of the confusing results I’ve been seeing. I will also add this note to the main post based on your comments.

Yeah, I know. In my experience, in many cases “pre-allocating” only gives a small advantage.

I did some tests and found that SkewLinearAlgebra.jl has an advantage for large matrices. Here is the test code:

using BenchmarkTools
using SkewLinearAlgebra, LinearAlgebra

num_threads = 1
N = 9

B = rand(N, N)
A_real = B - B'                                # real skew-symmetric (anti-Hermitian)
A_real_s = SkewHermitian(A_real)

A_imag = collect(im * Symmetric(rand(N, N)))   # complex skew-Hermitian
A_imag_s = SkewHermitian(A_imag)

println("\n--- Testing on $(num_threads) thread(s) ---")
LinearAlgebra.BLAS.set_num_threads(num_threads)
println("BLAS threads: ", LinearAlgebra.BLAS.get_num_threads())

@btime exp($A_real)     # plain dense exp, real
@btime exp($A_real_s)   # SkewLinearAlgebra exp, real
@btime exp($A_imag)     # plain dense exp, complex
@btime exp($A_imag_s)   # SkewLinearAlgebra exp, complex

Here are the results for N = 9 (9x9 matrix):

--- Testing on 1 thread(s) ---
BLAS threads: 1
  2.178 μs (9 allocations: 4.73 KiB)
  4.543 μs (28 allocations: 11.14 KiB)
  4.243 μs (9 allocations: 8.50 KiB)
  9.700 μs (18 allocations: 16.39 KiB)

Here are the results for N = 900 (900x900 matrix):

--- Testing on 1 thread(s) ---
BLAS threads: 1
  159.045 ms (15 allocations: 37.09 MiB)
  113.632 ms (40 allocations: 64.79 MiB)
  673.745 ms (15 allocations: 74.17 MiB)
  453.316 ms (24 allocations: 74.91 MiB)

Thank you both! In my real use case, I need to solve this differential equation on a 2D coordinate space:

\frac{d}{dt}|\psi(\mathbf{R}, t)\rangle = -i \hat H(\mathbf{R}, t)|\psi(\mathbf{R}, t)\rangle

where \mathbf{R} is the position vector. I’ve never used DifferentialEquations.jl before, so I need some time to think about how to use it.
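In case it helps get started, here is a minimal, hedged sketch of that setup with OrdinaryDiffEq.jl; the random 9x9 Hermitian stand-in and the names H_of_t / schroedinger! are placeholders invented for illustration, not from this thread:

using OrdinaryDiffEq, LinearAlgebra

const H0 = Hermitian(randn(ComplexF64, 9, 9))   # stand-in for your H(R, t)
H_of_t(t) = H0                                   # replace with the actual time-dependent Hamiltonian

function schroedinger!(dψ, ψ, p, t)
    mul!(dψ, H_of_t(t), ψ, -im, false)           # dψ = -i * H(t) * ψ, in place
end

ψ0 = normalize!(randn(ComplexF64, 9))
prob = ODEProblem(schroedinger!, ψ0, (0.0, 1.0))
sol = solve(prob, Tsit5(); reltol = 1e-8, abstol = 1e-8)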

In the complex case, you should compare it to cis(Hermitian(-im*A)), which takes advantage of the symmetry too, and gets roughly the same single-threaded performance for large complex skew-Hermitian A:

julia> using SkewLinearAlgebra, LinearAlgebra, BenchmarkTools

julia> n = 900; A_s = skewhermitian!(randn(ComplexF64, n, n)); A = Matrix(A_s);

julia> BLAS.set_num_threads(1)

julia> @btime exp($A); # does not exploit symmetry
  1.660 s (26 allocations: 74.27 MiB)

julia> @btime cis(Hermitian(-im * $A)); # exploits symmetry
  436.643 ms (38 allocations: 75.03 MiB)

julia> @btime exp($A_s); # exploits symmetry
  453.617 ms (47 allocations: 75.07 MiB)

As I said, the advantage of SkewLinearAlgebra is mainly for the real case (where cis(-im * A) is much more expensive because it converts real to complex):

julia> n = 900; A_s = skewhermitian!(randn(n, n)); A = Matrix(A_s); # real skew-Hermitian

julia> exp(A) ≈ real(cis(Hermitian(-im * A))) ≈ exp(A_s)
true

julia> @btime exp($A); # does not exploit symmetry
  376.459 ms (26 allocations: 37.14 MiB)

julia> @btime real(cis(Hermitian(-im * $A))); # exploits symmetry but complexifies
  442.979 ms (41 allocations: 81.22 MiB)

julia> @btime exp($A_s); # exploits symmetry, stays real
  260.892 ms (74 allocations: 64.97 MiB)

Note that DifferentialEquations.jl will be especially advantageous over explicit matrix exponentials if your Hamiltonian operator \hat{H}(t) is large and sparse (e.g. it is -\nabla^2 + V where \nabla^2 is a sparse finite-difference operator and V is diagonal) or otherwise can be multiplied by vectors quickly (e.g. parts of it are diagonal in Fourier space, as in DFT calculations in a planewave basis). Even if you wanted an explicit matrix exponential in such cases, you would be better off computing exponential–vector products by iterative methods.
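A hedged sketch of such an exponential–vector product for a large sparse H, using ExponentialUtilities.jl (a package choice made here for illustration, not discussed in the thread), where expv(t, A, b) computes exp(t*A)*b via a Krylov subspace:

using ExponentialUtilities, SparseArrays, LinearAlgebra

n = 10_000
H = spdiagm(0 => fill(2.0, n), 1 => fill(-1.0, n - 1), -1 => fill(-1.0, n - 1))  # sparse 1D Laplacian-like H
ψ0 = normalize!(randn(ComplexF64, n))
dt = 0.1

ψ = expv(dt, -im * H, ψ0)   # ≈ exp(-i * H * dt) * ψ0, without ever forming the dense exponential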

But you were testing on tiny 9x9 matrices, which is confusingly small if your real problem involves a PDE-like operator. Methods and performance characteristics for tiny dense matrices are very different from those for large sparse matrices!

4 Likes