Performance of `exp(A)` for 9x9 anti-Hermitian matrix: Julia vs. PyTorch vs. MATLAB (CPU & GPU)

Note that this package (SkewLinearAlgebra.jl) is mainly for exp(A) with real anti-Hermitian matrices. For the complex anti-Hermitian case, it doesn’t have any advantage over cis(Hermitian(-im*A)), IIRC.
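A quick way to check that equivalence for a complex anti-Hermitian A (a minimal sketch added here for illustration, not from the original post):

using LinearAlgebra

B = randn(ComplexF64, 9, 9)
A = B - B'                        # complex anti-Hermitian: A' == -A
exp(A) ≈ cis(Hermitian(-im * A))  # true: cis(H) == exp(im*H), and im*(-im*A) == A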

1 Like

Have a look at QuantumPropagators.Cheby, and probably use it in combination with StaticArrays. If your “batch” has a common spectral envelope (e.g., they are the same time-dependent Hamiltonian evaluated at different points in time), the propagation |Ψ(t)⟩ = exp(-i H t) |Ψ(0)⟩ boils down to a handful of matrix-vector multiplications, which should be extremely fast.

In general, calculating the application of U = exp(-i H t) to a state is more efficient than calculating the matrix U (matrix-vector operations vs. matrix-matrix operations). For very small matrices your mileage may vary, and likewise if you need to apply the exact same U to a lot of different states. But even then, an expansion into Chebychev polynomials would be an efficient way to calculate U, and it would be fairly straightforward to adapt the code to that.
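As an illustration of that last point, here is a rough sketch of building the full U = exp(-i H dt) column by column with the Chebychev propagator. It assumes cheby! overwrites its first argument with exp(-i H dt) applied to it (the same call signature as in the benchmarks below); the helper name cheby_expm is invented here:

using QuantumPropagators.Cheby

function cheby_expm(H, dt, wrk)
    N = size(H, 1)
    U = zeros(ComplexF64, N, N)
    e = zeros(ComplexF64, N)
    for k in 1:N
        fill!(e, 0)
        e[k] = 1                  # k-th canonical basis vector
        cheby!(e, H, dt, wrk)     # e ← exp(-i H dt) e  (assumed in-place semantics)
        U[:, k] .= e
    end
    return U
end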

Some benchmarks:

julia> using BenchmarkTools

julia> using QuantumPropagators.Cheby

julia> using QuantumControlTestUtils.RandomObjects: random_matrix, random_state_vector

julia> Ψ = random_state_vector(9; );

julia> H = random_matrix(9; hermitian=true); dt = 1.0;  # spectral radius ≈ 1.0

julia> @btime exp(-1im * $H * $dt) * $Ψ;
  3.724 μs (28 allocations: 14.42 KiB)

julia> A = -1im * H * dt;

julia> @btime exp($A);
  3.412 μs (22 allocations: 11.47 KiB)

julia> @btime ChebyWrk($Ψ, 2.0, -1.0, $dt);  # one-time calculations
  371.966 ns (15 allocations: 1.48 KiB)

julia> wrk = ChebyWrk(Ψ, 2.0, -1.0, dt);

julia> @btime cheby!($Ψ, $H, $dt, $wrk);
  938.360 ns (0 allocations: 0 bytes)

With StaticArrays:

julia> using StaticArrays

julia> Ψ = SVector{9}(random_state_vector(9; ));

julia> H = SMatrix{9,9}(random_matrix(9; hermitian=true));

julia> @btime exp(-1im * $H * dt) * $Ψ;
  4.613 μs (5 allocations: 4.30 KiB)

julia> @btime ChebyWrk($Ψ, 2.0, -1.0, $dt);
  358.057 ns (9 allocations: 1.27 KiB)

julia> wrk = ChebyWrk(Ψ, 2.0, -1.0, dt);

julia> @btime cheby($Ψ, $H, $dt, $wrk);
  367.823 ns (0 allocations: 0 bytes)

But these timings depend on the spectral range of your H, so you should run your own benchmarks.
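If you don’t know the spectral range up front, you can estimate it once and feed it to ChebyWrk. A hedged sketch, reusing H, Ψ, and dt from the session above and interpreting the 2.0 / -1.0 arguments as Δ = E_max - E_min and E_min:

using LinearAlgebra

E = eigvals(Hermitian(H))                    # one-time cost; a cheaper bound also works
wrk = ChebyWrk(Ψ, E[end] - E[1], E[1], dt)   # a small safety margin on the range may be advisable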

2 Likes

exp from StaticArrays.jl appears to perform extremely (and unnecessarily) poorly above tiny sizes. It could really use some love.

I got a 2x speedup on SMatrix{9, 9, Float32, 81} just by replacing the line (V - U) \ (V + U) with lu(V - U) \ (V + U) (which is to say that \ might also benefit from some changes or less inlining, since these should be essentially identical).
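For reference, the standalone comparison (a sketch that times the two expressions directly on random static matrices, not a patch to StaticArrays itself):

using StaticArrays, LinearAlgebra, BenchmarkTools

U = @SMatrix rand(Float32, 9, 9)
V = @SMatrix rand(Float32, 9, 9)

@btime ($V - $U) \ ($V + $U)      # plain backslash, as in the current exp code
@btime lu($V - $U) \ ($V + $U)    # factorize first, then solve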

With some no-op changes to the code – mostly shuffling the control flow, moving the @evalpolys to separate functions, and replacing them with evalpoly (@evalpoly’s inability to outline seems horrible with SMatrix) – I was able to further boost this to 3x while massively reducing the TTFX (30s down to 2.5s – bad but better). I didn’t assess which parts of my changes were actually responsible for improvement, though a proper evaluation should.

Outlining the Padé approximants seems particularly useful because it makes actual algorithmic changes much cleaner to implement. For example, neither StaticArrays nor LinearAlgebra coarsens the approximation for Float32, although they probably could a little. And there’s no reason for exp (StaticArrays or generic LinearAlgebra) to have a custom inline Paterson-Stockmeyer polynomial evaluation when that could be built into evalpoly for all to use. (EDIT: wrong: there is efficiency in recycling parts of the PS evaluation, but I still think having all the Padé approximants defined inline in the function is excessive.)

3 Likes

Thank you for this incredibly insightful explanation. This is a huge help and clarifies many of the confusing results I’ve been seeing. I will also add this note to the main post based on your comments.

Yeah, I know. In my experience, in many cases “pre-allocating” only gives a small advantage.

I did some tests and found that SkewLinearAlgebra.jl has an advantage for large matrices. Here is the test code:

using BenchmarkTools
using SkewLinearAlgebra, LinearAlgebra

num_threads = 1
N = 9

B = rand(N, N)
A_real = B - B'                                # real skew-symmetric (anti-Hermitian)
A_real_s = SkewHermitian(A_real)

A_imag = collect(im * Symmetric(rand(N, N)))   # complex skew-Hermitian
A_imag_s = SkewHermitian(A_imag)

println("\n--- Testing on $(num_threads) thread(s) ---")
LinearAlgebra.BLAS.set_num_threads(num_threads)
println("BLAS threads: ", LinearAlgebra.BLAS.get_num_threads())

@btime exp($A_real)     # plain dense exp, real
@btime exp($A_real_s)   # SkewLinearAlgebra exp, real
@btime exp($A_imag)     # plain dense exp, complex
@btime exp($A_imag_s)   # SkewLinearAlgebra exp, complex

Here are the results for N = 9 (9x9 matrix):

--- Testing on 1 thread(s) ---
BLAS threads: 1
  2.178 μs (9 allocations: 4.73 KiB)
  4.543 μs (28 allocations: 11.14 KiB)
  4.243 μs (9 allocations: 8.50 KiB)
  9.700 μs (18 allocations: 16.39 KiB)

Here are the results for N = 900 (900x900 matrix):

--- Testing on 1 thread(s) ---
BLAS threads: 1
  159.045 ms (15 allocations: 37.09 MiB)
  113.632 ms (40 allocations: 64.79 MiB)
  673.745 ms (15 allocations: 74.17 MiB)
  453.316 ms (24 allocations: 74.91 MiB)

Thank you both! In my real use case, I need to solve this differential equation on a 2D coordinate space:

\frac{d}{dt}|\psi(\mathbf{R}, t)\rangle = -i \hat H(\mathbf{R}, t)|\psi(\mathbf{R}, t)\rangle

where \mathbf{R} is the position vector. I’ve never used DifferentialEquations.jl before, so I need some time to think about how to use it.
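In case it helps get started, here is a minimal, hedged sketch of that setup with OrdinaryDiffEq.jl; the random 9x9 Hermitian stand-in and the names H_of_t / schroedinger! are placeholders invented for illustration, not from this thread:

using OrdinaryDiffEq, LinearAlgebra

const H0 = Hermitian(randn(ComplexF64, 9, 9))   # stand-in for your H(R, t)
H_of_t(t) = H0                                   # replace with the actual time-dependent Hamiltonian

function schroedinger!(dψ, ψ, p, t)
    mul!(dψ, H_of_t(t), ψ, -im, false)           # dψ = -i * H(t) * ψ, in place
end

ψ0 = normalize!(randn(ComplexF64, 9))
prob = ODEProblem(schroedinger!, ψ0, (0.0, 1.0))
sol = solve(prob, Tsit5(); reltol = 1e-8, abstol = 1e-8)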

In the complex case, you should compare it to cis(Hermitian(-im*A)), which takes advantage of the symmetry too, and gets roughly the same single-threaded performance for large complex skew-Hermitian A:

julia> using SkewLinearAlgebra, LinearAlgebra, BenchmarkTools

julia> n = 900; A_s = skewhermitian!(randn(ComplexF64, n, n)); A = Matrix(A_s);

julia> BLAS.set_num_threads(1)

julia> @btime exp($A); # does not exploit symmetry
  1.660 s (26 allocations: 74.27 MiB)

julia> @btime cis(Hermitian(-im * $A)); # exploits symmetry
  436.643 ms (38 allocations: 75.03 MiB)

julia> @btime exp($A_s); # exploits symmetry
  453.617 ms (47 allocations: 75.07 MiB)

As I said, the advantage of SkewLinearAlgebra is mainly for the real case (where cis(-im * A) is much more expensive because it converts real to complex):

julia> n = 900; A_s = skewhermitian!(randn(n, n)); A = Matrix(A_s); # real skew-Hermitian

julia> exp(A) ≈ real(cis(Hermitian(-im * A))) ≈ exp(A_s)
true

julia> @btime exp($A); # does not exploit symmetry
  376.459 ms (26 allocations: 37.14 MiB)

julia> @btime real(cis(Hermitian(-im * $A))); # exploits symmetry but complexifies
  442.979 ms (41 allocations: 81.22 MiB)

julia> @btime exp($A_s); # exploits symmetry, stays real
  260.892 ms (74 allocations: 64.97 MiB)

Note that DifferentialEquations.jl will be especially advantageous over explicit matrix exponentials if your Hamiltonian operator \hat{H}(t) is large and sparse (e.g. it is -\nabla^2 + V where \nabla^2 is a sparse finite-difference operator and V is diagonal) or otherwise can be multiplied by vectors quickly (e.g. parts of it are diagonal in Fourier space, as in DFT calculations in a planewave basis). Even if you wanted an explicit matrix exponential in such cases, you would be better off computing exponential–vector products by iterative methods.
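A hedged sketch of such an exponential–vector product for a large sparse H, using ExponentialUtilities.jl (a package choice made here for illustration, not discussed in the thread), where expv(t, A, b) computes exp(t*A)*b via a Krylov subspace:

using ExponentialUtilities, SparseArrays, LinearAlgebra

n = 10_000
H = spdiagm(0 => fill(2.0, n), 1 => fill(-1.0, n - 1), -1 => fill(-1.0, n - 1))  # sparse 1D Laplacian-like H
ψ0 = normalize!(randn(ComplexF64, n))
dt = 0.1

ψ = expv(dt, -im * H, ψ0)   # ≈ exp(-i * H * dt) * ψ0, without ever forming the dense exponential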

But you were testing on tiny 9x9 matrices, which is confusingly small if your real problem involves a PDE-like operator. Methods and performance characteristics for tiny dense matrices are very different from those for large sparse matrices!

4 Likes