Performance of `exp(A)` for 9x9 anti-Hermitian matrix: Julia vs. PyTorch vs. MATLAB (CPU & GPU)

Hi everyone,

I’m a Ph.D. student in physics. A core part of my research involves simulating the time evolution of quantum systems, which boils down to frequently calculating the matrix exponential U = exp(A).

For these simulations, the matrix A has a specific structure: it is anti-Hermitian. In my test case I construct it as A = i\hat{H}, where \hat{H} is a real, symmetric 9x9 matrix with a zero diagonal. This operation is the computational bottleneck in many of my simulations, so I’m trying to find the most performant way to run a large batch of these calculations.

I’ve been benchmarking Julia’s performance against PyTorch and MATLAB. I wanted to share my findings. Any advice using any tool or language would be appreciated!

PS: What I have tried for Julia:

  • ExponentialUtilities.expv (Krylov subspace method): Slower than the standard library exp.
  • StaticArrays.jl (SMatrix type): Also slower than exp(A).
    • Maybe 9x9 is not small enough for it to pay off.
  • SkewLinearAlgebra.jl (SkewHermitian type): Also slower than the standard library exp(A).
    • Optimized for large matrices, e.g. 900x900.
  • FastExpm.jl: Also slower than the standard library exp(A).
  • ExponentialUtilities.exponential!(A) & LinearAlgebra.exp!(A): Performance is similar to exp(A).
  • Wrapping A in the Symmetric type: Performance is similar to exp(A).
  • cis(H): Performance is similar to exp(iH).
    • Optimized for large matrices, e.g. 900x900.
  • CUDA.jl: The standard library LinearAlgebra.exp is not GPU-compatible for general matrix sizes due to internal checks that perform disallowed scalar indexing. This forced me to write a manual workaround (a minimal sketch of one possible workaround follows this list).
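
For illustration, a minimal sketch of one possible workaround: round-trip the batch through the host and call exp slice by slice. This only sidesteps the scalar-indexing errors; it is not a performance win.

using CUDA, LinearAlgebra

# Correctness-only workaround sketch: move the batch to the host, exponentiate
# each 9x9 slice with LinearAlgebra.exp, then copy the results back.
function exp_batch_via_host(A::CuArray{ComplexF64,3})
    Ah = Array(A)              # one device-to-host transfer for the whole batch
    Uh = similar(Ah)
    for k in axes(Ah, 3)
        Uh[:, :, k] = exp(Ah[:, :, k])
    end
    return CuArray(Uh)         # one host-to-device transfer back
end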

Benchmark Setup

  • Matrix A: 9x9 dense, purely imaginary, symmetric (hence anti-Hermitian), with a zero diagonal.
  • Operation: Batched matrix exponential.
  • Batch Size (loop): 524,288 operations (loop = 2 * 2^9 * 2^9).
  • CPU: AMD R9-9950X
  • GPU: Nvidia GeForce RTX 5070 Ti
  • Versions:
    • Julia: 1.10.5
    • PyTorch: 2.9.0.dev20250716+cu129
    • MATLAB: R2023b

Results

Here are the results from my machine. The total time covers all loop = 2 * 2^9 * 2^9 = 524,288 exponentials, converted to seconds for easier comparison.

Language / Framework    Device               Total Time (s)
PyTorch                 GPU                  0.55
Julia                   CPU (single core)    2.57
MATLAB                  CPU (single core)    6.93

Specific Questions for the Community

  1. For Julia Experts: Is there a better way to perform a batched matrix exponential on the GPU? Am I missing a library or a technique that provides a batched expm?

  2. For MATLAB Experts: Is the manual for loop over GPU slices the idiomatic way to handle this, or does a more “vectorized” or batched function exist that I’m unaware of?

  3. For Python/PyTorch Experts: Setting aside the hardware mismatch, PyTorch is the winner here, but is it the performance ceiling for the Python ecosystem? I’m particularly interested in:

    • Alternative Frameworks: How would JAX (with its XLA compiler) or CuPy perform on this task? Would jax.scipy.linalg.expm or a CuPy equivalent potentially offer lower overhead or faster execution?
    • PyTorch Optimizations: Am I missing any performance tricks within PyTorch itself? For a problem of this scale, are there lower-level approaches (like using Triton) that could yield further speedups?
  4. General Algorithmic Advice: Given that my matrix A is always anti-Hermitian (meaning exp(A) is unitary), are there specialized algorithms I should be using instead of a general-purpose expm? Could this property be exploited for a significant performance gain in any of these languages? (A small sketch of the spectral approach I have in mind follows this list.)
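
For context on question 4: if A = iH with H real symmetric, one eigendecomposition H = VΛV^T gives exp(A) = V diag(e^{iλ}) V^T directly. A minimal sketch of this spectral approach (essentially what cis does for Hermitian input):

using LinearAlgebra

# Sketch: exp(i*H) = V * Diagonal(cis.(λ)) * V' from H = V * Diagonal(λ) * V'.
function unitary_exp(H::AbstractMatrix{<:Real})
    λ, V = eigen(Symmetric(H))
    return V * Diagonal(cis.(λ)) * V'
end

H = rand(9, 9); H = H + transpose(H)   # real symmetric test matrix
@assert unitary_exp(H) ≈ exp(im * H)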


Code

The code for all three languages is designed to be as equivalent as possible.

PyTorch Code
import torch
import time
import numpy as np

# 1. Set test parameters
N = 9
# In PyTorch, 'loop' is the batch size
loop = 2 * 2**9 * 2**9  # 2 * 2^9 * 2^9 = 524,288

# Auto-select device (GPU preferred)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

print(f"Testing {N}x{N} matrix exponential, Batch Size: {loop}")
print(f"Using device: {device.upper()}")

# Create a batch of purely imaginary, symmetric matrices with zero diagonals.
A = torch.rand(loop, N, N, dtype=torch.float64, device=device) # 1. Start with a batch of random real matrices.
A.diagonal(dim1=-2, dim2=-1).zero_()                            # 2. Set the main diagonal of each matrix to zero.
A = A + A.transpose(-2, -1)                                    # 3. Make each matrix symmetric (A = A^T).
A = A * 1j                                                     # 4. Make each matrix purely imaginary.

# 2. Define the test function (operates on the whole batch)
def test_exp_batch(matrices):
    # torch.matrix_exp handles a batch of matrices directly
    return torch.matrix_exp(matrices)

# 3. Run the test
print("\n--- Starting performance test ---")

# First run for warmup and JIT/CUDA compilation
_ = test_exp_batch(A)

# If using GPU, synchronize to ensure completion
if device == 'cuda':
    torch.cuda.synchronize()

# Second run for accurate timing
print(f"\nPyTorch exp (batch, {device.upper()}) time:")
start_run_time = time.perf_counter()

# Perform the core computation
result = test_exp_batch(A)

# Synchronize again for accurate GPU timing
if device == 'cuda':
    torch.cuda.synchronize()

end_run_time = time.perf_counter()

print(f"\nTotal time: {end_run_time - start_run_time:.6f} seconds")
Julia Code
using LinearAlgebra

# 1. Set test parameters
num_threads = 1
N = 9
const loop = 2 * 2^9 * 2^9 # 2 * 2^9 * 2^9 = 524,288

println("Testing $N x $N matrix exponential, $loop loops")
# Create a purely imaginary, symmetric matrix with a zero diagonal.
A = rand(N, N)      # 1. Start with a random real matrix.
for i = 1:size(A,1)
    A[i,i] = 0      # 2. Set the main diagonal to zero.
end
A = A + transpose(A) # 3. Make it symmetric (A = A^T).
A = im*A             # 4. Make it purely imaginary.

# 2. Define the test function
function test_exp(A, loop)
    for i in 1:loop
        exp(A) # exp() on a matrix is the matrix exponential
    end
end

# 3. Set thread count and run the test
println("\n--- Testing on $(num_threads) thread(s) ---")
LinearAlgebra.BLAS.set_num_threads(num_threads)
println("BLAS threads: ", LinearAlgebra.BLAS.get_num_threads())

# First run for warmup and JIT compilation
@time test_exp(A, loop);

# Second run for accurate timing
println("\nJulia exp ($(num_threads) thread(s)) time:")
@time test_exp(A, loop);
MATLAB Code
clear;

% 1. Set test parameters
num_threads = 1;
N = 9;
loop = 2 * 2^9 * 2^9; % 2 * 2^9 * 2^9 = 524,288

fprintf('Testing %dx%d matrix exponential, %d loops\n', N, N, loop);

% 2. Create a purely imaginary, symmetric matrix with a zero diagonal.
A = rand(N, N);      % 1. Start with a random real matrix.
A(1:N+1:end) = 0;    % 2. Set the main diagonal to zero.
A = A + A.';         % 3. Make it symmetric (A = A^T).
A = 1i * A;          % 4. Make it purely imaginary.

% 3. Set thread count and display
maxNumCompThreads(num_threads);
fprintf('\n--- Testing on %d thread(s) ---\n', maxNumCompThreads);

% 4. Run the test
% First run for warmup
B = expm(A); % In MATLAB, matrix exponential is expm()

% Second run for accurate timing
fprintf('\nTiming...\n');
tic; % Start timer
for i = 1:loop
    B = expm(A);
end
elapsedTime = toc; % Stop timer

fprintf('MATLAB expm (%d thread(s)) time: %f seconds\n', maxNumCompThreads, elapsedTime);

Update based on community feedback

User @yolhan_mannes pointed out a key reason for the performance difference: PyTorch utilizes a highly optimized batched implementation for matrix_exp on the GPU, while the Julia ecosystem may not have a readily available equivalent yet.

The suggested path forward is to implement a batched version of the standard “Padé Approximant + Scaling & Squaring” algorithm, likely using custom CUDA kernels.
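
To make that concrete, here is a rough sketch of the shape such a batched routine could take. It substitutes a truncated Taylor series for the Padé approximant and uses a fixed scaling parameter (a real implementation would choose both adaptively from the matrix norms), and it builds on NNlib.batched_mul so the same code runs on a plain Array or a CuArray:

using NNlib: batched_mul   # batched slice-wise A*B for 3-D arrays (CPU and GPU)

# Rough sketch of batched scaling & squaring with a truncated Taylor series.
# A has shape N x N x batch; s and order are fixed here for simplicity.
function batched_exp(A::AbstractArray{T,3}; s::Int = 4, order::Int = 12) where {T}
    B = A ./ T(2^s)                    # scale so the series converges quickly
    E = similar(A); fill!(E, zero(T))
    for i in 1:size(A, 1)
        E[i, i, :] .= one(T)           # batched identity: E = I
    end
    E = E .+ B                         # accumulate I + B + B^2/2! + ...
    P = B
    for k in 2:order
        P = batched_mul(P, B) ./ T(k)  # P = B^k / k!
        E = E .+ P
    end
    for _ in 1:s
        E = batched_mul(E, E)          # undo the scaling: exp(A) = exp(B)^(2^s)
    end
    return E
end

On a CuArray, batched_mul should lower to CUBLAS’s strided-batched GEMM, so the whole computation stays on the GPU. Error control and the choice of s and order would need real care before trusting this for physics.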

Is anyone in the community currently working on or aware of efforts to implement batched linear algebra solvers, specifically for functions like exp(), in packages like GPUArrays.jl, or elsewhere? If not, what would be a good starting point for someone with a physics background (strong in math, but newer to writing custom CUDA kernels) to begin contributing to such a feature?

This seems like a crucial piece of infrastructure for leveraging Julia’s full potential in scientific domains like quantum simulation.

4 Likes

but it’s running on GPU? so not really an apples-to-apples comparison?

Actually, I want a method or tool to speed this up, so I focus on the time cost rather than on a perfectly fair comparison. BTW, the GPU is fast, but GPU memory is expensive.

I see, in that case:

  • Why don’t you use more threads? If you can batch on the GPU, you can surely multithread on the CPU.
  • Have you tried ExponentialUtilities.exponential! with pre-allocated memory?

You are right. The parallel approach is simple, so until now I have been looking for a high-performance single-core method and tool. My computer has 16 physical CPU cores, so I expect multithreading to give roughly a 16x speedup, which would be faster than PyTorch.

Yes, I just forgot to note it.


I will improve my post according to your advice.

1 Like

Nice. As to how to parallelize: you can either use a 3-D array and take slices, or you can use ArraysOfArrays.jl so you get semantically a “vector of matrices” for easier manipulation while still having a contiguous physical layout under the hood.
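
For the 3-D slice option, a minimal multithreaded sketch (assuming Julia is started with several threads, e.g. julia -t 16) could look like:

using LinearAlgebra
using Base.Threads: @threads

# Sketch: the slices are independent, so the batch parallelizes trivially.
function exp_batch_threads!(U::Array{T,3}, A::Array{T,3}) where {T}
    @threads for k in axes(A, 3)
        U[:, :, k] = exp(A[:, :, k])
    end
    return U
end

You would also want BLAS.set_num_threads(1) first, so the BLAS threads don’t fight the Julia threads.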

2 Likes

Looks great! Very well suited to parallelization.

A few ideas:

  • Have you tried storing your matrix with StaticArrays.jl?
  • Have you tried wrapping it in Symmetric to encode its structure in the type (and possibly dispatch on smarter algorithms, depending on the package)?
  • Perhaps Reactant.jl might have a batched matrix exponential @avikpal ?
1 Like

Yes. Adding A = SMatrix{N, N}(A) after the line A = im*A in the Julia script in the main post yields a static matrix SMatrix{9, 9, ComplexF64, 81} with indices SOneTo(9)×SOneTo(9). However, this makes it slower. This is the result:

Julia exp (1 thread(s)) time:
  7.170662 seconds

Adding A = Symmetric(A) after the line A = im*A in the Julia script in the main post yields a Symmetric{ComplexF64, Matrix{ComplexF64}}. However, its performance is similar to the plain Matrix A (for that, see the table in the main post). This is the result:

Julia exp (1 thread(s)) time:
  2.575725 seconds (4.72 M allocations: 4.250 GiB, 0.56% gc time)

I will add a Symmetric item to the main post according to your advice.

1 Like

How do you measure the runtime?

With @time, actually. For more details, see the code in the main post.

You can also try SkewLinearAlgebra.jl. See this example

1 Like

As for Julia, I think we are really missing a lot of batched solvers. We have everything needed to write them (i.e., CUBLAS and CUSOLVER), but they just don’t exist right now; they are tricky to get right. If you know the algorithm well enough, don’t hesitate to try implementing it. For your case (non-Hermitian), that is the batched Padé + scaling & squaring algorithm.
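
For anyone who wants to try: the central building block is already wrapped in CUDA.jl (as far as I know, as CUBLAS.gemm_strided_batched!), which multiplies all slices of two 3-D arrays in a single CUBLAS call, e.g.:

using CUDA

# One call computes C[:,:,k] = A[:,:,k] * B[:,:,k] for every slice k.
A = CuArray(rand(ComplexF64, 9, 9, 1024))
B = CuArray(rand(ComplexF64, 9, 9, 1024))
C = similar(A)
CUDA.CUBLAS.gemm_strided_batched!('N', 'N', one(ComplexF64), A, B, zero(ComplexF64), C)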

Unfortunately, this specialized package is even a little slower than the standard exp. Adding A = SkewHermitian(A) after the line A = im*A in the Julia script in the main post yields a SkewHermitian{ComplexF64, Matrix{ComplexF64}}. However, this makes it slower. This is the result:

Julia exp (1 thread(s)) time:
  6.100191 seconds (9.44 M allocations: 8.195 GiB, 4.43% gc time)

This result is surprising, so I tested the example given in the official docs. Here is the script, which includes only minor modifications:

using BenchmarkTools
using SkewLinearAlgebra, LinearAlgebra

num_threads = 1
A_ori = [0   1 2
         -1  0 3
         -2 -3 0]
A = SkewHermitian(A_ori)

println("\n--- Testing on $(num_threads) thread(s) ---")
LinearAlgebra.BLAS.set_num_threads(num_threads)
println("BLAS threads: ", LinearAlgebra.BLAS.get_num_threads())

@btime exp(A_ori)
@btime exp(A)
@btime exp($A_ori)
@btime exp($A)

The output is below:

--- Testing on 1 thread(s) ---
BLAS threads: 1
  707.246 ns (9 allocations: 1.08 KiB)
  876.596 ns (28 allocations: 3.02 KiB)
  697.315 ns (9 allocations: 1.08 KiB)
  864.815 ns (28 allocations: 3.02 KiB)

You can see that this specialized package is a little slower than the standard exp even here.


I will add a SkewLinearAlgebra.jl item to the main post according to your advice.

Another question: Are all of your matrices totally different or are they similar but scaled?
Background: A common case in QM is computing the time evolution of some state via exp(-i*t*H). In this case, it should be most efficient to factorize H and then reuse the factorization for all of the exponentials.
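
Roughly (a sketch):

using LinearAlgebra

H = hermitianpart(rand(9, 9))
λ, V = eigen(H)                          # factorize H once
U(t) = V * Diagonal(cis.(-t .* λ)) * V'  # exp(-i*t*H) for any t is then cheap
@assert U(0.7) ≈ exp(-0.7im * Matrix(H))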

Also something that your tests neglect is that you can preallocate workspaces and then reuse that memory. So the benchmarks here might not reflect the actual performance you’d get.

1 Like

Thanks for the reminder. But the background is actually \exp[-i\Delta t \hat{H}(t)], where \Delta t is constant and \hat{H}(t) is time-dependent.

Thanks again for the reminder. I will update my last comment.


Also, in my experience, in most cases @btime foo($var) is a little faster than @btime foo(var).
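
(As I understand it, $ interpolates the value into the benchmarked expression, so the cost of looking up a non-const global isn’t measured. For example:)

using BenchmarkTools
x = rand(1000)
@btime sum(x)    # x is a non-const global; its access is part of what is timed
@btime sum($x)   # interpolated: only sum itself is timed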

A lot of what you are seeing here is that most of these functions are optimized for large matrices, and for 9x9 matrices the bottlenecks completely change. For example, cis(H) takes advantage of the Hermitian nature of H whereas exp(im*H) does not, which makes a big difference for large arrays; but for small arrays the overhead dominates:

julia> BLAS.set_num_threads(1); # single-threaded benchmarks are clearer

julia> H = hermitianpart(rand(9,9)); @btime exp($(im * H)); @btime cis($H);
  3.578 μs (18 allocations: 8.78 KiB)
  5.565 μs (22 allocations: 8.73 KiB)

julia> H = hermitianpart(rand(900,900)); @btime exp($(im * H)); @btime cis($H);
  1.664 s (26 allocations: 74.27 MiB)
  167.075 ms (31 allocations: 43.69 MiB)

For 9x9 matrices you really need to pre-allocate everything so that you can work in-place as much as possible, probably calling low-level routines.

In principle, StaticArrays.jl should help with tiny matrices like this. However, matrix exponentiation and eigensolvers are hard to implement, so in practice it ends up calling back to LAPACK except for a few tiny cases that can be hard-coded (e.g. 2x2 and 3x3).

2 Likes

Ah, so you are actually solving a differential equation:
\frac{d}{dt}|\psi(t)\rangle = -i \hat H(t)|\psi(t)\rangle
by writing your own solver with a fixed timestep. You could also try to use DifferentialEquations.jl to solve this. Perhaps this could be more efficient/accurate due to better error estimates and more sophisticated solvers. (Note: the default tolerances are rather large. You’ll likely need to set them to smaller values to get good speed and accurate results.)
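
For example, a minimal sketch (H_of_t here is a made-up stand-in for the actual time-dependent Hamiltonian):

using OrdinaryDiffEq, LinearAlgebra

# Sketch: integrate dψ/dt = -i*H(t)*ψ directly instead of forming exponentials.
H0 = hermitianpart(rand(9, 9))
H_of_t(t) = Matrix(H0) .* (1 + 0.1 * sin(t))         # hypothetical smooth H(t)
f!(dψ, ψ, p, t) = mul!(dψ, H_of_t(t), ψ, -im, false)
ψ0 = normalize(rand(ComplexF64, 9))
prob = ODEProblem(f!, ψ0, (0.0, 1.0))
sol = solve(prob, Tsit5(); abstol = 1e-10, reltol = 1e-10)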

To test this, you would probably need to come up with a more realistic benchmark case. Which would also help to make the other benchmarks more realistic.

1 Like

I tried this with cis(H), implementing a version mycis!(...) that lets you pre-allocate all of the workspaces, and I got an allocation-free method, but it is still only comparable on my machine to exp(im*H) (which uses a different algorithm, not based on eigenvectors, but which doesn’t exploit symmetry).

H = hermitianpart(randn(9,9))
Hwork = mycis_workspace(H)
C = complex(H.data) # workspace for result
@show mycis!(C, H, Hwork...) ≈ cis(H)

using BenchmarkTools

BLAS.set_num_threads(1)
@btime cis($H);
@btime exp($(im*H));
@btime mycis!($C, $H, $Hwork...);

prints:

  4.637 μs (22 allocations: 8.73 KiB)
  3.970 μs (18 allocations: 8.78 KiB)
  3.988 μs (0 allocations: 0 bytes)

An even faster approach might be a preallocated version of the exp(im*H) algorithm, specialized for the Hermitian case if possible. But the basic problem is that LAPACK routines are really optimized for large matrices.

Code
using LinearAlgebra
using LinearAlgebra.LAPACK: @chkvalidparam, chkstride1, chkuplofinite, BlasInt, @blasfunc, libblastrampoline, chklapackerror, checksquare

for (syevr, elty) in ((:dsyevr_,:Float64), (:ssyevr_,:Float32))
    @eval function syevr!(uplo::AbstractChar, A::AbstractMatrix{$elty},
                          W::Vector{$elty}, Z::Matrix{$elty},
                          isuppz::Vector{BlasInt}, work::Vector{$elty}, iwork::Vector{BlasInt})
        Base.require_one_based_indexing(A)
        jobz, range, vl, vu, il, iu, abstol  = 'V', 'A', 0.0, 0.0, 0, 0, -1.0
        chkstride1(A)
        n = checksquare(A)
        chkuplofinite(A, uplo)
        lda = stride(A,2)
        m = Ref{BlasInt}()
        length(W) == n || throw(DimensionMismatch("wrong W size"))
        ldz = n
        size(Z) == (ldz, n) || throw(DimensionMismatch("wrong Z size"))
        length(isuppz) == 2n || throw(DimensionMismatch("wrong isuppz size"))
        lwork  = BlasInt(length(work))
        liwork = BlasInt(length(iwork))
        info   = Ref{BlasInt}()
        ccall((@blasfunc($syevr), libblastrampoline), Cvoid,
            (Ref{UInt8}, Ref{UInt8}, Ref{UInt8}, Ref{BlasInt},
                Ptr{$elty}, Ref{BlasInt}, Ref{$elty}, Ref{$elty},
                Ref{BlasInt}, Ref{BlasInt}, Ref{$elty}, Ptr{BlasInt},
                Ptr{$elty}, Ptr{$elty}, Ref{BlasInt}, Ptr{BlasInt},
                Ptr{$elty}, Ref{BlasInt}, Ptr{BlasInt}, Ref{BlasInt},
                Ref{BlasInt}, Clong, Clong, Clong),
            jobz, range, uplo, n,
            A, max(1,lda), vl, vu,
            il, iu, abstol, m,
            W, Z, max(1,ldz), isuppz,
            work, lwork, iwork, liwork,
            info, 1, 1, 1)
        chklapackerror(info[])
        W, Z
    end
end

function mycis_workspace(H::Hermitian)
    m = size(H,1)
    Zc = similar(complex(H.data))
    W1 = similar(H.data)
    W = Vector{real(eltype(H))}(undef, m)
    Wc = Vector{complex(eltype(H))}(undef, m)
    Z = similar(H.data)
    isuppz = Vector{BlasInt}(undef, 2m)
    work = Vector{eltype(H.data)}(undef, 128m)
    iwork = Vector{BlasInt}(undef, 128m)
    return Zc, W1, W, Wc, Z, isuppz, work, iwork
end

function mycis!(C::Matrix, H::Hermitian, Zc, W1, W, Wc, Z, isuppz, work, iwork)
    A = Hermitian(copyto!(W1, H.data), ifelse(H.uplo == 'U', :U, :L))
    syevr!(H.uplo, A.data, W, Z, isuppz, work, iwork)
    mul!(Zc, Z, Diagonal(Wc .= cis.(W)))
    mul!(C, Zc, Z')
    return C
end

This is definitely worth trying, especially using SMatrix to specialize for 9x9, assuming H is varying mostly smoothly with time — I’m guessing a high-order scheme will be much better than using a simple fixed-timestep exponential integrator.

1 Like