Hi everyone,
I’m a Ph.D. student in physics. A core part of my research involves simulating the time evolution of quantum systems, which boils down to frequently computing the matrix exponential U = exp(A).
For these simulations, the matrix A has a specific structure: it is anti-Hermitian. In my test case I construct it as A = iH, where H is a real, symmetric 9x9 matrix with a zero diagonal. This matrix exponential is the computational bottleneck in many of my simulations, so I’m trying to find the most performant solution for a large batch of these calculations.
I’ve been benchmarking Julia’s performance against PyTorch and MATLAB and wanted to share my findings. Advice in any tool or language would be appreciated!
ps: What I have tried for Julia (rough call forms are sketched after this list):
- ExponentialUtilities.expv (Krylov subspace method): slower than the standard library exp.
- StaticArrays.jl (static matrix type): also slower than exp(A); maybe 9x9 is not small enough.
- SkewLinearAlgebra.jl (skew-Hermitian type): also slower than the standard library exp(A); it seems optimized for large matrices, e.g. 900x900.
- FastExpm.jl: also slower than the standard library exp(A).
- ExponentialUtilities.exponential!(A) and LinearAlgebra.exp!(A): performance similar to exp(A).
- Wrapping A in the Symmetric type: performance similar to exp(A).
- cis(H): performance similar to exp(iH); it seems optimized for large matrices, e.g. 900x900.
- CUDA.jl: the standard library LinearAlgebra.exp is not GPU-compatible for general matrix sizes because internal checks perform disallowed scalar indexing. This forced me to write a manual workaround.
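For concreteness, the single-matrix call forms I benchmarked look roughly like this (a sketch from memory, not the exact benchmark script):

using LinearAlgebra
using ExponentialUtilities
using StaticArrays

N = 9
H = rand(N, N)
H[1:N+1:end] .= 0            # zero diagonal
H = H + transpose(H)         # real symmetric
A = im * H                   # anti-Hermitian test matrix

U1 = exp(A)                  # standard library baseline
U2 = cis(Hermitian(H))       # exp(im*H), goes through the Hermitian eigendecomposition path
U3 = exponential!(copy(A))   # ExponentialUtilities, in place on a copy
U4 = exp(SMatrix{N,N}(A))    # StaticArrays attempt; already slower than exp(A) at 9x9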
Benchmark Setup
- Matrix A: 9x9 anti-Hermitian, purely imaginary, symmetric and dense matrix with a zero diagonal.
- Operation: Batched matrix exponential.
- Batch Size (loop): 524,288 operations (loop = 2 * 2^9 * 2^9).
- CPU: AMD R9-9950X
- GPU: Nvidia GeForce RTX 5070 Ti
- Versions:
  - Julia: 1.10.5
  - PyTorch: 2.9.0.dev20250716+cu129
  - MATLAB: R2023b
Results
Here are the results from my machine. The total time is for the full batch of loop = 2 * 2^9 * 2^9 = 524,288 exponentials, reported in seconds for easier comparison.
Language / Framework | Device | Total Time (s) |
---|---|---|
PyTorch | GPU | 0.55 |
Julia | CPU (Single core) | 2.57 |
MATLAB | CPU (Single core) | 6.93 |
Specific Questions for the Community
- For Julia Experts: Is there a better way to perform a batched matrix exponential on the GPU? Am I missing a library or a technique that provides a batched expm?
- For MATLAB Experts: Is a manual for loop over GPU slices the idiomatic way to handle this, or does a more “vectorized” or batched function exist that I’m unaware of?
- For Python/PyTorch Experts: PyTorch is the winner here (setting aside that the hardware comparison is not entirely fair), but is it the performance ceiling for the Python ecosystem? I’m particularly interested in:
  - Alternative Frameworks: How would JAX (with its XLA compiler) or CuPy perform on this task? Would jax.scipy.linalg.expm or a CuPy equivalent potentially offer lower overhead or faster execution?
  - PyTorch Optimizations: Am I missing any performance tricks within PyTorch itself? For a problem of this scale, are there lower-level approaches (like using Triton) that could yield further speedups?
- General Algorithmic Advice: Given that my matrix A is always anti-Hermitian (meaning exp(A) is unitary), are there specialized algorithms I should be using instead of a general-purpose expm? Could this property be exploited for a significant performance gain in any of these languages? (A sketch of one structure-exploiting idea is below.)
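On the last point, here is the kind of structure exploitation I have in mind: since A = iH with H real symmetric, one real symmetric eigendecomposition H = V Λ Vᵀ gives exp(A) = V diag(cis.(λ)) Vᵀ, which is unitary by construction. A minimal single-matrix CPU sketch (unitary_exp is just an illustrative helper name):

using LinearAlgebra

# Sketch: exp(im*H) for real symmetric H via its eigendecomposition.
# H = V*Diagonal(λ)*V'  =>  exp(im*H) = V*Diagonal(cis.(λ))*V'
function unitary_exp(H::AbstractMatrix{<:Real})
    λ, V = eigen(Symmetric(H))
    return V * Diagonal(cis.(λ)) * V'
end

N = 9
H = rand(N, N); H[1:N+1:end] .= 0; H = H + transpose(H)
@assert unitary_exp(H) ≈ exp(im * H)

Whether this actually beats the general-purpose expm at 9x9, and whether it batches well on the GPU, is part of what I’m asking.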
Code
The code for all three languages is designed to be as equivalent as possible.
PyTorch Code
import torch
import time
# 1. Set test parameters
N = 9
# In PyTorch, 'loop' is the batch size
loop = 2 * 2**9 * 2**9 # 2 * 2^9 * 2^9 = 524,288
# Auto-select device (GPU preferred)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Testing {N}x{N} matrix exponential, Batch Size: {loop}")
print(f"Using device: {device.upper()}")
# Create a batch of purely imaginary, symmetric matrices with zero diagonals.
A = torch.rand(loop, N, N, dtype=torch.float64, device=device) # 1. Start with a batch of random real matrices.
A.diagonal(dim1=-2, dim2=-1).zero_() # 2. Set the main diagonal of each matrix to zero.
A = A + A.transpose(-2, -1) # 3. Make each matrix symmetric (A = A^T).
A = A * 1j # 4. Make each matrix purely imaginary.
# 2. Define the test function (operates on the whole batch)
def test_exp_batch(matrices):
# torch.matrix_exp handles a batch of matrices directly
return torch.matrix_exp(matrices)
# 3. Run the test
print("\n--- Starting performance test ---")
# First run for warmup and JIT/CUDA compilation
_ = test_exp_batch(A)
# If using GPU, synchronize to ensure completion
if device == 'cuda':
torch.cuda.synchronize()
# Second run for accurate timing
print(f"\nPyTorch exp (batch, {device.upper()}) time:")
start_run_time = time.perf_counter()
# Perform the core computation
result = test_exp_batch(A)
# Synchronize again for accurate GPU timing
if device == 'cuda':
torch.cuda.synchronize()
end_run_time = time.perf_counter()
print(f"\nTotal time: {end_run_time - start_run_time:.6f} seconds")
Julia Code
using LinearAlgebra
# 1. Set test parameters
num_threads = 1
N = 9
const loop = 2 * 2^9 * 2^9 # 2 * 2^9 * 2^9 = 524,288
println("Testing $N x $N matrix exponential, $loop loops")
# Create a purely imaginary, symmetric matrix with a zero diagonal.
A = rand(N, N) # 1. Start with a random real matrix.
for i = 1:size(A,1)
A[i,i] = 0 # 2. Set the main diagonal to zero.
end
A = A + transpose(A) # 3. Make it symmetric (A = A^T).
A = im*A # 4. Make it purely imaginary.
# 2. Define the test function
function test_exp(A, loop)
for i in 1:loop
exp(A) # exp() on a matrix is the matrix exponential
end
end
# 3. Set thread count and run the test
println("\n--- Testing on $(num_threads) thread(s) ---")
LinearAlgebra.BLAS.set_num_threads(num_threads)
println("BLAS threads: ", LinearAlgebra.BLAS.get_num_threads())
# First run for warmup and JIT compilation
@time test_exp(A, loop);
# Second run for accurate timing
println("\nJulia exp ($(num_threads) thread(s)) time:")
@time test_exp(A, loop);
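For completeness, the numbers above deliberately use one thread, but a multithreaded CPU variant of the same loop is straightforward (a sketch; batched_exp is just an illustrative helper, with BLAS kept at one thread to avoid oversubscription):

using LinearAlgebra
using Base.Threads

# Sketch: multithreaded CPU batch. Each slice gets its own exp call.
function batched_exp(As::Array{ComplexF64,3})
    out = similar(As)
    @threads for k in axes(As, 3)
        out[:, :, k] = exp(As[:, :, k])
    end
    return out
end

BLAS.set_num_threads(1)          # one BLAS thread per Julia thread
As = im .* randn(9, 9, 1024)     # small batch just to exercise the function
Us = batched_exp(As)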
MATLAB Code
clear;
% 1. Set test parameters
num_threads = 1;
N = 9;
loop = 2 * 2^9 * 2^9; % 2 * 2^9 * 2^9 = 524,288
fprintf('Testing %dx%d matrix exponential, %d loops\n', N, N, loop);
% 2. Create a purely imaginary, symmetric matrix with a zero diagonal.
A = rand(N, N); % 1. Start with a random real matrix.
A(1:N+1:end) = 0; % 2. Set the main diagonal to zero.
A = A + A.'; % 3. Make it symmetric (A = A^T).
A = 1i * A; % 4. Make it purely imaginary.
% 3. Set thread count and display
maxNumCompThreads(num_threads);
fprintf('\n--- Testing on %d thread(s) ---\n', maxNumCompThreads);
% 4. Run the test
% First run for warmup
B = expm(A); % In MATLAB, matrix exponential is expm()
% Second run for accurate timing
fprintf('\nTiming...\n');
tic; % Start timer
for i = 1:loop
B = expm(A);
end
elapsedTime = toc; % Stop timer
fprintf('MATLAB expm (%d thread(s)) time: %f seconds\n', maxNumCompThreads, elapsedTime);
Update based on community feedback
User @yolhan_mannes pointed out a key reason for the performance difference: PyTorch uses a highly optimized batched implementation of matrix_exp on the GPU, while the Julia ecosystem may not yet have a readily available equivalent.
The suggested path forward is to implement a batched version of the standard “Padé Approximant + Scaling & Squaring” algorithm, likely using custom CUDA kernels.
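To make that concrete, here is a toy batched scaling-and-squaring sketch built only from batched matrix multiplies (NNlib.batched_mul, which should map to strided-batched GEMM on GPU arrays). It is a CPU sketch under simplifying assumptions: a truncated Taylor series instead of the Padé approximant, and fixed scaling/order instead of norm-based selection, so it is a starting point rather than production code:

using LinearAlgebra
using NNlib: batched_mul

# Toy batched exp for an N x N x batch array (CPU sketch).
# Assumptions: fixed scaling s and Taylor order m; a real implementation would
# pick them from the matrix norm and use a Padé approximant, as PyTorch does.
function batched_expm(As::AbstractArray{T,3}; s::Int = 6, m::Int = 10) where {T}
    N, _, B = size(As)
    X = As ./ T(2^s)                      # exp(A) = exp(A / 2^s)^(2^s)
    E = zeros(T, N, N, B)
    for k in 1:B
        E[:, :, k] = Matrix{T}(I, N, N)   # identity in every slice
    end
    term = copy(E)
    for k in 1:m                          # truncated Taylor series of exp(X)
        term = batched_mul(term, X) ./ T(k)
        E .+= term
    end
    for _ in 1:s                          # repeated squaring: E <- E * E
        E = batched_mul(E, E)
    end
    return E
end

As = im .* randn(9, 9, 16)
maximum(abs, batched_expm(As)[:, :, 1] .- exp(As[:, :, 1]))   # sanity check against exp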
Is anyone in the community currently working on, or aware of, efforts to implement batched linear algebra solvers, specifically for functions like exp(), in packages like GPUArrays.jl or elsewhere? If not, what would be a good starting point for someone with a physics background (strong in math, but newer to writing custom CUDA kernels) to begin contributing to such a feature?
This seems like a crucial piece of infrastructure for leveraging Julia’s full potential in scientific domains like quantum simulation.