@eveningsilverfox , by restructuring the code slightly in your Jtest.jl example, we can use @lmiq’s suggestion and safely incorporate multithreading for part of the computation. This provides a 20% or so speedup, not counting startup/compilation time. Whether or not this can also be used in your actual application depends on details you haven’t yet provided. More on this after the code and timing results.

Here is the contents of Jtest2.jl:

```
sing LinearAlgebra
using ExponentialUtilities
using ExponentialUtilities: alloc_mem
using MKL
BLAS.set_num_threads(1)
function computethreads()
N1 = 25; N = 2*N1 #size of Hamiltonian matrix (H_) is 2*N by 2*N
#Hamiltonian matrix H
h0 = [-1 0; 0 1]
hp = [-1 0; 0 1]
hm = [-1 0; 0 1]
H0 = kron(Diagonal(ones(N1)), h0) + kron(diagm(1=>ones(N1-1)), hp) + kron(diagm(-1=>ones(N1-1)), hm)
H = [H0 zeros(ComplexF64, size(H0))
zeros(ComplexF64, size(H0)) H0]
#@show size(H)
for mn = 1:N
H[2*mn-1,2*mn] = H[2*mn,2*mn-1] = 1
end
H[(2*N1-1):2*N1, (2*N1+1):2*N1+2] .= [1 0; 0 -1]
H[(2*N1+1):2*N1+2, (2*N1-1):2*N1] .= [1 0; 0 -1]
#method = ExpMethodHigham2005() # fastest
#cache = alloc_mem(H, method)
#time
tmin = -20; tmax = 20;
Nt0 = 200; tar0 = range(tmin,tmax,Nt0); deltat0 = tar0[2] - tar0[1]
Uu1armax = zeros(ComplexF64, 2N, 2N, Nt0)
for gh = Nt0:-1:2
E1 = exponential!((-im * 0.001) * H);
Uu1ar = (@view Uu1armax[:, :, 1:gh]); Uu1ar[:, :, gh] .= I(2N)
f1 = exp(-0.5 * deltat0)
nbt = BLAS.get_num_threads()
BLAS.set_num_threads(1)
Threads.@threads for hi = gh-1:-1:1
Uu1ar[:, :, hi] .= exponential!(-im * deltat0 * H)
end
BLAS.set_num_threads(nbt)
for hi = gh-1:-1:1
Uu1ar[:, :, hi] .= f1 * Uu1ar[:, :, hi+1] * Uu1ar[:, :, hi]
end
end
end
computethreads();
```

and here are timing results using a fresh Julia REPL:

```
~/s/julia/e/maxtrix_exp julia --project -t8
_
_ _ _(_)_ | Documentation: https://docs.julialang.org
(_) | (_) (_) |
_ _ _| |_ __ _ | Type "?" for help, "]?" for Pkg help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 1.9.2 (2023-07-05)
_/ |\__'_|_|_|\__'_| | Official https://julialang.org/ release
|__/ |
julia> @time include("Jtest2.jl")
12.108647 seconds (9.08 M allocations: 27.478 GiB, 6.97% gc time, 35.03% compilation time)
julia> @time computethreads();
7.813464 seconds (431.13 k allocations: 26.940 GiB, 8.61% gc time)
```

The second run, after package loading and compilation, takes about 7.8 seconds versus 9.9 seconds for the original code in Jtest.jl.

In the code above, the array `Uu1ar`

is used as temporary storage for multithreaded precomputation of all the matrix exponentials. You’ve previously stated that in your real application the matrix being exponentiated to form `f2`

depends on the loop index `hi`

. So whether or not you can do the multithreading as shown depends on whether this matrix can be computed independently for arbitrary loop indices, independent of the order of loop iteration.