# Speeding up matrix exponential and matrix multiplication

I am very new to Julia and trying to convert my python codes to Julia to speed up my work. However, the present Julia version, pasted below, still runs slower than my python version.

I see two bottlenecks:
(1) the matrix exponential calculation, for which I have tried two methods. I have commented out the relatively slower one, fastExpm.
(2) the various matrix multiplications.

Are there any glaring inefficiencies in my code which can significantly speed up the process? I am surprised that this takes longer than my python code. I have set a small size for the matrix at the moment, N1=14, which creates the Hamiltonian of size 4x14=56 by 56. Ideally, this should work for N1~100.

``````using LinearAlgebra
using FastExpm
using ExponentialUtilities

N1 = 1; N = 2*N1; #size of Hamiltonian matrix (H_) is 2*N by 2*N

#Hamiltonian matrix H
h0 = [-1 0; 0 1];
hp = [-1 0; 0 1];
hm = [-1 0; 0 1];
H0 = kron(Diagonal(ones(N1,N1)), h0) + kron(diagm(1=>ones(N1-1)), hp) + kron(diagm(-1=>ones(N1-1)), hm);
H = [H0 zeros(Complex{Float64}, size(H0))
zeros(Complex{Float64}, size(H0)) H0];
for mn = 1:N
H[2*mn-1,2*mn] = 1; H[2*mn,2*mn-1] = 1;
end
H[(2*N1-1):2*N1, (2*N1+1):2*N1+2] = [1 0; 0 -1];
H[(2*N1+1):2*N1+2, (2*N1-1):2*N1] = [1 0; 0 -1];

#time
tmin = -80; tmax = 280;
Nt0 = 1500; tar0 = range(tmin,tmax,Nt0); deltat0 = tar0[2] - tar0[1];

#Green's function matrix
Glextt = zeros(Complex{Float64}, Nt0,2*N,2*N);
#Temporary green's function matrix, only used for the calculation for Glextt
Glesstt0 = zeros(Complex{Float64},2*N,2*N);

##-----------------CALCULATION WHICH NEEDS TO BE SPED UP----------------------------
for gh = Nt0:-1:2
println(gh)
Uu1ar = zeros(Complex{Float64},gh,2*N,2*N); Uu1ar[gh,:,:] = Diagonal(ones(2*N));
for hi = gh-1:-1:1
#option 1 https://github.com/fmentink/FastExpm.jl
#global Uu1ar[hi,:,:] = exp(-0.5*Gamma*deltat0)*Uu1ar[hi+1,:,:]*fastExpm(-im*H*deltat0);
#option 2 https://github.com/SciML/ExponentialUtilities.jl
global Uu1ar[hi,:,:] = exp(-0.5*deltat0)*Uu1ar[hi+1,:,:]*exponential!(-im*H*deltat0);
end
Gr = -im*Uu1ar;
Gap = conj(permutedims(-im*Uu1ar, (1, 3, 2)));

Siglex0 = zeros(Complex{Float64},gh,gh);
for mn = 1:gh
global Siglex0[mn,:] = -1 ./ (tar0[mn] .- tar0[1:gh] .+ im*deltat0);
end
for pq = 1:N
global Glesstt0[2*pq-1,2*pq] = deltat0^2*tr( transpose(Gr[:,2*pq-1,:])*Siglex0*Gap[:,:,2*pq] );
end
global Glextt[gh,:,:] = Glesstt0; #G<(t,t)
end
##----------------------------------------------------------------------------------
``````

A few general tips:

1. If you want to take advantage of compilation, you need to put this into a function or functions.
2. Eliminate global references.
3. Be aware that Julia is column major.

https://docs.julialang.org/en/v1/manual/performance-tips/

3 Likes

You donâ€™t need the exponential here, just the expm. Remove the exponential calculation and use `expv!`

1 Like

But isnâ€™t expv(t,A,b) used for exp(tA)*b, i.e., left-multiplying exp(tA) to the vector b? Here I am right-multiplying exp(tA) to the matrix to its left (Uu1arâ€¦).

Transpose

Thanks, but this one actually performs slower than exponential! or fastExpm

``````using LinearAlgebra
using FastExpm
using ExponentialUtilities

N1 = 20; N = 2*N1; mu = 0; delt = 1; t0 = 2*delt;
T = 0.1*delt;

#1d
h0 = [-mu 0; 0 mu];
hp = [-t0/2 0; 0 t0/2];
hm = [-t0/2 0; 0 t0/2];
H0 = kron(Diagonal(ones(N1,N1)), h0) + kron(diagm(1=>ones(N1-1)), hp) + kron(diagm(-1=>ones(N1-1)), hm);
H = [H0 zeros(Complex{Float64}, size(H0))
zeros(Complex{Float64}, size(H0)) H0];
phife = pi/3;
H[(2*N1-1):2*N1, (2*N1+1):2*N1+2] = [T*exp(-im*phife/2) 0; 0 -T*exp(im*phife/2)];
H[(2*N1+1):2*N1+2, (2*N1-1):2*N1] = [T*exp(im*phife/2) 0; 0 -T*exp(-im*phife/2)];

d0 = 0.001;

t0a = time()
C = fastExpm(-im*H*d0);
t0b = time()

Ir = Diagonal(ones(Complex{Float64}, 2*N));
Cr = zeros(Complex{Float64}, 2*N,2*N);
t1a = time()
for ij = 1:2*N
Cr[:,ij] = expv(1,-im*d0*transpose(H),Ir[:,ij]);
end
Cr = transpose(Cr);
t1b = time()

println(norm(Cr-C))
println(t0b-t0a)
println(t1b-t1a)
``````

1 Like

And use some views.

1 Like

I just want to point out that the example you posted will not run. The loop beginning

`````` for gh = Nt0:-1:2
``````

has no `end` statement. I tried inserting an `end` after the last line but the code still fails because `Gamma` is not defined. You would likely get a lot more useful suggestions if you supply complete working example.

3 Likes

Thanks, apologies. Corrected.

I put the code from your (edited) original post in the file â€śoriginal.jlâ€ť and commented out the `println` statement. The timing was as follows:

``````julia> @time include("original.jl")
85.059808 seconds (43.22 M allocations: 39.862 GiB, 3.20% gc time, 0.06% compilation time)
``````

I then implemented some of the previous suggestions in this thread, resulting in the following function:

``````using LinearAlgebra
using FastExpm
using ExponentialUtilities

function compute()
N1 = 1; N = 2*N1 #size of Hamiltonian matrix (H_) is 2*N by 2*N

#Hamiltonian matrix H
h0 = [-1 0; 0 1]
hp = [-1 0; 0 1]
hm = [-1 0; 0 1]
H0 = kron(Diagonal(ones(N1)), h0) + kron(diagm(1=>ones(N1-1)), hp) + kron(diagm(-1=>ones(N1-1)), hm)
H = [H0                           zeros(ComplexF64, size(H0))
zeros(ComplexF64, size(H0))  H0]
for mn = 1:N
H[2*mn-1,2*mn] = 1; H[2*mn,2*mn-1] = 1
end
H[(2*N1-1):2*N1, (2*N1+1):2*N1+2] = [1 0; 0 -1]
H[(2*N1+1):2*N1+2, (2*N1-1):2*N1] = [1 0; 0 -1]

#time
tmin = -80; tmax = 280;
Nt0 = 1500; tar0 = range(tmin,tmax,Nt0); deltat0 = tar0[2] - tar0[1]

#Green's function matrix
Glextt = zeros(ComplexF64, 2*N, 2*N, Nt0)
#Temporary green's function matrix, only used for the calculation for Glextt
Glesstt0 = zeros(ComplexF64,2*N,2*N)

##-----------------CALCULATION WHICH NEEDS TO BE SPED UP----------------------------
Uu1armax = zeros(ComplexF64, 2N, 2N, Nt0)
Siglex0max = zeros(ComplexF64, Nt0, Nt0)
for gh = Nt0:-1:2
#println(gh)
Uu1ar = (@view Uu1armax[:, :, 1:gh]); Uu1ar[:, :, gh] .= I(2N)
f1 = exp(-0.5 * deltat0)
f2 = exponential!(-im * H * deltat0)
for hi = gh-1:-1:1
Uu1ar[:, :, hi] = f1 * (@view Uu1ar[:, :, hi+1]) * f2
end
Gr = -im*Uu1ar

Gap = conj(permutedims(-im*Uu1ar, (3, 2, 1)))

Siglex0 = @view Siglex0max[1:gh, 1:gh]
for mn = 1:gh
Siglex0[mn, :] .= -1 ./ (tar0[mn] .- tar0[1:gh] .+ im*deltat0)
end
for pq = 1:N
Glesstt0[2*pq-1, 2*pq] = deltat0^2 * tr((@view Gr[2*pq-1, :, :]) * Siglex0 * @view Gap[:, :, 2*pq])
end
Glextt[:, :, gh] .= Glesstt0 #G<(t,t)
end
##----------------------------------------------------------------------------------
return Glextt
end
``````

Timing this function produced the following results:

``````julia> @time Glextt2 = compute();
34.120436 seconds (1.17 M allocations: 1.684 GiB, 0.22% gc time)
``````

And checking that the two codes returned the same result (modulo floating point errors and permuted dimensions):

``````julia> norm(Glextt2 - permutedims(Glextt, (2,3,1)))
8.164126053517088e-15
``````

Reordering dimensions to exploit the column major Julia ordering, preallocating arrays, and using views significantly reduced the memory allocation and runtime. This was without profiling. With profiling, Iâ€™ll bet that others here (or you!) can do even better.

3 Likes

It should be a little more advertised that one can just add `@views` in front a whole code block, and then all slices stop allocating.

``````julia> @time Glextt2 = compute();
49.784395 seconds (1.17 M allocations: 1.684 GiB, 0.45% gc time, 0.15% compilation time)
``````

Now I removed the `@view` statementes and added a single `@views` in front of the for loop:

``````@views for gh = Nt0:-1:2
``````
``````julia> @time Glextt2 = compute();
49.550174 seconds (1.17 M allocations: 1.684 GiB, 0.34% gc time)
``````

and the result is the same, essentially, but simpler to code (particularly if you are used to the fact that slices do not allocate, like in python).

I get some speedup here if I use MKL:

``````julia> using MKL

julia> @time Glextt2 = compute();
43.656992 seconds (1.17 M allocations: 1.684 GiB, 0.20% gc time)
``````
2 Likes

Interestingâ€“For me itâ€™s slightly but consistently slower:

``````julia> @time Glextt2 = compute();
35.355297 seconds (1.17 M allocations: 1.684 GiB, 0.37% gc time)
``````

My setup:

``````julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 Ă— Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
Threads: 8 on 8 virtual cores
Environment:
JULIA_EDITOR = code
``````

Yes, MKL vs BLAS is sort of dependent on the configuration details. Iâ€™m using 1.10 but probably the difference is that Iâ€™m using Linux (my CPUs are slower, but also Intelâ€™s).

Even on my Linux machine the times are virtually identical:

``````julia> @time Glextt2 = compute();
31.631847 seconds (1.17 M allocations: 1.684 GiB, 0.14% gc time)

julia> using MKL

julia> @time Glextt2 = compute();
31.869044 seconds (1.17 M allocations: 1.684 GiB, 0.07% gc time)

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 8 Ă— Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
Threads: 8 on 8 virtual cores
Environment:
``````

Iâ€™m guessing that OpenBLAS works relatively well on a Core i7-9700, compared to other architectures, at least for these kinds of operations.

Thanks. Could it be that a bulk of the savings was due to evaluating `f2 = exponential!(-im * H * deltat0)` only once outside the `for hi = gh-1:-1:1` loop? I was actually using (not provided here for simplicity) a H which changes with the index hi, which is why I had it inside in the first place.

Ideally you should profile the code (VSCode `@profview` macro is probably the best tool there).

But you can also just test. For example:

``````julia> function test(H, delta0, Nt0)
local f2
for gh = Nt0:-1:2
for hi = gh-1:-1:1
f2 = exponential!(-im * H * deltat0)
end
end
return f2
end
test (generic function with 1 method)

julia> @time test(H, deltat0, Nt0)
6.401251 seconds (11.25 M allocations: 2.765 GiB, 6.13% gc time, 0.19% compilation time)
``````

Which is still only a small fraction of the total time (here, ~50s). So the greatest bottleneck is somewhere else.

You can reduce a lot the number of allocations by using in-place matrix multiplications, using some buffer, like:

``````...
m = similar(H) # create buffer
@views for gh = Nt0:-1:2
#println(gh)
Uu1ar = (Uu1armax[:, :, 1:gh]); Uu1ar[:, :, gh] .= I(2N)
f1 = exp(-0.5 * deltat0)
f2 = exponential!(-im * H * deltat0)
for hi = gh-1:-1:1
# in place multiplications (really didn't test for correctness)
mul!(m, Uu1ar[:, :, hi+1], f2)
mul!(f2, f1, m)
Uu1ar[:, :, hi] .= f2
end
...
``````

But that didnÂ´t decrease significantly (actually perhaps increased) the time here:

``````julia> @time c2 = compute2();
53.926530 seconds (45.69 k allocations: 1.332 GiB, 0.06% gc time)
``````

vs previously:

``````julia> @time Glextt2 = compute();
43.656992 seconds (1.17 M allocations: 1.684 GiB, 0.20% gc time)
``````

If the computation is mostly bound by the matrix multiplication now, then there is not much more than can be done (if the algorithm cannot be rearranged to avoid it).

Post a flame graph.

When I put it back into the loop as follows:

``````        for hi = gh-1:-1:1
f2 = exponential!(-im * H * deltat0)
Uu1ar[:, :, hi] .= f1 * (@view Uu1ar[:, :, hi+1]) * f2
end
``````

I get (back on the Windows machine using OpenBLAS)

``````julia> @time Glextt2 = compute2();
72.423707 seconds (12.40 M allocations: 4.444 GiB, 0.49% gc time)
``````

so yes, the bulk of the previous savings were due to pulling this out of the loop. However, the time savings, though not as dramatic, are still significant, from 85 seconds to 72.4 seconds. And the allocations are reduced from 40 GBytes to 4.4 GBytes. I guess my next suggested steps would be profiling, and examining the optional inputs to `exponential!`.

Indeed using the `alloc_mem` would likely be helpful, but Iâ€™d need to see the flamegraph to know if thatâ€™s the actual time issue or if the allocations donâ€™t matter.