Speeding up matrix exponential and matrix multiplication

eveningsilverfox · July 30, 2023, 4:22pm

I am very new to Julia and trying to convert my python codes to Julia to speed up my work. However, the present Julia version, pasted below, still runs slower than my python version.

I see two bottlenecks:
(1) the matrix exponential calculation, for which I have tried two methods. I have commented out the relatively slower one, fastExpm.
(2) the various matrix multiplications.

Are there any glaring inefficiencies in my code which can significantly speed up the process? I am surprised that this takes longer than my python code. I have set a small size for the matrix at the moment, N1=14, which creates the Hamiltonian of size 4x14=56 by 56. Ideally, this should work for N1~100.

using LinearAlgebra
using FastExpm
using ExponentialUtilities

N1 = 1; N = 2*N1; #size of Hamiltonian matrix (H_) is 2*N by 2*N

#Hamiltonian matrix H
h0 = [-1 0; 0 1];
hp = [-1 0; 0 1];
hm = [-1 0; 0 1];
H0 = kron(Diagonal(ones(N1,N1)), h0) + kron(diagm(1=>ones(N1-1)), hp) + kron(diagm(-1=>ones(N1-1)), hm);
H = [H0 zeros(Complex{Float64}, size(H0))
     zeros(Complex{Float64}, size(H0)) H0];
for mn = 1:N
    H[2*mn-1,2*mn] = 1; H[2*mn,2*mn-1] = 1;
end
H[(2*N1-1):2*N1, (2*N1+1):2*N1+2] = [1 0; 0 -1];
H[(2*N1+1):2*N1+2, (2*N1-1):2*N1] = [1 0; 0 -1];       
    
#time
tmin = -80; tmax = 280;
Nt0 = 1500; tar0 = range(tmin,tmax,Nt0); deltat0 = tar0[2] - tar0[1];

#Green's function matrix
Glextt = zeros(Complex{Float64}, Nt0,2*N,2*N);
#Temporary green's function matrix, only used for the calculation for Glextt
Glesstt0 = zeros(Complex{Float64},2*N,2*N);

##-----------------CALCULATION WHICH NEEDS TO BE SPED UP----------------------------
for gh = Nt0:-1:2
    println(gh)
    Uu1ar = zeros(Complex{Float64},gh,2*N,2*N); Uu1ar[gh,:,:] = Diagonal(ones(2*N));
    for hi = gh-1:-1:1
        #option 1 https://github.com/fmentink/FastExpm.jl
        #global Uu1ar[hi,:,:] = exp(-0.5*Gamma*deltat0)*Uu1ar[hi+1,:,:]*fastExpm(-im*H*deltat0);
        #option 2 https://github.com/SciML/ExponentialUtilities.jl
        global Uu1ar[hi,:,:] = exp(-0.5*deltat0)*Uu1ar[hi+1,:,:]*exponential!(-im*H*deltat0);
    end
    Gr = -im*Uu1ar;
    Gap = conj(permutedims(-im*Uu1ar, (1, 3, 2)));

    Siglex0 = zeros(Complex{Float64},gh,gh);
    for mn = 1:gh
        global Siglex0[mn,:] = -1 ./ (tar0[mn] .- tar0[1:gh] .+ im*deltat0);       
    end
    for pq = 1:N    
        global Glesstt0[2*pq-1,2*pq] = deltat0^2*tr( transpose(Gr[:,2*pq-1,:])*Siglex0*Gap[:,:,2*pq] );
    end
    global Glextt[gh,:,:] = Glesstt0; #G<(t,t)
end
##----------------------------------------------------------------------------------

mkitti · July 30, 2023, 4:36pm

A few general tips:

If you want to take advantage of compilation, you need to put this into a function or functions.
Eliminate global references.
Be aware that Julia is column major.

https://docs.julialang.org/en/v1/manual/performance-tips/

ChrisRackauckas · July 30, 2023, 4:41pm

You don’t need the exponential here, just the expm. Remove the exponential calculation and use expv!

eveningsilverfox · July 30, 2023, 4:45pm

But isn’t expv(t,A,b) used for exp(tA)*b, i.e., left-multiplying exp(tA) to the vector b? Here I am right-multiplying exp(tA) to the matrix to its left (Uu1ar…).

ChrisRackauckas · July 30, 2023, 4:50pm

Transpose

eveningsilverfox · July 30, 2023, 7:24pm

Thanks, but this one actually performs slower than exponential! or fastExpm

using LinearAlgebra
using FastExpm
using ExponentialUtilities

N1 = 20; N = 2*N1; mu = 0; delt = 1; t0 = 2*delt;
T = 0.1*delt;

#1d
h0 = [-mu 0; 0 mu];
hp = [-t0/2 0; 0 t0/2];
hm = [-t0/2 0; 0 t0/2];
H0 = kron(Diagonal(ones(N1,N1)), h0) + kron(diagm(1=>ones(N1-1)), hp) + kron(diagm(-1=>ones(N1-1)), hm);
H = [H0 zeros(Complex{Float64}, size(H0))
     zeros(Complex{Float64}, size(H0)) H0];
phife = pi/3;
H[(2*N1-1):2*N1, (2*N1+1):2*N1+2] = [T*exp(-im*phife/2) 0; 0 -T*exp(im*phife/2)];
H[(2*N1+1):2*N1+2, (2*N1-1):2*N1] = [T*exp(im*phife/2) 0; 0 -T*exp(-im*phife/2)];       

d0 = 0.001;

t0a = time()
C = fastExpm(-im*H*d0);
t0b = time()

Ir = Diagonal(ones(Complex{Float64}, 2*N));
Cr = zeros(Complex{Float64}, 2*N,2*N);
t1a = time()
for ij = 1:2*N
    Cr[:,ij] = expv(1,-im*d0*transpose(H),Ir[:,ij]);
end
Cr = transpose(Cr);
t1b = time()

println(norm(Cr-C))
println(t0b-t0a)
println(t1b-t1a)

nsajko · July 30, 2023, 7:51pm

Please apply mkitti’s advice above.

ChrisRackauckas · July 30, 2023, 7:58pm

And use some views.

PeterSimon · July 30, 2023, 8:23pm

I just want to point out that the example you posted will not run. The loop beginning

 for gh = Nt0:-1:2

has no end statement. I tried inserting an end after the last line but the code still fails because Gamma is not defined. You would likely get a lot more useful suggestions if you supply complete working example.

eveningsilverfox · July 30, 2023, 8:50pm

Thanks, apologies. Corrected.

PeterSimon · July 30, 2023, 10:24pm

I put the code from your (edited) original post in the file “original.jl” and commented out the println statement. The timing was as follows:

julia> @time include("original.jl")
 85.059808 seconds (43.22 M allocations: 39.862 GiB, 3.20% gc time, 0.06% compilation time)

I then implemented some of the previous suggestions in this thread, resulting in the following function:

using LinearAlgebra
using FastExpm
using ExponentialUtilities

function compute()
    N1 = 1; N = 2*N1 #size of Hamiltonian matrix (H_) is 2*N by 2*N

    #Hamiltonian matrix H
    h0 = [-1 0; 0 1]
    hp = [-1 0; 0 1]
    hm = [-1 0; 0 1]
    H0 = kron(Diagonal(ones(N1)), h0) + kron(diagm(1=>ones(N1-1)), hp) + kron(diagm(-1=>ones(N1-1)), hm)
    H = [H0                           zeros(ComplexF64, size(H0))
         zeros(ComplexF64, size(H0))  H0]
    for mn = 1:N
        H[2*mn-1,2*mn] = 1; H[2*mn,2*mn-1] = 1
    end
    H[(2*N1-1):2*N1, (2*N1+1):2*N1+2] = [1 0; 0 -1]
    H[(2*N1+1):2*N1+2, (2*N1-1):2*N1] = [1 0; 0 -1]
    
    #time
    tmin = -80; tmax = 280;
    Nt0 = 1500; tar0 = range(tmin,tmax,Nt0); deltat0 = tar0[2] - tar0[1]

    #Green's function matrix
    Glextt = zeros(ComplexF64, 2*N, 2*N, Nt0)
    #Temporary green's function matrix, only used for the calculation for Glextt
    Glesstt0 = zeros(ComplexF64,2*N,2*N)

    ##-----------------CALCULATION WHICH NEEDS TO BE SPED UP----------------------------
    Uu1armax = zeros(ComplexF64, 2N, 2N, Nt0)
    Siglex0max = zeros(ComplexF64, Nt0, Nt0)
    for gh = Nt0:-1:2
        #println(gh)
        Uu1ar = (@view Uu1armax[:, :, 1:gh]); Uu1ar[:, :, gh] .= I(2N)
        f1 = exp(-0.5 * deltat0)
        f2 = exponential!(-im * H * deltat0)
        for hi = gh-1:-1:1
             Uu1ar[:, :, hi] = f1 * (@view Uu1ar[:, :, hi+1]) * f2
        end
        Gr = -im*Uu1ar

        Gap = conj(permutedims(-im*Uu1ar, (3, 2, 1)))

        Siglex0 = @view Siglex0max[1:gh, 1:gh] 
        for mn = 1:gh
            Siglex0[mn, :] .= -1 ./ (tar0[mn] .- tar0[1:gh] .+ im*deltat0)
        end
        for pq = 1:N    
            Glesstt0[2*pq-1, 2*pq] = deltat0^2 * tr((@view Gr[2*pq-1, :, :]) * Siglex0 * @view Gap[:, :, 2*pq])
        end
        Glextt[:, :, gh] .= Glesstt0 #G<(t,t)
    end
    ##----------------------------------------------------------------------------------
    return Glextt
end

Timing this function produced the following results:

julia> @time Glextt2 = compute();
 34.120436 seconds (1.17 M allocations: 1.684 GiB, 0.22% gc time)

And checking that the two codes returned the same result (modulo floating point errors and permuted dimensions):

julia> norm(Glextt2 - permutedims(Glextt, (2,3,1)))
8.164126053517088e-15

Reordering dimensions to exploit the column major Julia ordering, preallocating arrays, and using views significantly reduced the memory allocation and runtime. This was without profiling. With profiling, I’ll bet that others here (or you!) can do even better.

lmiq · July 30, 2023, 10:47pm

It should be a little more advertised that one can just add @views in front a whole code block, and then all slices stop allocating.

With your code above:

julia> @time Glextt2 = compute();
 49.784395 seconds (1.17 M allocations: 1.684 GiB, 0.45% gc time, 0.15% compilation time)

Now I removed the @view statementes and added a single @views in front of the for loop:

@views for gh = Nt0:-1:2

julia> @time Glextt2 = compute();
 49.550174 seconds (1.17 M allocations: 1.684 GiB, 0.34% gc time)

and the result is the same, essentially, but simpler to code (particularly if you are used to the fact that slices do not allocate, like in python).

I get some speedup here if I use MKL:

julia> using MKL

julia> @time Glextt2 = compute();
 43.656992 seconds (1.17 M allocations: 1.684 GiB, 0.20% gc time)

PeterSimon · July 30, 2023, 10:59pm

Interesting–For me it’s slightly but consistently slower:

julia> @time Glextt2 = compute();
 35.355297 seconds (1.17 M allocations: 1.684 GiB, 0.37% gc time)

My setup:

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8

lmiq · July 30, 2023, 11:01pm

Yes, MKL vs BLAS is sort of dependent on the configuration details. I’m using 1.10 but probably the difference is that I’m using Linux (my CPUs are slower, but also Intel’s).

PeterSimon · July 30, 2023, 11:15pm

Even on my Linux machine the times are virtually identical:

julia> @time Glextt2 = compute();
 31.631847 seconds (1.17 M allocations: 1.684 GiB, 0.14% gc time)

julia> using MKL

julia> @time Glextt2 = compute();
 31.869044 seconds (1.17 M allocations: 1.684 GiB, 0.07% gc time)

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1

I’m guessing that OpenBLAS works relatively well on a Core i7-9700, compared to other architectures, at least for these kinds of operations.

eveningsilverfox · July 30, 2023, 11:34pm

Thanks. Could it be that a bulk of the savings was due to evaluating f2 = exponential!(-im * H * deltat0) only once outside the for hi = gh-1:-1:1 loop? I was actually using (not provided here for simplicity) a H which changes with the index hi, which is why I had it inside in the first place.

lmiq · July 30, 2023, 11:51pm

Ideally you should profile the code (VSCode @profview macro is probably the best tool there).

But you can also just test. For example:

julia> function test(H, delta0, Nt0)
              local f2
              for gh = Nt0:-1:2
                 for hi = gh-1:-1:1
                     f2 = exponential!(-im * H * deltat0)
                 end
               end
               return f2
           end
test (generic function with 1 method)

julia> @time test(H, deltat0, Nt0)
  6.401251 seconds (11.25 M allocations: 2.765 GiB, 6.13% gc time, 0.19% compilation time)

Which is still only a small fraction of the total time (here, ~50s). So the greatest bottleneck is somewhere else.

You can reduce a lot the number of allocations by using in-place matrix multiplications, using some buffer, like:

...
           m = similar(H) # create buffer
           @views for gh = Nt0:-1:2
               #println(gh)
               Uu1ar = (Uu1armax[:, :, 1:gh]); Uu1ar[:, :, gh] .= I(2N)
               f1 = exp(-0.5 * deltat0)
               f2 = exponential!(-im * H * deltat0)
               for hi = gh-1:-1:1 
                    # in place multiplications (really didn't test for correctness)
                    mul!(m, Uu1ar[:, :, hi+1], f2)
                    mul!(f2, f1, m)
                    Uu1ar[:, :, hi] .= f2
               end
...

But that didn´t decrease significantly (actually perhaps increased) the time here:

julia> @time c2 = compute2();
 53.926530 seconds (45.69 k allocations: 1.332 GiB, 0.06% gc time)

vs previously:

julia> @time Glextt2 = compute();
 43.656992 seconds (1.17 M allocations: 1.684 GiB, 0.20% gc time)

If the computation is mostly bound by the matrix multiplication now, then there is not much more than can be done (if the algorithm cannot be rearranged to avoid it).

ChrisRackauckas · July 31, 2023, 12:13am

Post a flame graph.

PeterSimon · July 31, 2023, 12:15am

When I put it back into the loop as follows:

        for hi = gh-1:-1:1
            f2 = exponential!(-im * H * deltat0)
            Uu1ar[:, :, hi] .= f1 * (@view Uu1ar[:, :, hi+1]) * f2
        end

I get (back on the Windows machine using OpenBLAS)

julia> @time Glextt2 = compute2();
 72.423707 seconds (12.40 M allocations: 4.444 GiB, 0.49% gc time)

so yes, the bulk of the previous savings were due to pulling this out of the loop. However, the time savings, though not as dramatic, are still significant, from 85 seconds to 72.4 seconds. And the allocations are reduced from 40 GBytes to 4.4 GBytes. I guess my next suggested steps would be profiling, and examining the optional inputs to exponential!.

ChrisRackauckas · July 31, 2023, 12:34am

Indeed using the alloc_mem would likely be helpful, but I’d need to see the flamegraph to know if that’s the actual time issue or if the allocations don’t matter.

Topic		Replies	Views
Why is this Julia code considerably slower than Matlab New to Julia performance	64	8397	March 5, 2017
Why is this simple function twice as slow as its Python version Performance question	97	4428	April 12, 2021
Benchmark MATLAB & Julia for Matrix Operations Performance	148	19786	October 15, 2019
I just decided to migrate from Python+Fortran to Julia as Julia was faster in my test Community fortran , performance , python , tullio , loopvectorization	37	7116	June 25, 2021
Julia vs NumPy broadcasting New to Julia	23	751	July 22, 2025

Speeding up matrix exponential and matrix multiplication

Related topics