Speeding up matrix exponential and matrix multiplication

I am very new to Julia and am trying to convert my Python code to Julia to speed up my work. However, the present Julia version, pasted below, still runs slower than my Python version.

I see two bottlenecks:
(1) the matrix exponential calculation, for which I have tried two methods. I have commented out the relatively slower one, fastExpm.
(2) the various matrix multiplications.

Are there any glaring inefficiencies in my code whose removal would significantly speed up the process? I am surprised that this takes longer than my Python code. I have set a small matrix size for the moment, N1=14, which creates a Hamiltonian of size 4x14=56 by 56. Ideally, this should work for N1~100.

using LinearAlgebra
using FastExpm
using ExponentialUtilities

N1 = 14; N = 2*N1; #size of Hamiltonian matrix (H) is 2*N by 2*N

#Hamiltonian matrix H
h0 = [-1 0; 0 1];
hp = [-1 0; 0 1];
hm = [-1 0; 0 1];
H0 = kron(Diagonal(ones(N1,N1)), h0) + kron(diagm(1=>ones(N1-1)), hp) + kron(diagm(-1=>ones(N1-1)), hm);
H = [H0 zeros(Complex{Float64}, size(H0))
     zeros(Complex{Float64}, size(H0)) H0];
for mn = 1:N
    H[2*mn-1,2*mn] = 1; H[2*mn,2*mn-1] = 1;
end
H[(2*N1-1):2*N1, (2*N1+1):2*N1+2] = [1 0; 0 -1];
H[(2*N1+1):2*N1+2, (2*N1-1):2*N1] = [1 0; 0 -1];       
    
#time
tmin = -80; tmax = 280;
Nt0 = 1500; tar0 = range(tmin,tmax,Nt0); deltat0 = tar0[2] - tar0[1];

#Green's function matrix
Glextt = zeros(Complex{Float64}, Nt0,2*N,2*N);
#Temporary green's function matrix, only used for the calculation for Glextt
Glesstt0 = zeros(Complex{Float64},2*N,2*N);

##-----------------CALCULATION WHICH NEEDS TO BE SPED UP----------------------------
for gh = Nt0:-1:2
    println(gh)
    Uu1ar = zeros(Complex{Float64},gh,2*N,2*N); Uu1ar[gh,:,:] = Diagonal(ones(2*N));
    for hi = gh-1:-1:1
        #option 1 https://github.com/fmentink/FastExpm.jl
        #global Uu1ar[hi,:,:] = exp(-0.5*Gamma*deltat0)*Uu1ar[hi+1,:,:]*fastExpm(-im*H*deltat0);
        #option 2 https://github.com/SciML/ExponentialUtilities.jl
        global Uu1ar[hi,:,:] = exp(-0.5*deltat0)*Uu1ar[hi+1,:,:]*exponential!(-im*H*deltat0);
    end
    Gr = -im*Uu1ar;
    Gap = conj(permutedims(-im*Uu1ar, (1, 3, 2)));

    Siglex0 = zeros(Complex{Float64},gh,gh);
    for mn = 1:gh
        global Siglex0[mn,:] = -1 ./ (tar0[mn] .- tar0[1:gh] .+ im*deltat0);       
    end
    for pq = 1:N    
        global Glesstt0[2*pq-1,2*pq] = deltat0^2*tr( transpose(Gr[:,2*pq-1,:])*Siglex0*Gap[:,:,2*pq] );
    end
    global Glextt[gh,:,:] = Glesstt0; #G<(t,t)
end
##----------------------------------------------------------------------------------

A few general tips:

  1. If you want to take advantage of compilation, you need to put this into a function or functions.
  2. Eliminate global references.
  3. Be aware that Julia is column major.

https://docs.julialang.org/en/v1/manual/performance-tips/

3 Likes

You don’t need the full matrix exponential here, just its action on a vector. Remove the exponential calculation and use expv!

1 Like

But isn’t expv(t,A,b) used for exp(tA)*b, i.e., applying exp(tA) to the vector b from the left? Here I am multiplying by exp(tA) from the right, onto the matrix to its left (Uu1ar…).

Transpose

Thanks, but this one actually performs more slowly than exponential! or fastExpm:

using LinearAlgebra
using FastExpm
using ExponentialUtilities

N1 = 20; N = 2*N1; mu = 0; delt = 1; t0 = 2*delt;
T = 0.1*delt;

#1d
h0 = [-mu 0; 0 mu];
hp = [-t0/2 0; 0 t0/2];
hm = [-t0/2 0; 0 t0/2];
H0 = kron(Diagonal(ones(N1,N1)), h0) + kron(diagm(1=>ones(N1-1)), hp) + kron(diagm(-1=>ones(N1-1)), hm);
H = [H0 zeros(Complex{Float64}, size(H0))
     zeros(Complex{Float64}, size(H0)) H0];
phife = pi/3;
H[(2*N1-1):2*N1, (2*N1+1):2*N1+2] = [T*exp(-im*phife/2) 0; 0 -T*exp(im*phife/2)];
H[(2*N1+1):2*N1+2, (2*N1-1):2*N1] = [T*exp(im*phife/2) 0; 0 -T*exp(-im*phife/2)];       

d0 = 0.001;

t0a = time()
C = fastExpm(-im*H*d0);
t0b = time()

Ir = Diagonal(ones(Complex{Float64}, 2*N));
Cr = zeros(Complex{Float64}, 2*N,2*N);
t1a = time()
for ij = 1:2*N
    Cr[:,ij] = expv(1,-im*d0*transpose(H),Ir[:,ij]);
end
Cr = transpose(Cr);
t1b = time()

println(norm(Cr-C))
println(t0b-t0a)
println(t1b-t1a)

Please apply mkitti’s advice above.

1 Like

And use some views.

1 Like

I just want to point out that the example you posted will not run. The loop beginning

 for gh = Nt0:-1:2

has no end statement. I tried inserting an end after the last line, but the code still fails because Gamma is not defined. You would likely get a lot more useful suggestions if you supply a complete working example.

3 Likes

Thanks, apologies. Corrected.

I put the code from your (edited) original post in the file “original.jl” and commented out the println statement. The timing was as follows:

julia> @time include("original.jl")
 85.059808 seconds (43.22 M allocations: 39.862 GiB, 3.20% gc time, 0.06% compilation time)

I then implemented some of the previous suggestions in this thread, resulting in the following function:

using LinearAlgebra
using FastExpm
using ExponentialUtilities

function compute()
    N1 = 14; N = 2*N1 #size of Hamiltonian matrix (H) is 2*N by 2*N

    #Hamiltonian matrix H
    h0 = [-1 0; 0 1]
    hp = [-1 0; 0 1]
    hm = [-1 0; 0 1]
    H0 = kron(Diagonal(ones(N1)), h0) + kron(diagm(1=>ones(N1-1)), hp) + kron(diagm(-1=>ones(N1-1)), hm)
    H = [H0                           zeros(ComplexF64, size(H0))
         zeros(ComplexF64, size(H0))  H0]
    for mn = 1:N
        H[2*mn-1,2*mn] = 1; H[2*mn,2*mn-1] = 1
    end
    H[(2*N1-1):2*N1, (2*N1+1):2*N1+2] = [1 0; 0 -1]
    H[(2*N1+1):2*N1+2, (2*N1-1):2*N1] = [1 0; 0 -1]
    
    #time
    tmin = -80; tmax = 280;
    Nt0 = 1500; tar0 = range(tmin,tmax,Nt0); deltat0 = tar0[2] - tar0[1]

    #Green's function matrix
    Glextt = zeros(ComplexF64, 2*N, 2*N, Nt0)
    #Temporary green's function matrix, only used for the calculation for Glextt
    Glesstt0 = zeros(ComplexF64,2*N,2*N)

    ##-----------------CALCULATION WHICH NEEDS TO BE SPED UP----------------------------
    Uu1armax = zeros(ComplexF64, 2N, 2N, Nt0)
    Siglex0max = zeros(ComplexF64, Nt0, Nt0)
    for gh = Nt0:-1:2
        #println(gh)
        Uu1ar = (@view Uu1armax[:, :, 1:gh]); Uu1ar[:, :, gh] .= I(2N)
        f1 = exp(-0.5 * deltat0)
        f2 = exponential!(-im * H * deltat0)
        for hi = gh-1:-1:1
             Uu1ar[:, :, hi] = f1 * (@view Uu1ar[:, :, hi+1]) * f2
        end
        Gr = -im*Uu1ar

        Gap = conj(permutedims(-im*Uu1ar, (3, 2, 1)))

        Siglex0 = @view Siglex0max[1:gh, 1:gh] 
        for mn = 1:gh
            Siglex0[mn, :] .= -1 ./ (tar0[mn] .- tar0[1:gh] .+ im*deltat0)
        end
        for pq = 1:N    
            Glesstt0[2*pq-1, 2*pq] = deltat0^2 * tr((@view Gr[2*pq-1, :, :]) * Siglex0 * @view Gap[:, :, 2*pq])
        end
        Glextt[:, :, gh] .= Glesstt0 #G<(t,t)
    end
    ##----------------------------------------------------------------------------------
    return Glextt
end

Timing this function produced the following results:

julia> @time Glextt2 = compute();
 34.120436 seconds (1.17 M allocations: 1.684 GiB, 0.22% gc time)

And checking that the two codes returned the same result (modulo floating point errors and permuted dimensions):

julia> norm(Glextt2 - permutedims(Glextt, (2,3,1)))
8.164126053517088e-15

Reordering dimensions to exploit Julia’s column-major storage, preallocating arrays, and using views significantly reduced the memory allocation and runtime. This was without profiling. With profiling, I’ll bet that others here (or you!) can do even better.

3 Likes

It should be a little more advertised that one can just add @views in front of a whole code block, and then all slices inside it stop allocating.

With your code above:

julia> @time Glextt2 = compute();
 49.784395 seconds (1.17 M allocations: 1.684 GiB, 0.45% gc time, 0.15% compilation time)

Now I removed the @view statements and added a single @views in front of the for loop:

@views for gh = Nt0:-1:2

julia> @time Glextt2 = compute();
 49.550174 seconds (1.17 M allocations: 1.684 GiB, 0.34% gc time)

and the result is essentially the same, but simpler to code (particularly if you are used to slices not allocating, as with NumPy in Python).

I get some speedup here if I use MKL:

julia> using MKL

julia> @time Glextt2 = compute();
 43.656992 seconds (1.17 M allocations: 1.684 GiB, 0.20% gc time)
2 Likes

Interesting. For me it’s slightly but consistently slower:

julia> @time Glextt2 = compute();
 35.355297 seconds (1.17 M allocations: 1.684 GiB, 0.37% gc time)

My setup:

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8

Yes, MKL vs. OpenBLAS is somewhat dependent on the configuration details. I’m using 1.10, but probably the difference is that I’m using Linux (my CPU is slower, but it is also Intel).

Even on my Linux machine the times are virtually identical:

julia> @time Glextt2 = compute();
 31.631847 seconds (1.17 M allocations: 1.684 GiB, 0.14% gc time)

julia> using MKL

julia> @time Glextt2 = compute();
 31.869044 seconds (1.17 M allocations: 1.684 GiB, 0.07% gc time)

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1

I’m guessing that OpenBLAS works relatively well on a Core i7-9700, compared to other architectures, at least for these kinds of operations.

Thanks. Could it be that the bulk of the savings was due to evaluating f2 = exponential!(-im * H * deltat0) only once, outside the for hi = gh-1:-1:1 loop? I was actually using (not shown here for simplicity) an H which changes with the index hi, which is why I had it inside the loop in the first place.

Ideally you should profile the code (the VS Code @profview macro is probably the best tool there).

But you can also just test. For example:

julia> function test(H, deltat0, Nt0)
           local f2
           for gh = Nt0:-1:2
               for hi = gh-1:-1:1
                   f2 = exponential!(-im * H * deltat0)
               end
           end
           return f2
       end
test (generic function with 1 method)

julia> @time test(H, deltat0, Nt0)
  6.401251 seconds (11.25 M allocations: 2.765 GiB, 6.13% gc time, 0.19% compilation time)

That is still only a small fraction of the total time (here, ~50 s), so the greatest bottleneck is somewhere else.

You can greatly reduce the number of allocations by using in-place matrix multiplications with a preallocated buffer, like:

...
           m = similar(H) # create buffer
           @views for gh = Nt0:-1:2
               #println(gh)
               Uu1ar = (Uu1armax[:, :, 1:gh]); Uu1ar[:, :, gh] .= I(2N)
               f1 = exp(-0.5 * deltat0)
               f2 = exponential!(-im * H * deltat0)
               for hi = gh-1:-1:1
                    # in-place multiplication into the buffer m; note that f2
                    # must be left untouched, since it is reused every iteration
                    mul!(m, Uu1ar[:, :, hi+1], f2)
                    Uu1ar[:, :, hi] .= f1 .* m
               end
...

But that didn’t decrease the time significantly here (it may actually have increased it):

julia> @time c2 = compute2();
 53.926530 seconds (45.69 k allocations: 1.332 GiB, 0.06% gc time)

vs previously:

julia> @time Glextt2 = compute();
 43.656992 seconds (1.17 M allocations: 1.684 GiB, 0.20% gc time)

If the computation is mostly bound by the matrix multiplication now, then there is not much more that can be done (if the algorithm cannot be rearranged to avoid it).

Post a flame graph.

When I put it back into the loop as follows:

        for hi = gh-1:-1:1
            f2 = exponential!(-im * H * deltat0)
            Uu1ar[:, :, hi] .= f1 * (@view Uu1ar[:, :, hi+1]) * f2
        end

I get (back on the Windows machine using OpenBLAS)

julia> @time Glextt2 = compute2();
 72.423707 seconds (12.40 M allocations: 4.444 GiB, 0.49% gc time)

so yes, the bulk of the previous savings was due to pulling this out of the loop. However, the time savings, though not as dramatic, are still significant: from 85 seconds to 72.4 seconds. And the allocations are reduced from about 40 GiB to 4.4 GiB. I guess my next suggested steps would be profiling, and examining the optional inputs to exponential!.

Indeed, using alloc_mem would likely be helpful, but I’d need to see the flame graph to know whether that’s the actual time issue or whether the allocations don’t matter.