Speeding up matrix exponential and matrix multiplication

My only 2c there would be to try adding BLAS.set_num_threads(16) after the using ... lines, or to try using MKL, and see what happens.

Otherwise it seems to be something pretty platform-specific, and one would have to go lower level to understand what’s going on. If the time is spent mostly on matrix operations, you are really comparing the low-level BLAS routines that Julia and Python call on your system, and normally those shouldn’t differ much.
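
For concreteness, here is a minimal sketch of the two variants to try at the top of the script (the thread count 16 is just the example from above; loading MKL.jl switches subsequent BLAS calls over to MKL):

using LinearAlgebra
BLAS.set_num_threads(16)   # variant 1: raise the OpenBLAS thread count

# variant 2: swap the BLAS backend (requires the MKL.jl package)
# using MKL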

1 Like

So for a final comparison,
1.) Julia without using MKL and without BLAS.set_num_threads(16)

real	0m24,170s
user	1m33,913s
sys	1m16,317s

2.) Julia without using MKL and with BLAS.set_num_threads(16)

real	1m1,223s
user	6m57,598s
sys	8m0,850s

3.) Julia with using MKL and without BLAS.set_num_threads(16)

real	0m17,349s
user	1m53,660s
sys	0m1,825s

4.) Python

real	0m11,892s
user	1m33,676s
sys	0m0,569s

It seems that, for some unknown platform-specific reason, Python is acing it in my setup.

1 Like

My results on a Core i7-9700 running Manjaro Linux are very different from @eveningsilverfox’s. I get the following:

 ~/s/julia/e/maxtrix_exp  time python3 Ptest.py  
python3 Ptest.py  145.21s user 0.04s system 99% cpu 2:25.35 total

 ~/s/julia/e/maxtrix_exp  time julia --project Jtest.jl   
julia --project Jtest.jl  40.05s user 21.53s system 331% cpu 18.589 total

The Julia script runs much faster. Since I’m not a Pythonista, it’s possible that I don’t have an optimally configured Python installation.

1 Like

Well, I have no idea what I did, but I installed Linux on my laptop last week, and along with it came a fresh serving of the latest Anaconda distribution.

I did some reading and installed the MKL-based NumPy package. Now the Python code is very fast:

 ~/s/julia/e/maxtrix_exp  time python3 Ptest.py
python3 Ptest.py  79.61s user 0.86s system 782% cpu 10.287 total

The Julia timing with using MKL added to the script:

~/s/julia/e/maxtrix_exp  time julia --project Jtest.jl
julia --project Jtest.jl  82.62s user 1.32s system 588% cpu 14.271 total

So my results are now similar to @eveningsilverfox’s. Based on the CPU percentages reported by the time command, it looks like Python is multithreading more effectively than Julia here.
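
One quick sanity check on the Julia side is to confirm how many threads are actually in play; a minimal snippet:

using LinearAlgebra
@show Threads.nthreads()       # Julia threads (set with `julia -t N`)
@show BLAS.get_num_threads()   # threads the BLAS backend will use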

1 Like

The greater time required for Julia is due to compilation, at least on my machine. From the Julia prompt:

julia> @time include("Jtest.jl")
 14.466155 seconds (8.89 M allocations: 24.501 GiB, 3.11% gc time, 26.79% compilation time)

julia> @time computethreads();
  9.862676 seconds (381.32 k allocations: 23.973 GiB, 1.83% gc time)

The timing for the include statement includes almost 27% compilation time. Once the code has been compiled, reexecuting the function only requires about 10 seconds, similar to the Python code.

For larger problems with longer execution times, the startup and compilation overhead remains constant and constitutes a smaller and smaller fraction of the total elapsed time.

1 Like

Ok, so it seems that the takeaway is:

  1. Python is using MKL, and that is part of the difference.
  2. The other part of the difference, in the MWE, is Julia’s compilation.

That makes sense, and it’s expected in the case where the computation is bound by the linear-algebra operations alone.

One thing that could be tried, in a real scenario, is to use @threads in a loop combined with BLAS.set_num_threads(1) (or the equivalent for MKL), if, as I understood you have identified, one of the loops can be trivially parallelized. Then, instead of threading the linear-algebra operations, one threads at a higher level, and that might be better; see the sketch below.
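
Here is a minimal sketch of that pattern, assuming a vector Hs of independent matrices (the names outer_threaded!, Hs, and results are illustrative, not from the original code):

using LinearAlgebra

function outer_threaded!(results, Hs)
    nbt = BLAS.get_num_threads()
    BLAS.set_num_threads(1)       # one BLAS thread per Julia thread avoids oversubscription
    Threads.@threads for i in eachindex(Hs)
        results[i] = exp(Hs[i])   # each iteration is independent linear algebra
    end
    BLAS.set_num_threads(nbt)     # restore the previous setting
    return results
end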

2 Likes

Edit: The trivially parallel loop that was present in the full original example is not included in what @eveningsilverfox placed into “Jtest.jl”. As pointed out by @DNF below, what remains is not thread-safe.

I’m not sure exactly why H is complex with zero-valued imaginary parts, or whether that is part of the original assumptions. But if H can actually be considered real-symmetric, I would try this:

using LinearAlgebra, BenchmarkTools, ExponentialUtilities

Hreal = [ -1.0  1.0  -1.0   0.0   0.0   0.0   0.0  0.0
           1.0  1.0   0.0   1.0   0.0   0.0   0.0  0.0
          -1.0  0.0  -1.0   1.0   1.0   0.0   0.0  0.0
           0.0  1.0   1.0   1.0   0.0  -1.0   0.0  0.0
           0.0  0.0   1.0   0.0  -1.0   1.0  -1.0  0.0
           0.0  0.0   0.0  -1.0   1.0   1.0   0.0  1.0
           0.0  0.0   0.0   0.0  -1.0   0.0  -1.0  1.0
           0.0  0.0   0.0   0.0   0.0   1.0   1.0  1.0]

Hcomp = Complex.(Hreal)
Hsym = Symmetric(Hreal)

@btime exponential!(-0.001im * $Hcomp)
# 52.300 μs (9 allocations: 6.72 KiB)

@btime cis(-0.001 * $Hsym)
# 10.800 μs (13 allocations: 7.72 KiB)

julia> cis(-0.001 * Hsym) ≈ exponential!(-0.001im * Hcomp)
true

Also, this:

Uu1ar[:, :, hi] .= f1 * Uu1ar[:, :, hi+1] * f2;

could be done in place (using mul! with a preallocated buffer) and could use views to avoid copying the slices; see the sketch below.
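
A minimal sketch of that pattern, assuming f1 and f2 are matrices as in the quoted line (the scratch buffer tmp is illustrative):

using LinearAlgebra

tmp = similar(f1)          # preallocated scratch matrix
for hi = gh-1:-1:1
    dst = @view Uu1ar[:, :, hi]
    src = @view Uu1ar[:, :, hi+1]
    mul!(tmp, f1, src)     # tmp = f1 * src, no fresh allocation
    mul!(dst, tmp, f2)     # dst = tmp * f2, written in place
end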

Note that cis on a real-symmetric matrix, similar to exp on a Hermitian matrix, simply uses diagonalization. So you can also use eigen or eigen! directly if you want more control.
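
For instance, a sketch of the diagonalization route under the same real-symmetric assumption, reusing Hreal and Hsym from above:

F = eigen(Symmetric(-0.001 * Hreal))                    # real eigenvalues, orthogonal eigenvectors
U = F.vectors * Diagonal(cis.(F.values)) * F.vectors'   # V * Diagonal(exp.(im*λ)) * V'
U ≈ cis(-0.001 * Hsym)                                  # true, up to floating-point error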

Is this thread-safe? Doesn’t each iteration depend on the previous one?

You’re right. It is not thread-safe. I had originally, in an earlier post involving the OP’s original, longer code, reported on threading a different loop that was safe. That loop is not included in this shorter code snippet. Here there is no opportunity for threading. I’ll note the error in my previous post.

This expands on it:

Hm. I didn’t try MKL until just now. That changes the picture dramatically:

# with MKL
julia> @btime exponential!(-0.001im * $Hcomp)
  3.750 μs (9 allocations: 6.72 KiB)  # 14x faster than OpenBLAS!!

julia> @btime cis(-0.001 * $Hsym);
  9.133 μs (13 allocations: 7.36 KiB)

I consider a Hermitian H; here it’s real-symmetric just for simplicity.

OK, it’s just that sometimes simplifying things makes it harder to follow and understand the problem. I suggest not simplifying to the point where essential properties of the problem are lost.

1 Like

Sure. Fair point.

@eveningsilverfox , by restructuring the code slightly in your Jtest.jl example, we can use @lmiq’s suggestion and safely incorporate multithreading for part of the computation. This provides a 20% or so speedup, not counting startup/compilation time. Whether or not this can also be used in your actual application depends on details you haven’t yet provided. More on this after the code and timing results.

Here are the contents of Jtest2.jl:

using LinearAlgebra
using ExponentialUtilities
using ExponentialUtilities: alloc_mem
using MKL
BLAS.set_num_threads(1)


function computethreads()
    N1 = 25; N = 2*N1  # size of Hamiltonian matrix H is 2N by 2N

    #Hamiltonian matrix H
    h0 = [-1 0; 0 1]
    hp = [-1 0; 0 1]
    hm = [-1 0; 0 1]
    H0 = kron(Diagonal(ones(N1)), h0) + kron(diagm(1=>ones(N1-1)), hp) + kron(diagm(-1=>ones(N1-1)), hm)
    H = [H0  zeros(ComplexF64, size(H0))
         zeros(ComplexF64, size(H0))  H0]
    #@show size(H)
    for mn = 1:N
        H[2*mn-1,2*mn] = H[2*mn,2*mn-1] = 1
    end
    H[(2*N1-1):2*N1, (2*N1+1):2*N1+2] .= [1 0; 0 -1]
    H[(2*N1+1):2*N1+2, (2*N1-1):2*N1] .= [1 0; 0 -1]
    
    #method = ExpMethodHigham2005() # fastest
    #cache = alloc_mem(H, method) 
    
    #time
    tmin = -20; tmax = 20;
    Nt0 = 200; tar0 = range(tmin,tmax,Nt0); deltat0 = tar0[2] - tar0[1]

    Uu1armax = zeros(ComplexF64, 2N, 2N, Nt0)
    for gh = Nt0:-1:2
        
        E1 = exponential!((-im * 0.001) * H);

        Uu1ar = (@view Uu1armax[:, :, 1:gh]); Uu1ar[:, :, gh] .= I(2N)
        f1 = exp(-0.5 * deltat0)

        nbt = BLAS.get_num_threads()
        BLAS.set_num_threads(1)
        Threads.@threads for hi = gh-1:-1:1
            Uu1ar[:, :, hi] .= exponential!(-im * deltat0 * H)
        end
        BLAS.set_num_threads(nbt)

        for hi = gh-1:-1:1
            Uu1ar[:, :, hi] .= f1 * Uu1ar[:, :, hi+1] * Uu1ar[:, :, hi]
        end
    end
end

computethreads();

and here are timing results using a fresh Julia REPL:

 ~/s/julia/e/maxtrix_exp  julia --project -t8 
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.2 (2023-07-05)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> @time include("Jtest2.jl")
 12.108647 seconds (9.08 M allocations: 27.478 GiB, 6.97% gc time, 35.03% compilation time)

julia> @time computethreads();
  7.813464 seconds (431.13 k allocations: 26.940 GiB, 8.61% gc time)

The second run, after package loading and compilation, takes about 7.8 seconds versus 9.9 seconds for the original code in Jtest.jl.

In the code above, the array Uu1ar is used as temporary storage for multithreaded precomputation of all the matrix exponentials. You’ve previously stated that in your real application the matrix being exponentiated to form f2 depends on the loop index hi. So whether or not you can do the multithreading as shown depends on whether this matrix can be computed independently for arbitrary loop indices, independent of the order of loop iteration.
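
If that condition holds, the threaded precompute would take a form like the following, where H_of(hi) is a hypothetical stand-in for however your index-dependent matrix is built:

Threads.@threads for hi = gh-1:-1:1
    # safe as long as H_of(hi) can be evaluated independently of iteration order
    Uu1ar[:, :, hi] .= exponential!(-im * deltat0 * H_of(hi))
end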

1 Like

This is nice. H depends only on the present value of hi, so it is giving a good speed-up.

1 Like