Speeding up matrix exponential and matrix multiplication

My only 2c there would be to try adding BLAS.set_num_threads(16) after the using ... lines, or to try using MKL, and see what happens.

Otherwise it seems to be something pretty platform-specific, and one would have to go lower level to understand what’s going on. If the time is spent mostly on matrix operations, you are really comparing the low-level BLAS routines that Julia and Python call on your system, and normally those shouldn’t differ much.
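
For concreteness, here is a minimal sketch of the two variants to try at the top of the script (the thread count 16 is just the example from above; loading MKL.jl switches subsequent BLAS calls over to MKL):

using LinearAlgebra
BLAS.set_num_threads(16)   # variant 1: raise the OpenBLAS thread count

# variant 2: swap the BLAS backend (requires the MKL.jl package)
# using MKL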

1 Like

So for a final comparison,
1.) Julia without using MKL and without BLAS.set_num_threads(16)

real	0m24,170s
user	1m33,913s
sys	1m16,317s

2.) Julia without using MKL and with BLAS.set_num_threads(16)

real	1m1,223s
user	6m57,598s
sys	8m0,850s

3.) Julia with using MKL and without BLAS.set_num_threads(16)

real	0m17,349s
user	1m53,660s
sys	0m1,825s

4.) Python

real	0m11,892s
user	1m33,676s
sys	0m0,569s

It seems that, for some unknown platform-specific reason, Python is acing it in my setup.

1 Like

My results on a Core i7-9700 running Manjaro Linux are very different from @eveningsilverfox’s. I get the following:

 ~/s/julia/e/maxtrix_exp  time python3 Ptest.py  
python3 Ptest.py  145.21s user 0.04s system 99% cpu 2:25.35 total

 ~/s/julia/e/maxtrix_exp  time julia --project Jtest.jl   
julia --project Jtest.jl  40.05s user 21.53s system 331% cpu 18.589 total

The Julia script runs much faster. Since I’m not a Pythonista, it’s possible that I don’t have an optimally configured Python installation.

1 Like

Well, I have no idea what I did, but I installed Linux on my laptop last week, and along with it came a fresh serving of the latest Anaconda distribution.

I did some reading and installed the MKL-based NumPy package. Now the Python code is very fast:

 ~/s/julia/e/maxtrix_exp  time python3 Ptest.py
python3 Ptest.py  79.61s user 0.86s system 782% cpu 10.287 total

The Julia timing with using MKL added to the script:

~/s/julia/e/maxtrix_exp  time julia --project Jtest.jl
julia --project Jtest.jl  82.62s user 1.32s system 588% cpu 14.271 total

So my results are now similar to @eveningsilverfox’s. Based on the CPU percentages reported by the time command, it looks like Python is multithreading more effectively than Julia here.
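
One quick sanity check on the Julia side is to confirm how many threads are actually in play; a minimal snippet:

using LinearAlgebra
@show Threads.nthreads()       # Julia threads (set with `julia -t N`)
@show BLAS.get_num_threads()   # threads the BLAS backend will use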

1 Like

The greater time required for Julia is due to compilation, at least on my machine. From the Julia prompt:

julia> @time include("Jtest.jl")
 14.466155 seconds (8.89 M allocations: 24.501 GiB, 3.11% gc time, 26.79% compilation time)

julia> @time computethreads();
  9.862676 seconds (381.32 k allocations: 23.973 GiB, 1.83% gc time)

The timing for the include statement includes almost 27% compilation time. Once the code has been compiled, reexecuting the function only requires about 10 seconds, similar to the Python code.

For larger problems with longer execution times, the startup and compilation overhead remains constant and constitutes a smaller and smaller fraction of the total elapsed time.

1 Like

Ok, so it seems that the takeaway is:

  1. Python is using MKL, and that is part of the difference.
  2. The other part of the difference, in the MWE, is Julia’s compilation.

That makes sense, and it’s expected in the case where the computation is bound by the linear-algebra operations alone.

One thing that could be tried, in a real scenario, is to use @threads in a loop combined with BLAS.set_num_threads(1) (or the equivalent for MKL), if, as I understood you have identified, one of the loops can be trivially parallelized. Then, instead of threading the linear-algebra operations, one threads at a higher level, and that might be better; see the sketch below.
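
Here is a minimal sketch of that pattern, assuming a vector Hs of independent matrices (the names outer_threaded!, Hs, and results are illustrative, not from the original code):

using LinearAlgebra

function outer_threaded!(results, Hs)
    nbt = BLAS.get_num_threads()
    BLAS.set_num_threads(1)       # one BLAS thread per Julia thread avoids oversubscription
    Threads.@threads for i in eachindex(Hs)
        results[i] = exp(Hs[i])   # each iteration is independent linear algebra
    end
    BLAS.set_num_threads(nbt)     # restore the previous setting
    return results
end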

2 Likes

Edit: The trivially parallel loop that was present in the full original example is not included in what @eveningsilverfox placed into “Jtest.jl”. As pointed out by @DNF below, what remains is not thread-safe.

I’m not sure exactly why H is complex with zero-valued imaginary parts, or whether that is part of the original assumptions. But if H can actually be considered real-symmetric, I would try this:

using LinearAlgebra, BenchmarkTools, ExponentialUtilities

Hreal = [ -1.0  1.0  -1.0   0.0   0.0   0.0   0.0  0.0
           1.0  1.0   0.0   1.0   0.0   0.0   0.0  0.0
          -1.0  0.0  -1.0   1.0   1.0   0.0   0.0  0.0
           0.0  1.0   1.0   1.0   0.0  -1.0   0.0  0.0
           0.0  0.0   1.0   0.0  -1.0   1.0  -1.0  0.0
           0.0  0.0   0.0  -1.0   1.0   1.0   0.0  1.0
           0.0  0.0   0.0   0.0  -1.0   0.0  -1.0  1.0
           0.0  0.0   0.0   0.0   0.0   1.0   1.0  1.0]

Hcomp = Complex.(Hreal)
Hsym = Symmetric(Hreal)

@btime exponential!(-0.001im * $Hcomp)
# 52.300 μs (9 allocations: 6.72 KiB)

@btime cis(-0.001 * $Hsym)
# 10.800 μs (13 allocations: 7.72 KiB)

julia> cis(-0.001 * Hsym) ≈ exponential!(-0.001im * Hcomp)
true

Also, this:

Uu1ar[:, :, hi] .= f1 * Uu1ar[:, :, hi+1] * f2;

could be done in place (using mul! with a preallocated buffer) and could use views to avoid copying the slices; see the sketch below.
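
A minimal sketch of that pattern, assuming f1 and f2 are matrices as in the quoted line (the scratch buffer tmp is illustrative):

using LinearAlgebra

tmp = similar(f1)          # preallocated scratch matrix
for hi = gh-1:-1:1
    dst = @view Uu1ar[:, :, hi]
    src = @view Uu1ar[:, :, hi+1]
    mul!(tmp, f1, src)     # tmp = f1 * src, no fresh allocation
    mul!(dst, tmp, f2)     # dst = tmp * f2, written in place
end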

Note that cis on a real-symmetric matrix, similar to exp on a Hermitian matrix, simply uses diagonalization. So you can also use eigen or eigen! directly if you want more control.
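
For instance, a sketch of the diagonalization route under the same real-symmetric assumption, reusing Hreal and Hsym from above:

F = eigen(Symmetric(-0.001 * Hreal))                    # real eigenvalues, orthogonal eigenvectors
U = F.vectors * Diagonal(cis.(F.values)) * F.vectors'   # V * Diagonal(exp.(im*λ)) * V'
U ≈ cis(-0.001 * Hsym)                                  # true, up to floating-point error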

Is this thread-safe? Doesn’t each iteration depend on the previous one?

You’re right. It is not thread-safe. I had originally, in an earlier post involving the OP’s original, longer code, reported on threading a different loop that was safe. That loop is not included in this shorter code snippet. Here there is no opportunity for threading. I’ll note the error in my previous post.

This expands on it:

Hm. I didn’t try MKL until just now. That changes the picture dramatically:

# with MKL
julia> @btime exponential!(-0.001im * $Hcomp)
  3.750 μs (9 allocations: 6.72 KiB)  # 14x faster than OpenBLAS!!

julia> @btime cis(-0.001 * $Hsym);
  9.133 μs (13 allocations: 7.36 KiB)

I consider a Hermitian H; here it’s real-symmetric just for simplicity.

OK, it’s just that sometimes simplifying things makes it harder to follow and understand the problem. I suggest not simplifying to the point where essential properties of the problem are lost.

1 Like

Sure. Fair point.

@eveningsilverfox , by restructuring the code slightly in your Jtest.jl example, we can use @lmiq’s suggestion and safely incorporate multithreading for part of the computation. This provides a 20% or so speedup, not counting startup/compilation time. Whether or not this can also be used in your actual application depends on details you haven’t yet provided. More on this after the code and timing results.

Here are the contents of Jtest2.jl:

using LinearAlgebra
using ExponentialUtilities
using ExponentialUtilities: alloc_mem
using MKL
BLAS.set_num_threads(1)


function computethreads()
    N1 = 25; N = 2*N1  # size of Hamiltonian matrix H is 2N by 2N

    #Hamiltonian matrix H
    h0 = [-1 0; 0 1]
    hp = [-1 0; 0 1]
    hm = [-1 0; 0 1]
    H0 = kron(Diagonal(ones(N1)), h0) + kron(diagm(1=>ones(N1-1)), hp) + kron(diagm(-1=>ones(N1-1)), hm)
    H = [H0  zeros(ComplexF64, size(H0))
         zeros(ComplexF64, size(H0))  H0]
    #@show size(H)
    for mn = 1:N
        H[2*mn-1,2*mn] = H[2*mn,2*mn-1] = 1
    end
    H[(2*N1-1):2*N1, (2*N1+1):2*N1+2] .= [1 0; 0 -1]
    H[(2*N1+1):2*N1+2, (2*N1-1):2*N1] .= [1 0; 0 -1]
    
    #method = ExpMethodHigham2005() # fastest
    #cache = alloc_mem(H, method) 
    
    #time
    tmin = -20; tmax = 20;
    Nt0 = 200; tar0 = range(tmin,tmax,Nt0); deltat0 = tar0[2] - tar0[1]

    Uu1armax = zeros(ComplexF64, 2N, 2N, Nt0)
    for gh = Nt0:-1:2
        
        E1 = exponential!((-im * 0.001) * H);

        Uu1ar = (@view Uu1armax[:, :, 1:gh]); Uu1ar[:, :, gh] .= I(2N)
        f1 = exp(-0.5 * deltat0)

        nbt = BLAS.get_num_threads()
        BLAS.set_num_threads(1)
        Threads.@threads for hi = gh-1:-1:1
            Uu1ar[:, :, hi] .= exponential!(-im * deltat0 * H)
        end
        BLAS.set_num_threads(nbt)

        for hi = gh-1:-1:1
            Uu1ar[:, :, hi] .= f1 * Uu1ar[:, :, hi+1] * Uu1ar[:, :, hi]
        end
    end
end

computethreads();

and here are timing results using a fresh Julia REPL:

 ~/s/julia/e/maxtrix_exp  julia --project -t8 
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.2 (2023-07-05)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> @time include("Jtest2.jl")
 12.108647 seconds (9.08 M allocations: 27.478 GiB, 6.97% gc time, 35.03% compilation time)

julia> @time computethreads();
  7.813464 seconds (431.13 k allocations: 26.940 GiB, 8.61% gc time)

The second run, after package loading and compilation, takes about 7.8 seconds versus 9.9 seconds for the original code in Jtest.jl.

In the code above, the array Uu1ar is used as temporary storage for multithreaded precomputation of all the matrix exponentials. You’ve previously stated that in your real application the matrix being exponentiated to form f2 depends on the loop index hi. So whether or not you can do the multithreading as shown depends on whether this matrix can be computed independently for arbitrary loop indices, independent of the order of loop iteration.
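
If that condition holds, the threaded precompute would take a form like the following, where H_of(hi) is a hypothetical stand-in for however your index-dependent matrix is built:

Threads.@threads for hi = gh-1:-1:1
    # safe as long as H_of(hi) can be evaluated independently of iteration order
    Uu1ar[:, :, hi] .= exponential!(-im * deltat0 * H_of(hi))
end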

1 Like

This is nice. H depends only on the present value of hi, so it is giving a good speed-up.

1 Like