How to implement hybrid parallelized programs

Hello. I have scientific computing code that contains nested loops. So far I have parallelized it by multithreading one of the loops. This improved performance, but I need to reduce the computation time further. To increase efficiency, I would like to try hybrid parallelization by running one of the loops across multiple processes.

How can I implement hybrid parallelism with M processes, each running N threads? I prefer built-in functionality like Distributed over other packages. Is it possible to specify the number of threads running on each process?

In the future, I would like to run hybrid parallelized programs on a small cluster by connecting several computers with ssh. Can I do that?

[added: Feb. 2nd, 17:30]
In my program there are diagonalizations of large matrices (dimensions of several thousand and more), and this procedure sits in the innermost loops. Running it on a single computer exhausts the available computational resources, so I am planning to use multiple computers simultaneously.

Using more processes generally won’t increase efficiency. As long as everything is on the same machine, threads are usually the best option. There are some corner cases when BLAS is involved (excellent writeup here).

Without knowing a bit more about your code it will be very hard to help you. Have you checked whether the CPU usage is already maxed out, or whether there is still room for improvement? Are there perhaps other improvements you could make to the code, e.g. reducing allocations, since allocations hinder scaling with threads?

Multiple processes and Distributed.jl will later help you distribute the computation to multiple computers, but of course it still makes sense to optimize the code beforehand.

Thank you. In my opinion, the heaviest part of my programs is the diagonalization of large matrices in the innermost loops, so most of the computation time is spent on diagonalizations, and I have no idea how to improve the performance of the diagonalization. Are there any ways?

That uses BLAS and there are some pitfalls concerning threading and BLAS since BLAS manages its own threads. I highly recommend reading the writeup linked above (Julia Threads + BLAS Threads · ThreadPinning.jl) for details.

The short form is: you should call BLAS.set_num_threads(1) and use as many Julia threads as there are cores, provided your total memory can fit one matrix per thread. If you don’t have sufficient memory, it depends on whether you use MKL.jl or not. With MKL.jl it is easier, as you can set BLAS to use 2 threads and halve the number of Julia threads. With OpenBLAS you’ll need to use processes to achieve a similar setup.

Anyway, you should just try whether MKL is faster or slower for your computation before you start worrying about the threading.
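For concreteness, here is a minimal sketch of the single-threaded-BLAS setup described above (the matrix size and the thread count are placeholders; start Julia with e.g. julia -t 8):

using LinearAlgebra
# using MKL   # optionally switch the BLAS backend to MKL and compare timings

BLAS.set_num_threads(1)      # one BLAS thread, so BLAS and Julia threads don't compete
@show Threads.nthreads()     # controlled by `julia -t N` or JULIA_NUM_THREADS
@show BLAS.get_num_threads()

# each Julia thread runs its own single-threaded diagonalization
Threads.@threads for i in 1:8
    A = rand(ComplexF64, 500, 500)
    eigvals!(Hermitian(A + A'))  # placeholder Hermitian eigenproblem
end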

Another thing to consider: Do you actually need the full spectrum of the matrices? E.g. in many quantum physics applications one can use alternatives such as sparse diagonalization or Krylov methods.
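Purely as an illustration (not your code): with a package like KrylovKit.jl you can ask for just a few extremal eigenvalues instead of the full spectrum. The matrix and the count below are placeholders:

using LinearAlgebra, KrylovKit

A = rand(ComplexF64, 2000, 2000)
H = Hermitian(A + A')            # placeholder Hermitian matrix

# 6 eigenvalues with smallest real part, instead of all 2000
vals, vecs, info = eigsolve(H, 6, :SR; ishermitian = true)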

You may also look into reducing memory allocations. The in-place eigen! might help.
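A tiny sketch of the difference (placeholder matrix): eigen!/eigvals! overwrite their argument instead of working on an internal copy, which saves one large allocation per call.

using LinearAlgebra

A = rand(ComplexF64, 2000, 2000)
H = Hermitian(A + A')     # placeholder Hermitian matrix

w_copy    = eigvals(H)    # allocates a copy of H internally
w_inplace = eigvals!(H)   # reuses H's storage; H is destroyed afterwards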

If you are considering using a cluster of CPUs, also consider running the diagonalization on a GPU (e.g. cuSOLVER), which is wrapped by CUDA.jl.

If you’re bottlenecked on “standard” linear algebra, then a GPU tends to be better bang for the buck than throwing more processors at it.

Thank you. Actually, I am a PhD student at a university, and there are no GPUs on the computers in my laboratory right now. So I will try to ask my supervisor to install a GPU on one of the computers.

I didn’t know about the eigen! method. It may help me. Thank you.

Thank you. I will look into the threading settings for BLAS. I hadn’t paid any attention to this before.

Unfortunately, I need the full spectrum of the matrices, so I may not be able to use other algorithms.

By setting the number of threads for BLAS appropriately and using the eigvals! method, I could make my programs a little faster. Thank you.

However, I still need to reduce computation time much more. May I have some comments on hybrid parallelization in Julia?

When you can already run your program with as many threads as you have cores, there is likely nothing to gain from using more threads or from combining processes with threads. You’ll need to optimize the program itself.
Have you profiled your code and checked where it spends most of its time? Can you make improvements there, e.g. reduce allocations or fix type instabilities?
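The standard tools for this look roughly like the following (my_heavy_function and its input are placeholders, just to show the calls):

using Profile, InteractiveUtils

my_heavy_function(x) = sum(abs2, x)   # stand-in for the real code

x = rand(1000)
my_heavy_function(x)                  # warm up (compile) before measuring

@profile my_heavy_function(x)         # where is the time spent?
Profile.print(format = :flat, sortedby = :count)

@time my_heavy_function(x)            # reports time and allocations
@code_warntype my_heavy_function(x)   # look for `Any`/red entries (type instabilities)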

Generic comments are only moderately helpful and my crystal ball is unfortunately out of order. Can you post your code (or a reduced version), so we can have a look?

For multi-machine parallelism have a look at Distributed.jl and SharedArrays.jl. If you need/want some help using these with your code, please post some reduced toy version of your code and what you tried so far :slight_smile:
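To give a flavour of what that can look like (hostnames, worker counts, and thread counts below are placeholders, not a recommendation for your setup): workers are added with addprocs, either locally or over passwordless ssh, and each worker can be started with its own --threads setting via exeflags.

using Distributed

# 4 local worker processes, each started with 2 Julia threads
addprocs(4; exeflags = "--threads=2")

# workers on other machines reachable via passwordless ssh would look like
# addprocs([("node1", 2), ("node2", 2)]; exeflags = "--threads=2")

@everywhere using LinearAlgebra
@everywhere BLAS.set_num_threads(1)
@everywhere @show (myid(), Threads.nthreads())

# independent work items (e.g. different x values) are then spread over the workers
results = pmap(x -> x^2, 0:0.1:20)   # x^2 is a stand-in for the real per-x computation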

Here is a simplified version of my code.
We need the numerical derivative of calc_function(x). Inside the function, we integrate the values of sub_calculation(), which uses the spectrum of a matrix for each k value. The loop over k values is the innermost loop, because I want to parallelize the outer loops with multiprocessing in the future. For efficiency, I tried pre-allocating, vectorization, and accessing arrays along columns.

In my first plan, I wanted to use several computers and have them run calc_function() for different x values. I looked into combining Threads and Distributed.jl, and I found that we can get hybrid parallelization on a single computer simply by specifying the options -p m -t n. But I cannot find out how to run n threads on each process.

using LinearAlgebra  # for eigvals!

# calculate the needed values for some x
# it contains the loop over k values, which is used for the integration
function calc_function(x)
	kvalue_range = range(0, 0.5, 201)
	result_vector = zeros(201)
	Threads.@threads for k_ind in 1:201
		kvalue = kvalue_range[k_ind]
		energy = eigvals!( make_matrix(kvalue) ) 
			# make_matrix make a thousands x thousands matrix
			# we need different matrices for each kvalues
		result_vector[k_ind] = sub_calculation(energy)
	end

	result = integrate(result_vector)	
		# integrate uses the trapezoidal rule: just adding values
	return result
end

const dx = 0.01 # step for the numerical derivative
container = zeros(201)
for (x_ind, x) in enumerate(0:0.1:20)
	Res_minus = calc_function(x - dx)
	Res_plus = calc_function(x + dx)
	container[x_ind] = (Res_plus - Res_minus) / (2*dx) # central difference
end

# open file and write result in `container`
# 

Thanks for the simplified code :slight_smile:
I think you might be able to gain quite some performance by reusing the memory of the matrix. Assuming you can write a version of make_matrix! that fills a preallocated matrix, you could try:

using ChunkSplitters
using LinearAlgebra  # for eigvals!

function calc_function(x)
    kvalue_range = range(0, 0.5, 201)
    result_vector = zeros(201)
    @sync for chunk in chunks(kvalue_range; n=Threads.nthreads())
        Threads.@spawn begin
            temp_matrix = zeros(1000, 1000) # or whatever type/size the matrix needs to have
            for k_ind in chunk
                kvalue = kvalue_range[k_ind]
                energy = eigvals!( make_matrix!(temp_matrix, kvalue) )
                    # make_matrix! fills the preallocated thousands x thousands matrix
                    # we need a different matrix for each kvalue
                result_vector[k_ind] = sub_calculation(energy)
            end
        end
    end

    result = integrate(result_vector)
        # integrate uses the trapezoidal rule: just adding values
    return result
end

This allocates only one matrix for each thread. You could profile sub_calculation separately to make sure it does not allocate, or else try to preallocate some more buffers. integrate is probably less important as it is called much less often, but you should still profile and verify that it does not take a large chunk of the runtime.
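Checking the allocations of sub_calculation can be as simple as the following (sub_calculation here is just a stand-in, since the real one isn’t shown):

using BenchmarkTools

sub_calculation(energy) = sum(abs2, energy)   # stand-in for the real function

energy = rand(1000)               # placeholder input shaped like a spectrum
sub_calculation(energy)           # warm up (compile) before timing
@time sub_calculation(energy)     # prints time plus allocation count and size
@btime sub_calculation($energy)   # more reliable numbers via BenchmarkTools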

Thank you for your advice. I tried splitting the loop into chunks and defining temp_matrix, but there was only a small decrease in time. By using the @profile macro, I found that the main bottlenecks are the multiplications and summations of large matrices in the make_matrix function. Are there any ways to make those operations faster than BLAS?

Well, BLAS is used under the hood for most basic linear algebra, but it is not magic. Why don’t you post your make_matrix code so we can have a look together? :slight_smile:

  1. eigen! is good, but it still has some overhead. Calling LAPACK directly will be better. Besides, you may want to carefully choose among zhegvx, zhegv, and zhegvd whichever is best for your problem, although some of them are not supported by LinearAlgebra.jl. You may also try ScaLAPACK or ELPA, but their Julia support is currently rather poor.
  2. Are you doing DFT calculations? If so, and if you are using iterative diagonalization techniques, there is a chance that you can rewrite your algorithm in a blocked manner, which will greatly reduce the computational cost. Roughly speaking, instead of diagonalizing an 800 x 800 matrix, you may diagonalize an 8 x 8 matrix 100 (maybe 200, or 400) times. Since the cost increases cubically with the matrix size, you will get a 10000x (maybe 5000x or 2500x) gain.

There is also FastLapackInterface.jl that allows for preallocation of workspaces so you can get repeated diagonalizations without allocations (and without diving into the LAPACK internals). But usually for large matrices I’d say eigen! is sufficient because the cost of allocating a bit of additional workspace is dwarfed by the time the diagonalization takes.

For allocations, yes, but then there is the later point that you may want to use subroutines other than the ones hidden under eigen!, like zhegv. If you want only some of the eigenvalues and eigenvectors, then zhegvx or zhegvd may be much better than zhegv. This is really the case in DFT calculations, which I strongly suspect the author is doing.

This is my code. The profiler says that most of the time is spent in the multiplications and summations that appear in the main loops.

# HELPER FUNCTIONS 
function mat_h_plus!(original::Array{ComplexF64}, coeff::Real)
    size_o = size(original, 1)
    for ind in 1:size_o-1
        original[ind,ind+1] = im * coeff * sqrt(ind) # procedure like this
    end
end

function mat_h_minus!(original::Array{ComplexF64}, coeff::Real) # similar to mat_h_plus!
end

# MAIN FUNCTION
function make_matrix(
    N::Integer, # large integer (~1000)
    M::Integer, # small integer (~10)
    mat_1::Vector{ComplexF64}, # size M
    arr_1::Array{ComplexF64, 3}, # MxMx3 array
    coeff::Real # coefficients for mat_hs
    )
   
    res = zeros(ComplexF64, M*N, M*N)

    mat_plus = zeros(ComplexF64, N, N)   # must match the N x N blocks of res below
    mat_minus = zeros(ComplexF64, N, N)
    mat_h_plus!(mat_plus, coeff)
    mat_h_minus!(mat_minus, coeff) 

    mat_x = (mat_plus - mat_minus)/sqrt(2) # something like this
    mat_y = # similar to calculation above
    diag = coeff_a * I(N) # diagonal matrix

    # assign values for off-diagonal parts
    for i = 1:M, j=1:i
        (
        res[1+N*(i-1):N*i, 1+N*(j-1):N*j] 
            = mat_x .* arr_1[i,j,1]  .+  mat_y .* arr_1[i,j,2]  .+  coeff_b .* arr_1[i,j,3]
        )
    end
    # diagonal parts
    for i=1:M
        (
        res[1+N*(i-1):N*i, 1+N*(i-1):N*i]
            .+= diag  .+   mat_1[i] * I(N)
        )
    end

    return Hermitian(res, :L) # only the (block) lower triangle is filled above
end

Thank you.

I simply diagonalize Hermitian matrices, so eigen!() should call zheev. I think it is sufficient.

Our calculation is not DFT. Though I don’t know whether we can use iterative techniques for our calculations, I’ll study them and give them a try.