How to implement hybrid parallelized programs

If you simply need to diagonalize matrices, iterative techniques normally do not help much. They are a must in plane-wave DFT because the matrix can be as large as 10^6 × 10^6 while you only need about 10^3 eigenvectors.
Also, most iterative techniques build a reduced space that is still larger than the number of eigenvectors you need (often about twice as large) and then perform a direct diagonalization in that reduced space. Because of this, it may be a waste of time to compute all eigenvectors with zhegv. With zhegvx you can instead specify how many eigenvectors you need. zhegvd also computes all eigenvectors, but it uses a divide-and-conquer algorithm and is generally faster than zhegv.
I'm referring to the gv (generalized) variants because we need those in DFT, but the same applies to the ev variants. That said, you may want to try zheevd, or even zheevx if you need only some of the eigenvectors. If your matrix is small, though, there should be little difference.
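
For reference, these drivers are reachable from Julia through LinearAlgebra. A minimal sketch (the sizes and the number of wanted eigenpairs below are made up) of requesting only the lowest k eigenpairs of a Hermitian matrix instead of the full spectrum:

using LinearAlgebra

n, k = 2000, 100                  # illustrative sizes only
B = rand(ComplexF64, n, n)
A = Hermitian(B + B')             # an explicitly Hermitian test matrix

vals_all = eigvals(A)             # full spectrum
F = eigen(A, 1:k)                 # only the k lowest eigenpairs (index-range selection, zheevx/zheevr style)
F.values                          # k eigenvalues; F.vectors holds the corresponding eigenvectors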

If the assignment of the off-diagonal parts is the most time-consuming step, it may help to rearrange arr_1 so that it can be accessed as arr_i[1, j, i]. But since it is only about 10 × 10 × 3, this should have little impact.
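
For example (just a sketch; arr_r is a name I made up), that reordering is a single permutedims:

# Put the length-3 component axis first so the three coefficients used for one
# (i, j) block sit next to each other in memory; access as arr_r[k, j, i].
arr_r = permutedims(arr_1, (3, 2, 1))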

And I've tried the following:

using Random, BenchmarkTools

M = 10; N = 1000;  y = zeros(ComplexF64, M*N, M*N); a = zeros(Int, M, M); x = zeros(ComplexF64, N, N); rand!(x);
for i = 1 : M * M
    a[i] = Int(round(rand()*5))   # random integer coefficient in 0:5
end
function set_1!(y, a, x, M, N)
    for i = 1 : M, j = 1 : i
        # plain = first materializes the temporary a[i, j] .* x, then copies it into the block
        y[1+N*(i-1):N*i, 1+N*(j-1):N*j] = a[i, j] .* x
    end
end
function set_2!(y, a, x, M, N)
    for i = 1 : M, j = 1 : i
        # broadcasted .= fuses the scaling and the assignment, writing into y with no temporary
        y[1+N*(i-1):N*i, 1+N*(j-1):N*j] .= a[i, j] .* x
    end
end

and get the following results:

julia> @btime set_1!(y, a, x, M, N)
  338.428 ms (110 allocations: 839.24 MiB)

julia> @btime set_2!(y, a, x, M, N)
  90.999 ms (0 allocations: 0 bytes)

I then tried directly calling BLAS:

using LinearAlgebra: BLAS, BlasInt   # BlasInt is not exported by default

function set_4!(y, a, x, M, N)
    for p = 1 : N
        ptr_x = pointer(x, 1+N*(p-1))                        # start of column p of x
        for i = 1 : M, j = 1 : i
            coeff = a[i, j]
            ptr_y = pointer(y, 1+N*M*(N*(j-1)+p-1)+N*(i-1))  # column p of block (i, j) in y
            # write coeff * x[:, p] + 0 * y[...] into the block column via the zaxpby extension
            ccall((BLAS.@blasfunc("zaxpby_"), BLAS.libblas), Cvoid,
                  (Ref{BlasInt}, Ref{ComplexF64}, Ptr{ComplexF64}, Ref{BlasInt},
                   Ref{ComplexF64}, Ptr{ComplexF64}, Ref{BlasInt}),
                  N, coeff, ptr_x, 1, zero(ComplexF64), ptr_y, 1)
        end
    end
end

and the result is

julia> @btime set_4!(y, a, x, M, N)
  69.349 ms (0 allocations: 0 bytes)

Why 4? Because set_3! is just like set_4!, except that p = 1 : N is the innermost loop. It was slightly slower than set_2!, so I don't include it here.
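
For reference, roughly the same thing can be written with the documented BLAS.axpby! wrapper instead of a raw ccall. set_5! below is only a sketch of mine, and I have not benchmarked it against set_4!:

using LinearAlgebra

function set_5!(y, a, x, M, N)
    β = zero(ComplexF64)
    for i = 1 : M, j = 1 : i
        α = ComplexF64(a[i, j])
        for p = 1 : N
            xcol = @view x[:, p]
            ycol = @view y[1+N*(i-1):N*i, N*(j-1)+p]   # column p of block (i, j)
            BLAS.axpby!(α, xcol, β, ycol)              # ycol = α*xcol + β*ycol
        end
    end
    return y
end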

One minor note: Int(round(rand()*5)) could just be rand(0:5).

It appears to me that mat_h_plus! and mat_h_minus! construct spin raising and lowering operators for spins of size M (the factor of im there seems a bit non-standard to me, and sqrt(ind) is not quite right, but I assume you simplified the code for posting). Anyway, the point is that mat_plus and mat_minus are rather sparse, with only (M-1)/M^2 = O(1/M) of their entries nonzero, so mat_x and mat_y are similarly sparse. The following code

for i = 1:M, j=1:i
    res[1+N*(i-1):N*i, 1+N*(j-1):N*j] = mat_x .* arr_1[i,j,1]  .+  mat_y .* arr_1[i,j,2]  .+  coeff_b .* arr_1[i,j,3]
end

essentially performs a few Kronecker products and should be equivalent to

res = @views kron(arr_1[:,:,1], mat_x) + kron(arr_1[:,:,2], mat_y) + kron(arr_1[:,:,3], coeff_b * I(M))

The second loop can be simplified similarly and should be equivalent to:

res += kron(Diagonal(mat_1), I(M)) + I(M*N)*coeff_a
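
If in doubt, the block/Kronecker correspondence is easy to check on tiny sizes (throwaway names, not the real data):

using LinearAlgebra

m, n = 3, 4
A = randn(ComplexF64, m, m)        # plays the role of arr_1[:, :, 1]
X = randn(ComplexF64, n, n)        # plays the role of mat_x

res_loop = zeros(ComplexF64, m*n, m*n)
for i = 1:m, j = 1:m
    res_loop[1+n*(i-1):n*i, 1+n*(j-1):n*j] = A[i, j] .* X
end

res_loop ≈ kron(A, X)              # true: block (i, j) of kron(A, X) is A[i, j] * X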

So for better performance, I would use sparse matrices. Here is an attempt:

using LinearAlgebra, SparseArrays

# HELPER FUNCTIONS 
function mat_h_plus(size, coeff::Real)
    I = 1:size-1
    J = 2:size
    V = im*coeff*sqrt.(I)
    return sparse(I,J,V, size,size) # constructs a sparse size x size matrix
    # size_o = size(original, 1)
    # for ind in 1:size_o-1
    #     original[ind,ind+1] = im * coeff * sqrt(ind) # procedure like this
    # end
end

function mat_h_minus() # similar to mat_h_plus
end

# MAIN FUNCTION
function make_matrix(
    N::Integer, # large integer (~1000)
    M::Integer, # small integer (~10)
    mat_1::Vector{ComplexF64}, # size M
    arr_1::Array{ComplexF64, 3}, # MxMx3 array
    coeff::Real # coefficients for mat_hs
    )
   
    res = zeros(ComplexF64, M*N, M*N)

    mat_plus = mat_h_plus(M, coeff)
    mat_minus = mat_h_minus(M, coeff) # actually shouldn't that just be the Hermitian conjugate of mat_plus?
    # mat_minus = mat_plus' # hermitian conjugate

    mat_x = (mat_plus - mat_minus)/sqrt(2) # something like this
    mat_y = # similar to calculation above
    # diag = coeff_a * I(M) # diagonal matrix

    # assign values for off-diagonal parts
    res .+= @views kron(arr_1[:,:,1], mat_x)
    res .+= @views kron(arr_1[:,:,2], mat_y)
    res .+= @views kron(arr_1[:,:,3], coeff_b*I(M))
    # for i = 1:M, j=1:i
    #     (
    #     res[1+N*(i-1):N*i, 1+N*(j-1):N*j] 
    #         = mat_x .* arr_1[i,j,1]  .+  mat_y .* arr_1[i,j,2]  .+  coeff_b .* arr_1[i,j,3]
    #     )
    # end
    # diagonal parts
    res .+= kron(Diagonal(mat_1), I(M))
    res .+= coeff_a*I(M*N)
    # for i=1:M
    #     (
    #     res[1+N*(i-1):N*i, 1+N*(i-1):N*i]
    #         .+= diag  .+   mat_1[i] * I(M)
    #     )
    # end

    return Hermitian(res)
end

The next low-hanging fruit here is to move the allocation of res outside of make_matrix, so that it is allocated only once per thread, as I wrote earlier.
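
A rough sketch of what I mean, assuming a hypothetical in-place variant make_matrix!(res, ...) and a threaded loop over some index k_ind with names like result_vector and sub_calculation (the :static scheduler makes indexing buffers by threadid() safe):

buffers = [zeros(ComplexF64, M*N, M*N) for _ in 1:Threads.nthreads()]

Threads.@threads :static for k_ind in 1:201
    res = buffers[Threads.threadid()]
    fill!(res, 0)                                  # reuse the same buffer on every iteration
    make_matrix!(res, N, M, mat_1, arr_1, coeff)   # hypothetical in-place version of make_matrix
    result_vector[k_ind] = sub_calculation(eigvals!(Hermitian(res)))
end
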
Another improvement would be to use a better in-place addition, replacing the .+= with a custom addition function like:

function myadd!(dense::Matrix, sparse::SparseMatrixCSC)
    # add only the stored (nonzero) entries of the sparse matrix into the dense one
    for col in 1:size(sparse, 2)
        for r in nzrange(sparse, col)
            dense[rowvals(sparse)[r], col] += nonzeros(sparse)[r]
        end
    end
    return dense
end
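
Hypothetical usage inside make_matrix, forcing both kron arguments to be sparse so that the result is a SparseMatrixCSC and dispatches to myadd!:

# instead of res .+= @views kron(arr_1[:,:,1], mat_x):
myadd!(res, kron(sparse(arr_1[:, :, 1]), mat_x))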

Thank you.

For our final result we need all the eigenvalues of the given matrices, so it might not be useful to use iterative techniques or subroutines like zhegvx.

Actually, the arr_1[:,:,i] are passed from another function in the form of matrices, so rearranging them might increase the number of assignments.

I will try your last idea of calling BLAS subroutines directly. Thank you.

Thank you for the reply.

I tried several versions based on your advice:

  1. Original code: 13 minutes, 38 GB of allocations.
  2. Kronecker products: 10 minutes, 69 GB.
  3. Kronecker products + sparse arrays: 9 minutes, 69 GB.
  4. Sparse arrays only: 9 minutes, 22 GB.

Considering those experiments, I decided to use sparse arrays and not the Kronecker-product method. This reduces the time by roughly 30% and the memory allocations by roughly 40%.

When you say "without Kronecker products", do you mean that you build the sparse array directly from a loop? That would of course be more efficient, but also a bit more difficult, which is why I didn't propose it originally. If you post your current code again, I'll have another look and see whether we can make it even faster :slight_smile:
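
By that I mean collecting (row, column, value) triplets in a loop and calling sparse once at the end, roughly like this sketch (block_nonzeros is a made-up placeholder, not your code):

using SparseArrays

Is, Js, Vs = Int[], Int[], ComplexF64[]
for i = 1:M, j = 1:i
    for (r, c, v) in block_nonzeros(i, j)   # hypothetical: nonzero entries of block (i, j)
        push!(Is, N*(i-1) + r)              # global row index
        push!(Js, N*(j-1) + c)              # global column index
        push!(Vs, v)
    end
end
res = sparse(Is, Js, Vs, M*N, M*N)          # duplicate entries are summed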

Edit: Does M stay constant the whole time? If so, it would make sense to construct the matrices mat_plus, etc. only once and not on every call to make_matrix.

I meant that I don't compute the elements of res_mat with Kronecker products.

Certainly, M is fixed for all the iterations, but there is another argument called x, which influences mat_plus through the coeff argument. So I have to rebuild mat_plus on every iteration.

Below is my current code for make_matrix:

using LinearAlgebra, SparseArrays

# HELPER FUNCTIONS 
function mat_h_plus(size, coeff::Real)
    I = 1:size-1
    J = 2:size
    V = coeff*sqrt.(I)
    return sparse(I,J,V, size,size) # constructs a sparse size x size matrix
end


# MAIN FUNCTION
function make_matrix(
    N::Integer, # large integer (~1000)
    M::Integer, # small integer (~10)
    mat_1::Vector{ComplexF64}, # size M
    arr_1::Array{ComplexF64, 3}, # MxMx3 array
    coeff::Real # coefficients for mat_hs
    )
   
    res = zeros(ComplexF64, M*N, M*N)

    mat_plus = mat_h_plus(M, coeff)
    mat_minus = adjoint(mat_plus) # transpose conjugate of mat_plus

    mat_x = (mat_plus - mat_minus)/sqrt(2) # something like this
    mat_y = # similar to calculation above

    # assign values for off-diagonal parts
    for i = 1:M, j=1:i
         (
         res[1+N*(i-1):N*i, 1+N*(j-1):N*j] 
             = mat_x .* arr_1[i,j,1]  .+  mat_y .* arr_1[i,j,2]  .+  coeff_b .* arr_1[i,j,3]
         )
    end
    # diagonal parts
    for i=1:M
        (
         res[1+N*(i-1):N*i, 1+N*(i-1):N*i]
            .+= coef.*I(M)  .+   mat_1[i] * I(M)
        )
    end

    return Hermitian(res)
end

And this is main part:

# calculate needed values for x
function calc_function(M, N, x)
	kvalue_range = range(0, 0.5, 201)
	result_vector = zeros(201)

    mat_1, arr_1 = # diagonalization of size M(~10)
    coeff = coeff_func(x) # some arithmetics
    Threads.@threads for k_ind in 1:201
		kvalue = kvalue_range[k_ind]
		energy = eigvals!( make_matrix(N, M, mat_1, arr_1, mat_1, coeff)
			# we need different matrices for each kvalues
		result_vector[k_ind] = sub_calculation(energy)
    end

	result = integrate(result_vector)	
		# integrate uses the trapezoidal rule: just adding values
 	retun result
end

const dx = 0.01 # step for numerical differentiation
container = zeros(201)
for ind in eachindex(container)
    x = 0.1 * (ind - 1)
    Res_minus = calc_function(M, N, x - dx)
    Res_plus  = calc_function(M, N, x + dx)
    container[ind] = (Res_plus - Res_minus) / (2*dx)   # central difference
end

# some IOs
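
For completeness, a minimal trapezoidal-rule integrate over the uniform k-grid could look like the sketch below (not my actual implementation):

# Trapezoidal rule on a uniform grid; dk defaults to the spacing of range(0, 0.5, 201).
function integrate(values, dk = 0.5 / (length(values) - 1))
    return dk * (sum(values) - (values[1] + values[end]) / 2)
end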

Note though that in this code mat_plus and mat_minus, and thus also mat_x and mat_y, are directly proportional to coeff. So you could still build them once and then just rescale them. Also, you really should reuse the space for res in make_matrix, as I suggested earlier.
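
Concretely, something like this (a sketch, assuming coeff really is just an overall prefactor as in your posted mat_h_plus):

# Built once, outside make_matrix:
const mat_plus_unit = mat_h_plus(M, 1.0)
const mat_x_unit    = (mat_plus_unit - copy(adjoint(mat_plus_unit))) / sqrt(2)

# Inside make_matrix, per call:
mat_plus = coeff * mat_plus_unit
mat_x    = coeff * mat_x_unit      # and similarly for mat_y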

I have the strong feeling that the code you posted is not your true code (spelling mistakes, syntax errors, ...), so I won't make the effort of restructuring it, because I don't know whether those transformations would apply to your real code. If you post what your real code does here, then I'll gladly show you how to restructure it to gain more performance.

@abraemer

I'm sorry, but it is hard to provide a more detailed code example, since the real code involves information we have not published yet.

Thank you so much for all your advice. It helped me make the calculation more efficient by means other than parallelization, which was my original goal.
