Parallelizing a code when we are evaluating a function for different parameters independently and storing in an array

Uranium238 · December 12, 2023, 5:47pm

I am working on a problem where we have to evaluate a function at different values of a parameter called t, each evaluation is time consuming and involves a lot of matrix-multiplication and inversion. The rough outline of the function is given below. The key point is that it reads a few data files and initializes some parameter dependent matrices.

    function G(t::Float64,M::Matrix{ComplexF64},Mp::Matrix{ComplexF64})
        sysize = size(M)[1]
        S = zeros(ComplexF64,sysize,sysize)
        Sp = zeros(ComplexF64,sysize,sysize)
        Mp = Mp * 10^-6
        M = M * 10^-6
        for i in range(1,sysize)
            S[i,i] = M[i,i]*sinh(t)
            Sp[i,i] = Mp[i,i]*sinh(im*t)
        end
        g = det(S*Sp*inv(S+Sp))*tr(inv(Sp*S+Sp^2))
        sleep(0.25)

        ## The point here is ----- that finding G(t) is time consuming because of matrix calculations
        return g
    end

Now I need to find the value of the function for various values of t given M and Mp are constant matrices defined outside the function. For accomplishing this I have defined another function as below

function calc()
        t = range(start=-5,stop=5,length=100)
        y = zeros(Float64,100)
        for i in range(1,100)
            y[i] = real(G(t[i],M,Mp))
        end
    end

My Aim - To parallelize the code using Distributed.jl. I know that one entry of the array y would not depend on the other entry. Thus I would like to let different workers simultaneously fill different cells of the array.

Initial Try - I tried to break up the for loop in calc() into 10 parts. Evaluating one part by one worker. I declared y as an @everywhere variable. But it did not work out as all the workers seem to be operating on their own copy of y. Is there any way to synchronize the array y among all workers after each worker has filled their portion of the array ?

I am extremely new to parallel computing. I have tried to achieve the same using Threads.jl , the code ran successfully but the speed-up was not significant enough, besides it would not provide me with multi-cpu parallelization which is my end goal. Any pointers towards what can be done is greatly appreciated.

carstenbauer · December 13, 2023, 5:54am

Hey there! Before parallelization, you should first make your sequential code more efficient. In particular you should minimize the intermediate allocations (see the performance tips in the Julia manual). Also try to use division instead of explicit inv calls. Once you’ve done that you can think about parallelization, either with pmap from Distributed.jl or with Base.Threads. in any case you might need to limit the number of OpenBLAS threads, e.g. with BLAS.set_num_threads(1), to avoid oversubscription of your CPU cores.

Uranium238 · December 13, 2023, 6:32am

Hi, thank you for your reply. What I provided as G(t::Float64,M::Matrix{Float64},Mp::Matrix{Float64}) is just an example. My actual code does not have initialization of S and Sp matrices inside the function. However, I do not understand your comment on not using inv . Could you please elaborate on how to use the alternative.
Also I am completely new to high performance computing, so some pointers towards how to implement pmap from Distributed.jl would be heavily appreciated !

carstenbauer · December 13, 2023, 8:50am

That’s good. But you’re allocating many intermediate arrays. For example Mp = Mp * 10^-6 allocates, so does S*Sp and Sp^2 etc. You want to perform operations in-place if possible, that is reuse existing memory as much as possible.

sgaure · December 13, 2023, 12:51pm

I tried this, just created some toy M and Mp:

function calc()
    t = range(start=-5,stop=5,length=100)
    N = 500
    M = rand(N,N) .+ im.*rand(N,N)
    Mp = rand(N,N) .+ im.*rand(N,N)
    y = pmap(1:100) do i
        real(G(t[i], M, Mp))
    end
end

It scales fairly well to 16 cpus on my computer. But there might be more opportunities for optimizations. I don’t use Distributed regularly.

Uranium238 · December 14, 2023, 7:33am

Hi, thank you for your kind help. I have a small question though, what is the difference between using the pmap and using @distributed macro in this regard ? Also, why are we not declaring M and Mp as SharedArrays ? While using @distributed we have to do it because otherwise Mp being only defined on the master processor cannot be accessed by the other workers.

sgaure · December 14, 2023, 9:06am

The @distributed macro does not fetch the result, it merely chops the loop up in parts which are executed on other machines/processes. Optionally in @distributed you can specify a “reducer”, i.e. a function to accumulate the loop values. E.g. like

@distributed (+) for i in 1:100
    i^2
end

This can also be used to construct an array:

@distributed hcat for i in 1:10
    [i, i^2]
end

You could have M and Mp as SharedArray. This would avoid copying them around as part of the pmap (provided SharedArrays uses shared memory, which I guess it does, it doesn’t say). This would not extend to multiple machines, that would require remote memory, which is available in MPI.jl in some fashion, I think. But MPI is a whole new world.

Anyway, I’d guess copying M and Mp around takes negligible time compared to multiplying them and computing determinants etc, but if they are very large you could run out of memory. The copying is only done once for each worker, as part of copying out the function (the body of the loop).

By the way, the pmap can also distribute over t (instead of over i):

y = pmap(t) do ti
    real(G(ti, M, Mp))
end

Topic		Replies	Views
Help with parallel computing for a simple loop with a large function New to Julia	4	1650	October 25, 2019
Probable data race condition causing problems when trying to parallelize a loop used to populate an array Performance distributed	14	193	August 4, 2024
For loop in function and multiplication of larger matrices, slow speed in parallel Performance performance , parallel , loops	3	1301	November 22, 2019
Parallelizing a for-loop with a matrix Performance	2	348	June 2, 2021
Trying to write a parallel for loop in Julia New to Julia	2	3179	June 6, 2020

Parallelizing a code when we are evaluating a function for different parameters independently and storing in an array

Related topics