Parallelizing a code when we are evaluating a function for different parameters independently and storing in an array

I am working on a problem where we have to evaluate a function at different values of a parameter called t, each evaluation is time consuming and involves a lot of matrix-multiplication and inversion. The rough outline of the function is given below. The key point is that it reads a few data files and initializes some parameter dependent matrices.

    function G(t::Float64,M::Matrix{ComplexF64},Mp::Matrix{ComplexF64})
        sysize = size(M)[1]
        S = zeros(ComplexF64,sysize,sysize)
        Sp = zeros(ComplexF64,sysize,sysize)
        Mp = Mp * 10^-6
        M = M * 10^-6
        for i in range(1,sysize)
            S[i,i] = M[i,i]*sinh(t)
            Sp[i,i] = Mp[i,i]*sinh(im*t)
        g = det(S*Sp*inv(S+Sp))*tr(inv(Sp*S+Sp^2))

        ## The point here is ----- that finding G(t) is time consuming because of matrix calculations
        return g

Now I need to find the value of the function for various values of t given M and Mp are constant matrices defined outside the function. For accomplishing this I have defined another function as below

function calc()
        t = range(start=-5,stop=5,length=100)
        y = zeros(Float64,100)
        for i in range(1,100)
            y[i] = real(G(t[i],M,Mp))

My Aim - To parallelize the code using Distributed.jl. I know that one entry of the array y would not depend on the other entry. Thus I would like to let different workers simultaneously fill different cells of the array.

Initial Try - I tried to break up the for loop in calc() into 10 parts. Evaluating one part by one worker. I declared y as an @everywhere variable. But it did not work out as all the workers seem to be operating on their own copy of y. Is there any way to synchronize the array y among all workers after each worker has filled their portion of the array ?

I am extremely new to parallel computing. I have tried to achieve the same using Threads.jl , the code ran successfully but the speed-up was not significant enough, besides it would not provide me with multi-cpu parallelization which is my end goal. Any pointers towards what can be done is greatly appreciated.

1 Like

Hey there! Before parallelization, you should first make your sequential code more efficient. In particular you should minimize the intermediate allocations (see the performance tips in the Julia manual). Also try to use division instead of explicit inv calls. Once you’ve done that you can think about parallelization, either with pmap from Distributed.jl or with Base.Threads. in any case you might need to limit the number of OpenBLAS threads, e.g. with BLAS.set_num_threads(1), to avoid oversubscription of your CPU cores.


Hi, thank you for your reply. What I provided as G(t::Float64,M::Matrix{Float64},Mp::Matrix{Float64}) is just an example. My actual code does not have initialization of S and Sp matrices inside the function. However, I do not understand your comment on not using inv . Could you please elaborate on how to use the alternative.
Also I am completely new to high performance computing, so some pointers towards how to implement pmap from Distributed.jl would be heavily appreciated !

That’s good. But you’re allocating many intermediate arrays. For example Mp = Mp * 10^-6 allocates, so does S*Sp and Sp^2 etc. You want to perform operations in-place if possible, that is reuse existing memory as much as possible.

I tried this, just created some toy M and Mp:

function calc()
    t = range(start=-5,stop=5,length=100)
    N = 500
    M = rand(N,N) .+ im.*rand(N,N)
    Mp = rand(N,N) .+ im.*rand(N,N)
    y = pmap(1:100) do i
        real(G(t[i], M, Mp))

It scales fairly well to 16 cpus on my computer. But there might be more opportunities for optimizations. I don’t use Distributed regularly.

1 Like

Hi, thank you for your kind help. I have a small question though, what is the difference between using the pmap and using @distributed macro in this regard ? Also, why are we not declaring M and Mp as SharedArrays ? While using @distributed we have to do it because otherwise Mp being only defined on the master processor cannot be accessed by the other workers.

The @distributed macro does not fetch the result, it merely chops the loop up in parts which are executed on other machines/processes. Optionally in @distributed you can specify a “reducer”, i.e. a function to accumulate the loop values. E.g. like

@distributed (+) for i in 1:100

This can also be used to construct an array:

@distributed hcat for i in 1:10
    [i, i^2]

You could have M and Mp as SharedArray. This would avoid copying them around as part of the pmap (provided SharedArrays uses shared memory, which I guess it does, it doesn’t say). This would not extend to multiple machines, that would require remote memory, which is available in MPI.jl in some fashion, I think. But MPI is a whole new world.

Anyway, I’d guess copying M and Mp around takes negligible time compared to multiplying them and computing determinants etc, but if they are very large you could run out of memory. The copying is only done once for each worker, as part of copying out the function (the body of the loop).

By the way, the pmap can also distribute over t (instead of over i):

y = pmap(t) do ti
    real(G(ti, M, Mp))