# Parallelizing code that evaluates a function independently for different parameters and stores the results in an array

I am working on a problem where I have to evaluate a function at different values of a parameter called `t`. Each evaluation is time consuming and involves a lot of matrix multiplication and inversion. A rough outline of the function is given below. The key point is that it reads a few data files and initializes some parameter-dependent matrices.

```julia
using LinearAlgebra

function G(t::Float64, M::Matrix{ComplexF64}, Mp::Matrix{ComplexF64})
    sysize = size(M, 1)
    S = zeros(ComplexF64, sysize, sysize)
    Sp = zeros(ComplexF64, sysize, sysize)
    Mp = Mp * 10^-6
    M = M * 10^-6
    for i in range(1, sysize)
        S[i, i] = M[i, i] * sinh(t)
        Sp[i, i] = Mp[i, i] * sinh(im * t)
    end
    g = det(S * Sp * inv(S + Sp)) * tr(inv(Sp * S + Sp^2))
    sleep(0.25)

    ## The point here is that finding G(t) is time consuming because of matrix calculations
    return g
end
```

Now I need to find the value of the function for various values of `t` given `M` and `Mp` are constant matrices defined outside the function. For accomplishing this I have defined another function as below

```julia
function calc()
    t = range(start=-5, stop=5, length=100)
    y = zeros(Float64, 100)
    for i in range(1, 100)
        y[i] = real(G(t[i], M, Mp))
    end
    return y
end
```

My Aim - To parallelize the code using Distributed.jl. I know that one entry of the array `y` would not depend on the other entry. Thus I would like to let different workers simultaneously fill different cells of the array.

Initial try - I tried to break the for loop in `calc()` into 10 parts, with each part evaluated by one worker. I declared `y` as an `@everywhere` variable, but it did not work out: all the workers seem to be operating on their own copy of `y`. Is there any way to synchronize the array `y` among all workers after each worker has filled its portion of the array?

I am extremely new to parallel computing. I have tried to achieve the same using Threads.jl; the code ran successfully, but the speed-up was not significant enough, and it would not provide me with multi-CPU parallelization, which is my end goal. Any pointers towards what can be done are greatly appreciated.


Hey there! Before parallelizing, you should first make your sequential code more efficient. In particular, you should minimize intermediate allocations (see the performance tips in the Julia manual). Also try to use division instead of explicit `inv` calls. Once you've done that, you can think about parallelization, either with `pmap` from Distributed.jl or with Base.Threads. In any case you might need to limit the number of OpenBLAS threads, e.g. with `BLAS.set_num_threads(1)`, to avoid oversubscribing your CPU cores.
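To illustrate the `inv` point with a toy system (generic matrices, not the code from this thread): wherever you compute `inv(A) * b`, the backslash operator solves the same linear system via a factorization without ever forming the inverse, which is both cheaper and more accurate.

```julia
using LinearAlgebra

A = [2.0 1.0; 1.0 3.0]
b = [1.0, 2.0]

# Explicit inverse: allocates inv(A) as a full matrix, then multiplies
x1 = inv(A) * b

# Division: solves A * x = b directly via an LU factorization
x2 = A \ b

@assert x1 ≈ x2

# Before adding your own parallelism, pin BLAS to one thread per process:
BLAS.set_num_threads(1)
```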


Hi, thank you for your reply. What I provided as `G(t::Float64,M::Matrix{Float64},Mp::Matrix{Float64})` is just an example; my actual code does not initialize the `S` and `Sp` matrices inside the function. However, I do not understand your comment on not using `inv`. Could you please elaborate on how to use the alternative?
Also, I am completely new to high performance computing, so some pointers towards how to implement `pmap` from Distributed.jl would be heavily appreciated!

That's good. But you're allocating many intermediate arrays. For example, `Mp = Mp * 10^-6` allocates, and so do `S*Sp`, `Sp^2`, etc. You want to perform operations in place if possible, that is, reuse existing memory as much as possible.
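As a rough sketch of what I mean (toy matrices, not your actual code), `LinearAlgebra` provides in-place variants such as `rmul!` and `mul!` that write into existing memory instead of allocating a new matrix for each operation:

```julia
using LinearAlgebra

n = 4
A = rand(ComplexF64, n, n)
B = rand(ComplexF64, n, n)
C = similar(A)      # preallocate an output buffer once, reuse it every call

rmul!(A, 1e-6)      # scale A in place, instead of A = A * 10^-6
mul!(C, A, B)       # C = A * B, written into the existing buffer
@assert C ≈ A * B
```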

I tried this, just created some toy `M` and `Mp`:

```julia
using Distributed

function calc()
    t = range(start=-5, stop=5, length=100)
    N = 500
    M = rand(N, N) .+ im .* rand(N, N)
    Mp = rand(N, N) .+ im .* rand(N, N)
    # note: G must be defined with @everywhere so the workers can call it
    y = pmap(1:100) do i
        real(G(t[i], M, Mp))
    end
end
```

It scales fairly well to 16 CPUs on my computer. But there might be more opportunities for optimization. I don't use `Distributed` regularly.


Hi, thank you for your kind help. I have a small question though: what is the difference between using `pmap` and using the `@distributed` macro in this regard? Also, why are we not declaring `M` and `Mp` as `SharedArrays`? While using `@distributed` we have to do it, because otherwise `Mp`, being defined only on the master process, cannot be accessed by the other workers.

The `@distributed` macro does not fetch the result; it merely chops the loop up into parts which are executed on other machines/processes. Optionally, in `@distributed` you can specify a "reducer", i.e. a function to accumulate the loop values. E.g. like

```julia
@distributed (+) for i in 1:100
    i^2
end
```

This can also be used to construct an array:

```julia
@distributed hcat for i in 1:10
    [i, i^2]
end
```

You could have `M` and `Mp` as `SharedArray`s. This would avoid copying them around as part of the `pmap` (provided `SharedArrays` uses shared memory, which I guess it does, though the documentation doesn't say). This would not extend to multiple machines; that would require remote memory, which is available in `MPI.jl` in some fashion, I think. But MPI is a whole new world.
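For example, a minimal sketch of filling a `SharedArray` from a `@distributed` loop (with `sin` standing in for the expensive `G`, and a toy grid of 10 points); since all workers on one machine see the same shared memory, no synchronization step is needed afterwards:

```julia
using Distributed, SharedArrays

addprocs(2)  # workers on the same machine; SharedArrays needs shared memory

t = range(start=-5, stop=5, length=10)
y = SharedArray{Float64}(length(t))

# @sync waits until all workers have filled their chunk of y
@sync @distributed for i in eachindex(t)
    y[i] = sin(t[i])   # stand-in for real(G(t[i], M, Mp))
end

@assert Array(y) ≈ sin.(t)
```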

Anyway, I'd guess copying `M` and `Mp` around takes negligible time compared to multiplying them and computing determinants etc., but if they are very large you could run out of memory. The copying is only done once for each worker, as part of copying out the function (the body of the loop).

By the way, `pmap` can also distribute over `t` (instead of over `i`):

```julia
y = pmap(t) do ti
    real(G(ti, M, Mp))
end
```