Hi Julia community!

I need help on speeding up a function with parallelization. Here’s a MWE.

```
using Distributed; n_cores = 3; addprocs(n_cores)
@everywhere using SharedArrays, BenchmarkTools
M = 10000; N = 10000; k = 2; mu = rand(N, 1)
U1 = SharedArray{Float32}(rand(M, N))
U2 = SharedArray{Float32}(rand(M, N))
idv_d = SharedArray{Float32}(N, M)
d = SharedArray{Float32}(M, 1)
function get_d!(U1, U2, mu, k, d, idv_d)
fill!(idv_d, 0.0)
@inbounds @sync @distributed for j in 1:N
u_1, arg_1 = findmax(@view U1[:, j])
u_2, arg_2 = findmax(@view U2[:, j])
if u_1 >= u_2
idv_d[j, arg_1] = mu[j]
else
idv_d[j, arg_2] = mu[j] ./ k
end
end
@inbounds @sync @distributed for j in 1:M
d[j] = sum(idv_d[:, j])
end
end
@btime get_d!(U1, U2, mu, k, d, idv_d)
```

`M`

and `N`

will be two large numbers (on the order of 10^6), and I intend to run this on clusters so `n_cores`

will be about 50-100. The function calculates aggregate demand given two utility matrices (`U1`

, `U2`

). The only output that I’m interested in is `d`

not `idv_d`

, but I don’t see a fast way to get `d`

with parallelization without calculating `idv_d`

first.

I have tried to optimize the code for a while and this is the best I can come up with. I noticed that looping over columns helps a lot. However, I wonder whether I’m leaving some performance on the table. Are there any ways to speed this up? I have no concerns about readability. Would really appreciate any help!