Hi, I need help pre-allocating some variables inside a for loop to reduce allocations in multi-threaded code. I create several local arrays inside the loop for intermediate calculations. Here's an MWE with pre-allocation, parallelised using Distributed. How do I make this multi-threaded without running into data races and without allocating the temporaries on every iteration?
using Distributed, SharedArrays

function test_fun(N,N_l)
    r_c_vector = rand(N,3) #input
    F_r_c_vector = rand(N,3) #input
    R_star = 1 #input
    l_vector = rand(N_l,3) #input
    u_Real = SharedArray{Float64}(N,3) #output
    #Pre-allocation (each worker process receives its own copy of these buffers)
    dx_temp_1_full = zeros(N)
    dx_temp_2_full = similar(dx_temp_1_full)
    dx_temp_3_full = similar(dx_temp_1_full)
    dx_temp_full = similar(dx_temp_1_full)
    @sync @distributed for i=1:N
        #Float64 accumulators, so their type does not change inside the loop
        temp_u_prime_1 = 0.0
        temp_u_prime_2 = 0.0
        temp_u_prime_3 = 0.0
        for l in axes(l_vector,1)
            #Displacement from point i to every point, shifted by l_vector[l,:]
            dx_temp_1_full .= r_c_vector[i,1].-r_c_vector[:,1] .- l_vector[l,1]
            dx_temp_2_full .= r_c_vector[i,2].-r_c_vector[:,2] .- l_vector[l,2]
            dx_temp_3_full .= r_c_vector[i,3].-r_c_vector[:,3] .- l_vector[l,3]
            dx_temp_full .= sqrt.(dx_temp_1_full.^2 .+ dx_temp_2_full.^2 .+ dx_temp_3_full.^2)
            #Only take points within R_star
            cutoff_ids = findall(dx_temp_full .< R_star)
            if isempty(cutoff_ids)
                continue
            end
            dx_temp = dx_temp_full[cutoff_ids]
            dx_temp_1 = dx_temp_1_full[cutoff_ids]
            dx_temp_2 = dx_temp_2_full[cutoff_ids]
            dx_temp_3 = dx_temp_3_full[cutoff_ids]
            A11 = exp.(dx_temp.^2 .+ dx_temp_1.^2) #this is more complex
            A12 = exp.(dx_temp_1.*dx_temp_2)
            A13 = exp.(dx_temp_1.*dx_temp_3)
            A22 = exp.(dx_temp.^2 .+ dx_temp_2.^2)
            A23 = exp.(dx_temp_2.*dx_temp_3)
            A33 = exp.(dx_temp.^2 .+ dx_temp_3.^2)
            temp_u_prime_1 = temp_u_prime_1 + sum(A11.*F_r_c_vector[cutoff_ids,1] .+ A12.*F_r_c_vector[cutoff_ids,2] .+ A13.*F_r_c_vector[cutoff_ids,3])
            temp_u_prime_2 = temp_u_prime_2 + sum(A12.*F_r_c_vector[cutoff_ids,1] .+ A22.*F_r_c_vector[cutoff_ids,2] .+ A23.*F_r_c_vector[cutoff_ids,3])
            temp_u_prime_3 = temp_u_prime_3 + sum(A13.*F_r_c_vector[cutoff_ids,1] .+ A23.*F_r_c_vector[cutoff_ids,2] .+ A33.*F_r_c_vector[cutoff_ids,3])
        end
        u_Real[i,:] = [temp_u_prime_1,temp_u_prime_2,temp_u_prime_3]
    end
    return u_Real
end
For multi-threading, I had to drop the pre-allocation because the threads would all write to the same buffers (a data race), but without it the threaded version is slightly slower than the distributed one. I have read that multi-threading is supposed to be faster than distributed computing, so I want to make this switch.
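To make the question concrete, here is a minimal sketch of the pattern I think I need, not a verified solution: split the rows into chunks, spawn one task per chunk, and let each task allocate its own scratch buffers once. The names `rows_kernel!`, `test_fun_threads` and the `ntasks` keyword are hypothetical, and the inputs are passed in as arguments instead of being generated inside the function. I also replaced `findall` and the indexed copies with a scalar loop over the cutoff, which is my own assumption about how to avoid the remaining temporaries; it adds the terms in a different order than `sum`, so results may differ from the MWE in the last few digits.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical kernel: fills u[i, :] for every i in `rows`.
# dx1, dx2, dx3, dx are scratch buffers owned by the calling task only.
function rows_kernel!(u, rows, r, F, l_vector, R_star, dx1, dx2, dx3, dx)
    for i in rows
        u1 = 0.0; u2 = 0.0; u3 = 0.0
        for l in axes(l_vector, 1)
            # Views avoid copying the columns; the broadcasts write in place.
            dx1 .= r[i, 1] .- view(r, :, 1) .- l_vector[l, 1]
            dx2 .= r[i, 2] .- view(r, :, 2) .- l_vector[l, 2]
            dx3 .= r[i, 3] .- view(r, :, 3) .- l_vector[l, 3]
            dx  .= sqrt.(dx1 .^ 2 .+ dx2 .^ 2 .+ dx3 .^ 2)
            for j in eachindex(dx)
                dx[j] < R_star || continue   # only points within R_star
                A11 = exp(dx[j]^2 + dx1[j]^2)
                A12 = exp(dx1[j] * dx2[j])
                A13 = exp(dx1[j] * dx3[j])
                A22 = exp(dx[j]^2 + dx2[j]^2)
                A23 = exp(dx2[j] * dx3[j])
                A33 = exp(dx[j]^2 + dx3[j]^2)
                u1 += A11 * F[j, 1] + A12 * F[j, 2] + A13 * F[j, 3]
                u2 += A12 * F[j, 1] + A22 * F[j, 2] + A23 * F[j, 3]
                u3 += A13 * F[j, 1] + A23 * F[j, 2] + A33 * F[j, 3]
            end
        end
        # Each row i belongs to exactly one chunk, so no two tasks write the same entry.
        u[i, 1] = u1; u[i, 2] = u2; u[i, 3] = u3
    end
    return nothing
end

# Hypothetical driver: one task per chunk of rows, one set of buffers per task.
function test_fun_threads(r, F, l_vector, R_star; ntasks = nthreads())
    N = size(r, 1)
    u = zeros(N, 3)
    chunk = max(1, cld(N, ntasks))
    @sync for rows in Iterators.partition(1:N, chunk)
        @spawn begin
            # Allocated once per task, not once per loop iteration.
            dx1 = zeros(N); dx2 = zeros(N); dx3 = zeros(N); dx = zeros(N)
            rows_kernel!(u, rows, r, F, l_vector, R_star, dx1, dx2, dx3, dx)
        end
    end
    return u
end
```

I would call it as `u = test_fun_threads(rand(N,3), rand(N,3), rand(N_l,3), 1.0)` after starting Julia with several threads, for example `julia --threads 4`. The buffers are created inside each spawned task and never indexed by thread id, which is why I expect this to be free of data races, but I have not benchmarked it against the distributed version.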
PS: I believe I have made the serial code as fast as I can, but if you have any suggestions, I'd be grateful for them.