# How can I speed up a batch of numerical integrations from a shared array?

I am trying to carry out numerical integrations on a large number of initial conditions, storing them in a shared array:

```julia
D = 6
N = 1000
T = 150
state = SharedArray{Float64}((D, N, T))
state[:, :, 1] = randn(D, N);

@everywhere include("my_integration_funcs.jl")

@sync @parallel for t in 2:T
    state[:, :, t] = timestep(state[:, :, t-1])
end
```

where `my_integration_funcs.jl` looks like:

```julia
function timestep(init)
    dt = 0.0002
    T = 1
    for t in 0.0:dt:T
        init = init + dt * CdV_eq(init)  # was CdV_eq(arr); `arr` is undefined here
    end
    return init
end

# a big horrible nonlinear equation:
function CdV_eq(arr)
    if size(arr, 1) != 6  # was size(arr) == D, which compares a tuple to an Int
        println("first dimension should be length 6")
        return -1
    end
    x1 = arr[1, :]
    x2 = arr[2, :]
    x3 = arr[3, :]
    x4 = arr[4, :]
    x5 = arr[5, :]
    x6 = arr[6, :]

    b = 0.5
    gamma = 0.2
    beta = 1.25
    C = 0.1
    x1f = 0.95
    x4f = -0.76095

    ẋ1 = 7 .* x3 .- C .* (x1 .- x1f)
    ẋ4 = 7 .* x6 .- C .* (x4 .- x4f) .+ (x2 .* x6 .- x3 .* x5)
    ẋ2 = -0.2 .* x1 .* x3 .+ 1.25 .* x3 .- C .* x2 .- 3.2 .* x4 .* x6
    ẋ3 =  0.2 .* x1 .* x2 .- 1.25 .* x2 .- C .* x3 .+ 3.2 .* x4 .* x5 .- 0.1 .* x1
    ẋ5 = -0.6 .* x1 .* x6 .+ 3 .* x6 .- C .* x5 .- 3.1 .* x4 .* x3
    ẋ6 =  0.6 .* x1 .* x5 .- 3 .* x5 .- C .* x6 .+ 3.2 .* x4 .* x2 .- 0.1 .* x4  # was 3.2..*, a typo
    return transpose(hcat(ẋ1, ẋ2, ẋ3, ẋ4, ẋ5, ẋ6))
end
```

When I run this code with different numbers of processors, I see very limited speedup: going from 2 to 4 processors only brings the runtime down to about 80%. This is an embarrassingly parallel problem, so I feel I should be able to do much better. How can I improve this?

(I am currently using Julia v0.6, which may be relevant.)

I wouldn’t use a SharedArray. I would run each of the calculations entirely independently and collect the data at the end. A SharedArray shares its data between the processes, which is exactly what you don’t want in an embarrassingly parallel application.
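For example, the independent-calculation approach could be sketched with `pmap` over the initial conditions (this reuses the `timestep` function from the question; the `inits`/`finals` names and the one-column-per-task granularity are my own choices, not from the original post):

```julia
# Load the integration functions on every worker process,
# assuming my_integration_funcs.jl defines timestep as in the question.
@everywhere include("my_integration_funcs.jl")

D = 6
N = 1000
inits = [randn(D, 1) for n in 1:N]  # one initial condition per task

# pmap farms the independent integrations out to the workers and
# collects the results; no shared state is touched during the run.
finals = pmap(timestep, inits)

result = hcat(finals...)  # D × N matrix of final states
```

Each task here is a full integration of one trajectory, so the workers only communicate at the very end instead of synchronizing through shared memory at every time step.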

See http://docs.juliadiffeq.org/latest/features/monte_carlo.html for the suggested DifferentialEquations.jl setup. Also, if it’s not on a cluster, I would suggest multithreading instead of multiprocessing to reduce the overhead. If each integration is long enough you should get almost perfect scaling from either; if it’s not, multithreading has a lower threshold for that than multiprocessing (but with the limitation of only working on a single machine; to mix them, use `parallel_type = :split_threads` in the Monte Carlo solve, for example).
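A minimal sketch of that setup, assuming the `MonteCarloProblem` API from the linked docs and the `f(du, u, p, t)` argument order of DifferentialEquations.jl v4 (the `cdv!` right-hand side below restates the question’s equations for a single trajectory; the function and variable names are illustrative):

```julia
using DifferentialEquations

# In-place ODE right-hand side for one 6-dimensional trajectory.
function cdv!(du, u, p, t)
    C, x1f, x4f = 0.1, 0.95, -0.76095
    x1, x2, x3, x4, x5, x6 = u
    du[1] = 7x3 - C*(x1 - x1f)
    du[2] = -0.2x1*x3 + 1.25x3 - C*x2 - 3.2x4*x6
    du[3] =  0.2x1*x2 - 1.25x2 - C*x3 + 3.2x4*x5 - 0.1x1
    du[4] = 7x6 - C*(x4 - x4f) + (x2*x6 - x3*x5)
    du[5] = -0.6x1*x6 + 3x6 - C*x5 - 3.1x4*x3
    du[6] =  0.6x1*x5 - 3x5 - C*x6 + 3.2x4*x2 - 0.1x4
    return nothing
end

prob = ODEProblem(cdv!, randn(6), (0.0, 1.0))

# prob_func draws a fresh initial condition for each trajectory.
prob_func = (prob, i, repeat) -> ODEProblem(prob.f, randn(6), prob.tspan)
monte_prob = MonteCarloProblem(prob, prob_func = prob_func)

# num_monte independent trajectories, parallelized over processes
# (:pmap) or over threads (:threads / :split_threads).
sim = solve(monte_prob, Tsit5(), num_monte = 1000, parallel_type = :pmap)
```

Besides the parallelism, an adaptive solver like `Tsit5()` should also be much faster per trajectory than the hand-rolled fixed-step Euler loop in the question.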
