Ad 1:
I tested this a bit more, and it seems that memory allocation and deallocation add considerable overhead (the whole GSA process allocated around 200 GB overall). So I changed from solving a single Ensemble to solving chunks of the parameter space at a time and reassembling the output array afterwards (the code is not yet cleaned up, but it works).
# Define GSA solver function:
ensemble_chunked = function (p)
    global parameterset = p
    Np::Int64 = size(p, 2)
    println("Running ", Np, " cases.")
    chunksize::Int64 = div(Np, chunks)   # assumes Np is divisible by chunks
    println("Using ", chunks, " chunk(s) of size ", chunksize, ".")
    out = zeros(3, Np)
    for k in 1:chunks
        offset = (k - 1) * chunksize
        startindx = offset + 1
        endindx = k * chunksize
        println("Starting chunk ", k, ", from ", startindx, " to ", endindx, " with offset ", offset, ".")
        # Solve one chunk of the parameter space as its own Ensemble:
        pchunk = p[:, startindx:endindx]
        prob_func(prob, i, repeat) = remake(prob; u0 = [LV_Vt0, LA_Vt0, pt0sas, qt0sas, pt0sat, qt0sat, pt0svn], p = pchunk[:, i])
        ensemble_prob = EnsembleProblem(prob, prob_func = prob_func)
        @time "ODE Ensemble solved in" sol = solve(ensemble_prob, GPUTsit5(), EnsembleThreads(); reltol = 1e-8, abstol = 1e-8, saveat = saveat, trajectories = chunksize)
        # Reassemble the output array from the chunk results:
        @time "Results processed in" Threads.@threads for i in 1:chunksize
            out[1, i + offset] = whatever_measure1(sol[i])
            out[2, i + offset] = whatever_measure2(sol[i])
            out[3, i + offset] = whatever_measure3(sol[i])
        end
    end
    out
end
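The index arithmetic assumes Np is divisible by chunks; a quick toy check that the chunk ranges tile 1:Np with no gaps or overlaps (a minimal sketch with made-up sizes, separate from the actual run) could look like:
# Toy check of the chunk index arithmetic (hypothetical sizes):
Nptest, chunkstest = 12, 3
cs = div(Nptest, chunkstest)
ranges = [(k - 1) * cs + 1 : k * cs for k in 1:chunkstest]
@assert reduce(vcat, collect.(ranges)) == collect(1:Nptest)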
# Set up Sobol solution:
chunks::Int64 = 100
N = 20000
lb = [5.0, 1.0, 0.07, 1.75, 1.0, 0.21, 0.360, 0.0, 4.0, 1.0, 0.105, 0.175, 1.0, 0.0315, 0.063, 0.92, 0.0231, 0.042, 0.0, 0.056, 0.0021, 4.34e-5, 0.0, 1.12, 0.035, 0.00119, 0.35, 0.364, 0.0525, 0.0, 14.35]
ub = [5.0, 1.0, 0.13, 3.25, 1.0, 0.34, 0.585, 0.0, 4.0, 1.0, 0.165, 0.325, 1.0, 0.0585, 0.117, 0.92, 0.0429, 0.078, 0.0, 0.104, 0.0039, 8.06e-5, 0.0, 2.08, 0.065, 0.00221, 0.65, 0.676, 0.0975, 0.0, 26.65]
bounds = tuple.(lb,ub)
sampler = SobolSample()
A,B = QuasiMonteCarlo.generate_design_matrices(N, lb, ub, sampler)
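# Sanity check of my assumption (not from the original post) that QuasiMonteCarlo.jl
# stores one sample per column, i.e. each design matrix is length(lb) × N:
@assert size(A) == (length(lb), N) && size(B) == (length(lb), N)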
# And call it:
Sobol_JS_C = gsa(ensemble_chunked, Sobol(), A, B, batch = true)
With this I went from 100 h to 1 h wall-clock time on a 24-core node, which is pretty close to the estimate based on single ODE solver runs, so it scales well.
Ad 2:
I’m not sure that’s required anymore. I had assumed the long run time was somehow related to the Sobol calculations, but it seems the process dropped down to a single core for the memory-management tasks, which strace confirmed.
With the chunked approach the Sobol indices are calculated in a few seconds, which is negligible relative to the ODE solver runtime.
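For reference, reading the indices off the result object might then look like this (a minimal sketch; it assumes GlobalSensitivity.jl's Sobol result exposes first-order indices as S1 and total-order indices as ST, with one row per output measure):
# Inspect first-order (S1) and total-order (ST) indices for the 3 output measures:
for m in 1:3
    println("Measure ", m, ": S1 = ", Sobol_JS_C.S1[m, :])
    println("Measure ", m, ": ST = ", Sobol_JS_C.ST[m, :])
end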