Global sensitivity analysis in a large parameter space - computational cost

Hi everyone,

I have been working in a large biochemical reaction network and I try to perform Global sensitivity analysis (GSA) (see my last post). I was wondering if there are any advice, or suggestions regarding the efficiency of GSA when the parameter space is large. For example, in my case, I have around 120 parameters. What would be appropriate sampling and would be the most efficient method for such a large of parameters?

Thank you for your time and effort!

I think a combination of using batching as described here Parallelized Morris and Sobol Sensitivity Analysis of an ODE · GlobalSensitivity.jl and the eFAST method should be ideal for such a case

eFAST with batching and setup the batches to be handled via GPU parallelism if possible.

I tried eFAST but I get this error:

DivideError: integer division error

The Sobol method does not give an error but it is slow.

eFAST gives the following error, while Sobol does not, but it is slow.

DivideError: integer division error

Can you share an MWE for us to look at with the eFAST error?

Yes, Sobol is robust but slow. That’s pretty much by definition of the method :sweat_smile:

Can I share it with you on the slack channel?

Yes please, post it in either sciml-bridged or diffeq-bridged channels

I’m highjacking this thread since we have a very similar problem and I don’t want to split the discussion up too much.

We run Sobol analysis on a large-ish ODE-model in parallel on a 24 core machine. The batch processing of the individual ODE solutions works great and scales wonderfully, but we found two issues/bottlenecks:

  1. Memory usage goes up massively. I assume that’s because all the results get stored in memory for the Sobol index calculation to work on them most efficiently. For the moment I manage to get the case in 128G or 256G RAM with ZRAM compression (which doesn’t seem to cost too much overhead interestingly).

  2. Only the ODE solvers run in parallel batch mode. On our models I found that thats about 80-90 percent of the workload, while 10-20% seems to be the Sobol index calculation. In practice that means that on a typical small problem on my laptop I have a 1:1 ratio between parallel and serial wall-clock time (toy test case just clocked up 13.6 CPU-minutes in 7 wall-clock minutes: 50 seconds wct parallel on 8 cores, and then another 7 minutes on a single thread. On the big machines, on a bigger problem, I had 1h wct on 24 cores and expect another 24h wct on a single core to get the estimated 48h CPU time (currently at 28h CPU time - so we’ll see tomorrow how good my estimates are and if it all scales as I naively expect).

Ad 2.: is there a way to run the Sobol analysis in parallel as well?

Ad 1.: is there a way to write the results object to disk and then have the Sobol analysis read that back in while it calculates the indices? I don’t know the algorithm, so don’t know if it is possible to sequentialise that enough to make that work. Or do I need to ask for a 1TB RAM machine?

Open an issue. It can be done, it’s not implemented. It could be mmaped.

That could be done. That would need an issue and some work in the package.

1 Like

Ad 1:

I tested this a bit more and it seems that the memory allocation and deallocation adds considerable overhead (the whole GSA process allocated around 200GB overall). So I have changed from solving a single Ensemble to solving chunks of the parameter space at a time and reassembling the output array afterwards (code is not yet cleaned up but does work).

# Define GSA solver function:

ensemble_chunked = function (p)

  global parameterset = p

  Np::Int64 = size(p,2)

  println("Running ", Np, " cases.")

  chunksize::Int64 = Np/chunks

  println("Using ", chunks, " chunk(s) of size ", chunksize, ".")


  out = zeros(3,Np)

  for k in 1:chunks
      offset = Int((k - 1) * Np/chunks)
      startindx = Int(offset + 1)
      endindx = Int(k * Np/chunks)

      println("Starting chunk ", k, ", from ", startindx, " to ", endindx, " with offset ", offset, ".")

      pchunk = p[:,startindx:endindx]

      prob_func(prob,i,repeat) = remake(prob; u0 = [LV_Vt0, LA_Vt0, pt0sas, qt0sas , pt0sat, qt0sat, pt0svn], p=pchunk[:,i])

      ensemble_prob = EnsembleProblem(prob,prob_func=prob_func)

      @time "    ODE Ensemble solved in " sol = solve(ensemble_prob,GPUTsit5(), reltol=1e-8, abstol=1e-8, EnsembleThreads();saveat = saveat,trajectories=chunksize)

      @time "    Results processed in " Threads.@threads for i in 1:chunksize
          out[1,i+offset] = whatever_measure1(sol[i])
          out[2,i+offset] = whatever_measure2(sol[i])
          out[3,i+offset] = whatever_measure2(sol[i])
      end
  end
  out
end

# Set up Sobol solution:

chunks:Int64 = 100

N = 20000

lb = [5.0, 1.0, 0.07, 1.75, 1.0, 0.21, 0.360, 0.0, 4.0, 1.0, 0.105, 0.175, 1.0, 0.0315, 0.063, 0.92, 0.0231, 0.042, 0.0, 0.056, 0.0021, 4.34e-5, 0.0, 1.12, 0.035, 0.00119, 0.35, 0.364, 0.0525, 0.0, 14.35]
ub = [5.0, 1.0, 0.13, 3.25, 1.0, 0.34, 0.585, 0.0, 4.0, 1.0, 0.165, 0.325, 1.0, 0.0585, 0.117, 0.92, 0.0429, 0.078, 0.0, 0.104, 0.0039, 8.06e-5, 0.0, 2.08, 0.065, 0.00221, 0.65, 0.676, 0.0975, 0.0, 26.65]
bounds = tuple.(lb,ub)
sampler = SobolSample()
A,B = QuasiMonteCarlo.generate_design_matrices(N, lb, ub, sampler)

# And call it:

Sobol_JS_C = gsa(circ_time_post,Sobol(),A,B,batch = true)

With this I went from 100h to 1h wall-clock-time on a 24-core node. Which is pretty close to estimated based on single ODE solver runs, so scales pretty well.

Ad 2:

I’m not sure if that’s required anymore. I had assumed that the long run-time was somehow related to the Sobol calculations, but it seems it dropped down to single-core for the memory management tasks - which strace confirmed.

With the chunked approach I get the Sobol indices calculated in a few seconds, which is negligibly small relative to the ODE solver runtime.

1 Like

Just tested this a bit more by reducing the number of trajectories so the whole GSA fits in less than 1/2 the memory.

2500 trajectories on 24 threads (12 cores),

SIngle chunk: 730 seconds for the Sobol analysis

100 chunks: 557 seconds for the same analysis

That’s for an ODAE system 8+1 equations, 31 parameters. Solution time for the baseline parameter set: 82 ms.

So that’s 33*2500 = 82500 runs of the ODE system, which should take 6765 seconds.

Perfect scaling on 24 threads would give 282 seconds.

I’ll see what happens for smaller chunks (just started a 500 chunk run**) and if I get convergence :slight_smile:

But so far it seems that even without memory over-commit (I switched all swap off for this test) many smaller Ensembles seem to scale better than one big one. At least in this case, where the solution object needs to keep hundreds of thousands of ODE solutions in memory.

I get 37% gc time on the full ensemble run vs 12.7% gc on the chunked run*.

* I’m not sure if I can trust the % calcs, though, since the compile times seem to be in CPU time vs wall-clock time: I get values like 1500% compile time on first runs :slight_smile:

** Edit: 500 chunks did only speed it up very slightly. gc time dropped to 10% and run-time to 568 seconds.

I also want to mention that faced the same problem regarding memory usage. I was running the analysis in a cluster but the job exited because of the large usage of memory. Should we open an issue?

Yes open an issue. It can get mmaped.

Thanks. I opened two different issues regarding Memory-mapping and parallelization.

Came here from the future to thank for this code example, it is a lifesaver. Tangentially related, I am having the following problem in a large parameter space (maybe I should open up another thread for this). Let’s say I have two parameters A and B. If I constrain the range of A between 0 and 10 and B between 0 and 5, the sensitivity to A turns out higher. If I do the inverse, sensitivity to B is higher. If I make both of them between 0 and 10, then the solutions to differential equation diverge. I tried putting NaNs to divergent solutions but this impairs the subsequent calculations of sensitivity analysis since the software is not optimized for dealing with NaN values. I can create a minimum working example if needed. Does anybody dealt with a similar problem? I am open for any suggestions.