How to do double parallelization with CMAEvolutionStrategy.jl?

Hi,
I’m wondering whether it’s possible to combine parallelization inside the cost function with the parallel evaluation across the CMA-ES population (CMAEvolutionStrategy.jl).
Here’s my situation:
I’m optimizing model parameters using CMA-ES. Inside my cost function, the model performs 17 independent site-level runs (each using the same parameter vector), and then aggregates their results into a single scalar loss. So, there are two potential levels of parallelism:

  1. Across population members (already supported via parallel_evaluation = true), and
  2. Within each cost function evaluation, across the 17 model runs.

Currently, the CMAEvolutionStrategy.jl package parallelizes across the population. However, it would be much faster if I could exploit both levels of parallelism simultaneously, e.g. multithreading (or distributed) across population members and, within each, further parallelizing over my 17 runs.
Here’s a simplified example illustrating the idea:

function single_model(x, input)
    n = length(x)
    return sum(100 * (x[2i-1]^2 - x[2i])^2 + (x[2i-1] - 1)^2 for i in 1:div(n, 2)) + sum(input)
end

function my_model(x, total_input)
    results = Vector{Float64}(undef, length(total_input))
    for (i, input) in enumerate(total_input)
        y = single_model(x, input)
        results[i] = y
    end
    return sum(results)
end

A possible Threads.@threads parallelization of this model could look like this:

function my_model(x, total_input)
    results = Vector{Float64}(undef, length(total_input))
    # Threads.@threads needs an indexable range, so iterate over indices
    # instead of enumerate(total_input):
    Threads.@threads for i in eachindex(total_input)
        results[i] = single_model(x, total_input[i])
    end
    return sum(results)
end

Right now I am using one of these two ways to choose the population size:

population_size = Threads.nthreads()
parameter_vector = Vector{Float64}(undef, population_size)

or

population_size = 4 + floor(Int, 3 * log(length(parameter_vector)))
parameter_vector = Vector{Float64}(undef, population_size)

and feed parameter_vector into the CMA-ES optimizer following the README of GitHub - jbrea/CMAEvolutionStrategy.jl:

  If parallel_evaluation = true, the objective function f receives matrices of n rows (n = length(x0)) and popsize columns and should return a vector of length popsize.
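To make that contract concrete, here is a hedged sketch of what a combined two-level objective could look like. SITE_INPUTS is a hypothetical placeholder for the 17 site-level datasets, and the matrix-in, vector-out shape follows the parallel_evaluation description quoted from the README:

```julia
# Hypothetical stand-in for the 17 site-level datasets from the question.
const SITE_INPUTS = [rand(3) for _ in 1:17]

# Toy single-site model from the question (even-dimensional Rosenbrock plus input).
function single_model(x, input)
    n = length(x)
    return sum(100 * (x[2i-1]^2 - x[2i])^2 + (x[2i-1] - 1)^2 for i in 1:div(n, 2)) + sum(input)
end

# Inner level: threaded over the independent site runs.
function my_model(x, total_input)
    results = Vector{Float64}(undef, length(total_input))
    Threads.@threads for i in eachindex(total_input)
        results[i] = single_model(x, total_input[i])
    end
    return sum(results)
end

# Outer level: objective for parallel_evaluation = true; receives an
# n × popsize matrix and returns a vector of popsize losses.
function population_objective(X::AbstractMatrix)
    losses = Vector{Float64}(undef, size(X, 2))
    Threads.@threads for j in axes(X, 2)
        losses[j] = my_model(view(X, :, j), SITE_INPUTS)
    end
    return losses
end
```

Since both levels only spawn tasks, nesting them is safe; the scheduler decides how they map onto the available threads.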
You can see that the for loop in my_model (over total_input) represents the 17 independent runs. I’m wondering if there’s a clean way to parallelize both this inner loop and the CMA-ES population evaluations at the same time, without the two interfering with each other.
Any guidance or example would be greatly appreciated!
Thanks a lot!

I don’t see any threading-related code in your post, so I can’t give specific advice. However, Julia is rather unique in that it features cooperative multithreading to make threading composable. See more information in this blog post:

In essence, I think you can just multithread the inner computation as well and Julia should handle that in an efficient manner. Usually, from my experience, it is best to parallelize at the highest level though (assuming the resulting tasks take roughly equal time), so I am not sure why you hope for significant gains in your scenario.
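To illustrate the composability point, here is a small self-contained sketch (toy functions, not from the thread) where an outer Threads.@threads loop calls a function that itself uses Threads.@threads; Julia's scheduler interleaves the resulting tasks without oversubscribing threads:

```julia
# Inner threaded computation: sum of squares of a vector.
function inner_sum(v)
    partial = Vector{Float64}(undef, length(v))
    Threads.@threads for i in eachindex(v)
        partial[i] = v[i]^2
    end
    return sum(partial)
end

# Outer threaded loop calling the inner threaded function.
function outer(vs)
    out = Vector{Float64}(undef, length(vs))
    Threads.@threads for j in eachindex(vs)
        out[j] = inner_sum(vs[j])  # nested threaded call: tasks compose
    end
    return out
end
```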

I just edited the post to include an example of how I would make the parallelization inside my_model.

Regarding “Usually, from my experience it is best to parallelize on the highest level though (assuming the resulting tasks take roughly equal time), so I am not sure why you hope for significant gains in your scenario.”: this is because there are two kinds of repeated runs whose total time could be reduced:

  1. The for loop inside my_model, because I run exactly the same single-model code, just with different inputs.
  2. The CMA-ES population. CMA-ES decides the optimization direction from statistics of the population (it perturbs the parameter vector x and then runs my_model several times).

Right now I am able to do the CMA-ES population parallelization for single_model. For my_model (the one that loops over multiple single_model runs), I don’t know how to do it…

Please let me know if I am clear…

Or, put simply: imagine that I want to optimize a sophisticated model which supports parallelization like MPI, but I don’t know how to combine the outer population parallelization with the inner MPI parallelization…

Somehow I am still a bit confused. Let me try to rephrase: You have

  1. This CMA-ES population thing (sorry I am not familiar with it). I think this is some form of optimizer, so it will evaluate some objective/loss function many times.
  2. The objective/loss function. This is the function you called my_model that does these 17 independent calculations and then combines them into one.

From the Readme of the package you linked I got:

  If parallel_evaluation = true, the objective function f receives matrices of n rows (n =
  length(x0)) and popsize columns and should return a vector of length popsize. To use
  multi-threaded parallel evaluation of the objective function, set multi_threading = true
  and start julia with multiple threads (c.f. julia manual for the multi-threading
  setup).

So I would try first to set multi_threading = true (I am not quite sure whether you need to set parallel_evaluation = true as well). This parallelizes the outer layer. Check whether this uses the available resources (i.e. fully loads your CPU cores). Note that you need to check whether Julia was started with sufficient threads (e.g. Threads.nthreads() should equal the number of CPU cores). If not, start Julia with julia -t <num_threads>.
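Following the README quoted above, a minimal call sketch could look like this (the 0.5 initial step size and the maxfevals budget are illustrative choices, not values from the thread):

```julia
using CMAEvolutionStrategy

# Toy objective (even-dimensional Rosenbrock, like the question's single_model
# without the site input).
rosenbrock(x) = sum(100 * (x[2i-1]^2 - x[2i])^2 + (x[2i-1] - 1)^2
                    for i in 1:div(length(x), 2))

# minimize(f, x0, sigma0; kwargs...); with multi_threading = true each
# population member is evaluated in its own task.
result = minimize(rosenbrock, zeros(4), 0.5;
                  multi_threading = true,
                  maxfevals = 20_000)

xbest(result)  # best parameter vector found
fbest(result)  # best loss found
```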

If the previous, i.e. the outer parallelization, does not exhaust your resources, then I would try to also parallelize the inner calculation, i.e. my_model, basically in the way you showed with Threads.@threads. Since Julia’s multi-threading model is cooperative, the scheduler can interleave the inner tasks with the outer ones. This should increase CPU utilization. If you see that there is potential here, then maybe it can be optimized further, e.g. by varying how many tasks are used in my_model, but I would expect this to have rather weak effects overall.

Maybe one question: Is it worth using multithreading for my_model? How long does a call to my_model and single_model take typically? If single_model takes less than ~20μs, then we need to be a bit careful with applying the parallelization, because task creation/scheduling has an overhead of ~10μs IIRC. This just means we need to make sure that we do not chunk the 17 model runs too finely, but we can discuss that if it becomes necessary.
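If the per-run cost were indeed tiny, one way to avoid over-fine chunking is to group the 17 runs into a few batches so each task amortizes the spawn overhead. A sketch reusing the toy single_model from the question; the nchunks keyword is an illustrative knob, not part of any API:

```julia
# Toy single-site model from the question.
function single_model(x, input)
    n = length(x)
    return sum(100 * (x[2i-1]^2 - x[2i])^2 + (x[2i-1] - 1)^2 for i in 1:div(n, 2)) + sum(input)
end

# Chunked threading: split the run indices into nchunks batches and spawn one
# task per batch, so task overhead is paid per batch instead of per run.
function my_model_chunked(x, total_input; nchunks = Threads.nthreads())
    results = Vector{Float64}(undef, length(total_input))
    chunksize = cld(length(total_input), nchunks)
    @sync for chunk in Iterators.partition(eachindex(total_input), chunksize)
        Threads.@spawn for i in chunk
            results[i] = single_model(x, total_input[i])
        end
    end
    return sum(results)
end
```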

Thanks for your reply! @abraemer

Yes, your rephrasing is correct!

For the question “Is it worth using multithreading for my_model?” the answer is yes. my_model runs 17 independent single_model calls, each of which takes about 20 seconds (best case), so without parallelization a single my_model call would take about 17 × 20 = 340 seconds…