Parallelized calls to Optim.optimize use the same number of threads as a single threaded call

To investigate how sample size affects uncertainty in parameter inference, I do something like:

```julia
Θ_example = something  # Θ is a vector of the parameters I'm trying to infer
Θvec = zeros(numSamples, length(Θ_example))

for i in 1:numSamples
    s = generateSample()
    Θ = inferParameters(s)
    Θvec[i, :] = Θ  # store Θ as row i
end
```

where inferParameters calls Optim.optimize, currently with NelderMead().
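For context, inferParameters is roughly of this shape (the linear model and least-squares loss here are toy placeholders, not my actual setup):

```julia
using Optim  # Optim.jl must be installed

x = collect(0.0:0.1:1.0)

# Toy sketch: fit Θ = (a, b) of the model y = a .+ b .* x
# to a sample s by least squares, using Nelder-Mead.
function inferParameters(s)
    loss(Θ) = sum(abs2, (Θ[1] .+ Θ[2] .* x) .- s)
    res = Optim.optimize(loss, zeros(2), NelderMead())
    return Optim.minimizer(res)
end

s = 1.0 .+ 2.0 .* x       # noiseless "sample" generated with a = 1, b = 2
Θhat = inferParameters(s)
```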

When I run this on my MacBook Pro (M1 Pro chip) with numThreads=8, I can see that only one core is in use while ~22 threads are running. I see this both in the terminal with top and in the Mac’s Activity Monitor.

Naively, I’d like to do something like this instead:

```julia
Θ_example = something  # Θ is a vector of the parameters I'm trying to infer
Θvec = zeros(numSamples, length(Θ_example))

Threads.@threads for i in 1:numSamples
    s = generateSample()
    Θ = inferParameters(s)
    Θvec[i, :] = Θ
end
```

When I parallelize in this way I do see all my cores in use, but I’m still only using ~22 threads spread over the 8 cores, and my computation is significantly slower than the single-threaded version.

Finally, my question:
Is this behavior expected, and is there a better way to utilize the multiple cores? I would have expected / hoped to have 20-odd threads running on each of the 8 cores rather than the same total number of threads as in the single-core case.

Despite its name, Threads.@threads does not spawn new OS threads; it spawns Julia Tasks, which run on the OS threads Julia was started with (e.g. via -t 8). If your code is using a BLAS library or similar under the hood, those threads are distinct from the threads Julia runs with. So in essence, your solver is likely already internally threaded, and adding parallelism on top is not going to meaningfully improve performance, but rather increase contention.
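If you want one optimization running per core, a common pattern is to pin BLAS to a single thread so its worker pool doesn’t contend with Julia’s task-level parallelism. A minimal sketch (with a dummy matrix workload standing in for inferParameters), assuming Julia was started with e.g. -t 8:

```julia
using LinearAlgebra

# Limit BLAS to one thread so its ~20 worker threads don't
# fight with Julia's own threads for the cores.
BLAS.set_num_threads(1)

n = 16
results = zeros(n)
Threads.@threads for i in 1:n
    # each iteration runs as a Task on one of the Julia threads;
    # the BLAS-backed matrix multiply is now single-threaded per task
    A = randn(50, 50)
    results[i] = sum(abs2, A * A')
end
```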


But in the first implementation, in which I don’t use Threads.@threads, all 22 of the threads (presumably from BLAS, as you say) are on a single core. At least I think so… Here’s a snapshot of top:

[screenshot of top output]
Is there a way to parallelize in which each optimization with its own set of 22ish threads lives on a different core?

Edit: I think I misinterpreted the output of top. man top says that the threads column shows “Number of threads (total/running)”, not total / number of cores as I thought. I’m not sure I believe that, though, because it still displays 22/1 as in the screenshot but displays 23/8 when I run the version with Threads.@threads.

You can try Dagger.jl
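Dagger lets you express each inference as an independent task and leave the scheduling to it. A minimal sketch (with a dummy computation standing in for inferParameters):

```julia
using Dagger  # Dagger.jl must be installed

# Spawn each sample's work as an independent Dagger task;
# Dagger schedules them across available Julia threads/workers.
tasks = [Dagger.@spawn sum(abs2, randn(100)) for _ in 1:8]
results = fetch.(tasks)
```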

So @threads indeed is utilizing more “threads”.

Note that Julia has some defaults if you don’t supply the -p and -t arguments.
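For reference, a quick way to check what you actually got; the Julia thread count is fixed at startup via -t / --threads or the JULIA_NUM_THREADS environment variable, and defaults to 1 if neither is set:

```julia
# Thread count is set at startup and cannot be changed afterwards.
println("Julia threads:    ", Threads.nthreads())
println("Hardware threads: ", Sys.CPU_THREADS)
```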

I’ll have a look, thanks.