Dagger not fully utilizing CPU cores

I have a time-consuming simulation that is run over a set of signal-to-noise ratios and is thus trivial to parallelize. I am currently using Dagger.jl to start 24 processes (I am on a 24-core Threadripper system with 256 GB of RAM) and am running the simulations like below:

simnum = 1000                         # simulation runs per SNR point
numsnrs = 10
snrs = LinRange(0.0, 40.0, numsnrs)
dtasks = map(enumerate(snrs)) do (n, snr)
    newsp = @set simpar.SNR = snr     # @set (Setfield.jl/Accessors.jl): copy of simpar with SNR replaced
    Dagger.@spawn sim_fun(sim, newsp, simnum)   # one Dagger task per SNR point
end
errs = fetch.(dtasks)                 # block until all tasks finish and collect the results
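
For context, the 24 processes are started beforehand, roughly like this (a sketch, not my exact setup; sim_setup.jl is a placeholder for the file defining sim_fun, sim and simpar):

using Distributed
addprocs(24)                        # one worker process per core
@everywhere using Dagger            # Dagger schedules onto these Distributed workers
@everywhere include("sim_setup.jl") # placeholder: definitions of sim_fun, sim, simpar, etc.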

What happens is the following: if I set numsnrs to 24, it will only use, e.g., 16 cores, then six, and then two. If I set numsnrs to 12, it will first run on ten cores and then two. For numsnrs = 10 I see it run on eight cores, then two.

All in all there are plenty of resources on the system. RAM usage is below 20%, and each process that is actually running sits at 99-100% CPU and no more, meaning there is no underlying parallelization in, e.g., FFTs or other low-level computations.

Anyone have any ideas on what may be going on here? My simulations are taking at least twice as long as they should, which is not ideal, to say the least.


Have you tried GitHub - JuliaFolds2/OhMyThreads.jl: Simple multithreading in Julia?

Thanks for the suggestion, but from what I can see it is threads-only, and I need to use processes to avoid thrashing the garbage collector. (I could probably rewrite parts of my simulation to use preallocated arrays, but for now I just use processes and avoid the worst of the problem.)
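
For illustration, the preallocation rewrite I have in mind would look something like this (a rough sketch only; sim_fun! is a hypothetical in-place variant of sim_fun, and the buffer type and size are made up):

using Setfield  # provides @set

# Allocate scratch storage once per SNR point and reuse it for every run,
# instead of letting each run allocate fresh arrays.
function run_snr_point(simpar, snr, simnum)
    buf = Vector{ComplexF64}(undef, 4096)   # hypothetical scratch buffer
    errs = zeros(simnum)
    newsp = @set simpar.SNR = snr
    for k in 1:simnum
        errs[k] = sim_fun!(buf, newsp)      # reuses buf rather than allocating
    end
    return errs
end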

We have a GC with parallel marking and concurrent sweeping now: Multi-Threading · The Julia Language. Set --gcthreads to something like 4,1.

Give it a try.
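
For example, when launching the threaded run (the script name is just a placeholder):

julia --threads=24 --gcthreads=4,1 script.jl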

I tried using OhMyThreads, and while it is great and something I will keep at hand for later use, the result was as expected, with 30-40% of the time spent in garbage collection, versus <10% for a single-threaded simulation.

Unfortunately there is not much I can do about the memory usage - most of it comes from the arrays returned by DSP.jl's xcorr and resample, neither of which does in-place calculations.

I will dig a little deeper into what Dagger is doing for now, as multiprocessing seems to be the way to go for this particular problem.

So I found a solution that works, in my case at least. One can force a task to be executed on a specific worker by using scopes. At least when the number of jobs is smaller than the number of processes, this hack works:

snrs = LinRange(0.0, 30.0, 12)
simnum = 1000
dtasks = map(enumerate(snrs)) do (n, snr)
    newsp = @set simpar.SNR = snr
    scp = Dagger.scope(worker = n + 1)           # pin task n to worker n+1 (worker 1 is the main process)
    Dagger.@spawn scope=scp sim_doa(newsp, simnum)
end
@time errs = fetch.(dtasks)

I guess an even better approach would be to increase the granularity of the simulation, e.g. by splitting each SNR point into multiple smaller batches and then combining them, as sketched below.
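
Something like the following, perhaps (just a sketch: the chunk size is arbitrary, and it assumes sim_doa returns a scalar error that can be averaged across chunks):

chunksize = 100                           # arbitrary; smaller chunks => more, finer-grained tasks
nchunks = simnum ÷ chunksize              # tasks per SNR point
dtasks = map(Iterators.product(snrs, 1:nchunks)) do (snr, _)
    newsp = @set simpar.SNR = snr
    Dagger.@spawn sim_doa(newsp, chunksize)   # many small tasks keep all workers busy
end
chunk_errs = fetch.(dtasks)               # length(snrs) × nchunks matrix of per-chunk errors
errs = vec(sum(chunk_errs, dims=2)) ./ nchunks   # combine the chunks for each SNR point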