Dagger not fully utilizing CPU cores

I have a time-consuming simulation that is run over a set of signal-to-noise ratios and is thus trivial to parallelize. I am currently using Dagger.jl to start 24 processes (I am on a 24-core Threadripper system with 256 GB of RAM) and am running the simulations like below:

simnum = 1000                         # simulation runs per SNR point
numsnrs = 10
snrs = LinRange(0.0, 40.0, numsnrs)
dtasks = map(enumerate(snrs)) do (n, snr)
    newsp = @set simpar.SNR = snr     # @set (Setfield.jl/Accessors.jl): copy of simpar with SNR replaced
    Dagger.@spawn sim_fun(sim, newsp, simnum)   # one Dagger task per SNR point
end
errs = fetch.(dtasks)                 # block until all tasks finish and collect the results
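
For context, the 24 processes are started beforehand, roughly like this (a sketch, not my exact setup; sim_setup.jl is a placeholder for the file defining sim_fun, sim and simpar):

using Distributed
addprocs(24)                        # one worker process per core
@everywhere using Dagger            # Dagger schedules onto these Distributed workers
@everywhere include("sim_setup.jl") # placeholder: definitions of sim_fun, sim, simpar, etc.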

What happens is the following: if I set numsnrs to 24, it will only use, e.g., 16 cores, then six, and then two. If I set numsnrs to 12, it will first run on ten cores and then two. For numsnrs = 10 I see it run on eight cores, then two.

All in all there are plenty of resources on the system. RAM usage is below 20%, and each process that is actually running sits at 99-100% CPU and no more, meaning there is no underlying parallelization in, e.g., FFTs or other low-level computations.

Anyone have any ideas on what may be going on here? My simulations are taking at least twice as long as they should, which is not ideal, to say the least.


Have you tried GitHub - JuliaFolds2/OhMyThreads.jl: Simple multithreading in Julia?

Thanks for the suggestion, but from what I can see it is threads-only, and I need to use processes to avoid thrashing the garbage collector. (I could probably rewrite parts of my simulation to use preallocated arrays, but for now I just use processes and avoid the worst of the problem.)
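
For illustration, the preallocation rewrite I have in mind would look something like this (a rough sketch only; sim_fun! is a hypothetical in-place variant of sim_fun, and the buffer type and size are made up):

using Setfield  # provides @set

# Allocate scratch storage once per SNR point and reuse it for every run,
# instead of letting each run allocate fresh arrays.
function run_snr_point(simpar, snr, simnum)
    buf = Vector{ComplexF64}(undef, 4096)   # hypothetical scratch buffer
    errs = zeros(simnum)
    newsp = @set simpar.SNR = snr
    for k in 1:simnum
        errs[k] = sim_fun!(buf, newsp)      # reuses buf rather than allocating
    end
    return errs
end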

We have a GC with parallel marking and concurrent sweeping now: Multi-Threading · The Julia Language. Set --gcthreads to something like 4,1.

Give it a try.
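
For example, when launching the threaded run (the script name is just a placeholder):

julia --threads=24 --gcthreads=4,1 script.jl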

I tried using OhMyThreads, and while it is great and something I will keep at hand for later use, the result was as expected, with 30-40% of the time spent in garbage collection, versus <10% for a single-threaded simulation.

Unfortunately there is not much I can do about the memory usage - most of it comes from the arrays returned by DSP.jl's xcorr and resample, neither of which does in-place calculations.

I will dig a little deeper into what Dagger is doing for now, as multiprocessing seems to be the way to go for this particular problem.

So I found a solution that works, in my case at least. One can force a task to be executed on a specific worker by using scopes. At least when the number of jobs is smaller than the number of processes, this hack works:

snrs = LinRange(0.0, 30.0, 12)
simnum = 1000
dtasks = map(enumerate(snrs)) do (n, snr)
    newsp = @set simpar.SNR = snr
    scp = Dagger.scope(worker = n + 1)           # pin task n to worker n+1 (worker 1 is the main process)
    Dagger.@spawn scope=scp sim_doa(newsp, simnum)
end
@time errs = fetch.(dtasks)

I guess an even better approach would be to increase the granularity of the simulation, e.g. by splitting each SNR point into multiple smaller batches and then combining them, as sketched below.
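
Something like the following, perhaps (just a sketch: the chunk size is arbitrary, and it assumes sim_doa returns a scalar error that can be averaged across chunks):

chunksize = 100                           # arbitrary; smaller chunks => more, finer-grained tasks
nchunks = simnum ÷ chunksize              # tasks per SNR point
dtasks = map(Iterators.product(snrs, 1:nchunks)) do (snr, _)
    newsp = @set simpar.SNR = snr
    Dagger.@spawn sim_doa(newsp, chunksize)   # many small tasks keep all workers busy
end
chunk_errs = fetch.(dtasks)               # length(snrs) × nchunks matrix of per-chunk errors
errs = vec(sum(chunk_errs, dims=2)) ./ nchunks   # combine the chunks for each SNR point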