I have a numerical calculation (market clearing in a heterogeneous agent overlapping generations model) which is “embarrassingly parallel” in the sense that I can split it into 10–30 parts that can run separately (the cohorts).
Using a single thread, all parts together run in about 0.2 s. Using 8 threads on my laptop (with ThreadTools.jl or ThreadsX.jl, same result), this increases to 5 s. This is on Julia 1.5.
So far I have not been able to make an MWE: the code is somewhat large, and small examples fail to replicate this. The computation is memory intensive with nonlocal memory access patterns (this is somewhat inevitable, despite best efforts to mitigate it).
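To make the comparison concrete, here is a minimal sketch of the pattern I am timing. It is not my actual code: `solve_cohort`, the buffer size, the stride, and the number of cohorts are all hypothetical stand-ins, it uses plain `Threads.@threads` instead of ThreadTools.jl or ThreadsX.jl, and it will not necessarily reproduce the slowdown. It only shows the serial and threaded versions side by side, with `@timed` reporting elapsed time, allocated bytes, and GC time for each.

```julia
using Base.Threads: @threads, nthreads

# Hypothetical stand-in for one cohort: allocates a fresh buffer on every
# call and reads it in a strided (non-contiguous) order.
function solve_cohort(n::Int)
    buf = collect(1.0:n)                # new allocation per call
    acc = 0.0
    for i in 1:n
        acc += buf[mod1(i * 7919, n)]   # 7919 is an arbitrary (assumed) stride
    end
    return acc
end

# Serial version: cohorts are solved one after another.
run_serial(sizes) = map(solve_cohort, sizes)

# Threaded version: each cohort writes to its own slot, so there is no
# shared mutable state between iterations.
function run_threaded(sizes)
    out = Vector{Float64}(undef, length(sizes))
    @threads for k in eachindex(sizes)
        out[k] = solve_cohort(sizes[k])
    end
    return out
end

sizes = fill(100_000, 20)               # 20 hypothetical cohorts

# Warm up both versions so that compilation is not included in the timings.
run_serial(sizes)
run_threaded(sizes)

serial = @timed run_serial(sizes)
threaded = @timed run_threaded(sizes)

println("threads:  ", nthreads())
println("serial:   ", serial.time, " s, ", serial.bytes, " bytes, GC ", serial.gctime, " s")
println("threaded: ", threaded.time, " s, ", threaded.bytes, " bytes, GC ", threaded.gctime, " s")
```

Julia has to be started with several threads for the second timing to mean anything, for example by setting `JULIA_NUM_THREADS=8` or passing `-t 8`.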
Any advice would be appreciated on:
- how to benchmark and debug this,
- whether worse performance using threads is something to be expected in these scenarios,
- whether going to a machine with more CPU cache and memory bandwidth could change these results.
I realize that without an MWE this is a vague problem, so I am grateful for vague shots in the dark on how to deal with this too (“just don’t use threads” is an acceptable answer, so that I don’t waste more time on this).