I have a situation like the one below:
Julia
nrep = 3
Threads.@threads for i in 1:nrep
    run(`bash -c """cd $path && ./external_program"""`)
end
external_program.cpp
#pragma omp parallel for num_threads(10)
for (int i = 0; i < 10; i++) {
    // do stuff
}
Also, my CPU has 32 cores (64 hardware threads), and Julia is running with 32 threads.
Everything works fine, but when nrep*num_threads > 32 I get far more overhead than when I run the equivalent nested parallel loops entirely in Julia.
I’m not sure exactly what is going on in the background, but my guess is that the all-Julia case is composable while my setup with external processes is not.
Is that understanding correct? Is there a way to address this?
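
One thing that might be worth trying, as a rough sketch rather than a confirmed fix: give each external process a thread budget so that nrep times the per-process thread count stays at or below 32. The helpers `threads_per_run` and `budgeted_command` below are hypothetical names, not part of the original code. The sketch passes the budget through the standard OpenMP environment variable `OMP_THREAD_LIMIT`; whether that actually caps a region with an explicit `num_threads(10)` clause depends on the OpenMP runtime, so this is an assumption to verify. Dropping the clause and setting `OMP_NUM_THREADS` instead would be an alternative.

```julia
# Hypothetical helper (not from the original code): split the core count
# evenly across the runs that execute at the same time.
threads_per_run(ncores::Integer, nrep::Integer) = max(1, ncores ÷ nrep)

# Hypothetical helper: the same command, started in directory `dir`, with
# OMP_THREAD_LIMIT added on top of the inherited environment.
function budgeted_command(cmd::Cmd, dir::AbstractString, ncores::Integer, nrep::Integer)
    n = threads_per_run(ncores, nrep)
    return addenv(Cmd(cmd; dir=dir), "OMP_THREAD_LIMIT" => string(n))
end

ncores = 32    # physical cores, as stated above
nrep = 4       # 4 * 10 = 40 > 32 would oversubscribe without a cap
path = pwd()   # placeholder for the real directory

cmd = budgeted_command(`./external_program`, path, ncores, nrep)

# Launch as before, one external process per Julia thread:
# Threads.@threads for i in 1:nrep
#     run(cmd)
# end
```

With nrep = 3 the budget works out to 10 threads per process, which matches the hard-coded `num_threads(10)`, so this would only change behavior for larger nrep.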