I'm seeing a strange problem. I'm running Monte Carlo simulations, so for now I start multiple independent Julia processes from bash, each of which gathers statistics.
When I run 4 processes in parallel, each of them executes about 4 times slower than when I start only a single process.
The processes use BLAS.gemv! for dense matrix-vector products. Profiling in both cases shows that the time spent in BLAS.gemv! increases significantly when running multiple processes, so my guess is that this is the primary problem. But I can't understand why it happens.
Each process calls BLAS.set_num_threads(1) at startup, so oversubscription from too many BLAS threads should not be the issue.
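For reference, here is a minimal sketch of a standalone timing that could be run in one process and then in four at once to compare the two cases. The matrix size n and the repetition count nreps are placeholder assumptions, not the values from my actual simulation.

```julia
using LinearAlgebra

# Restrict BLAS to a single thread, as each process does at startup.
BLAS.set_num_threads(1)
@show BLAS.get_num_threads()   # confirm the setting took effect

# Placeholder problem size (assumption); the real matrices may differ.
n = 2000
A = rand(n, n)
x = rand(n)
y = zeros(n)

# Warm-up call so that compilation time is not included in the measurement.
BLAS.gemv!('N', 1.0, A, x, 0.0, y)

# Time repeated dense matrix-vector products: y = 1.0 * A * x + 0.0 * y.
nreps = 200                    # placeholder repetition count (assumption)
t = @elapsed for _ in 1:nreps
    BLAS.gemv!('N', 1.0, A, x, 0.0, y)
end
println("mean gemv! time: ", t / nreps, " s")
```

Running this script alone and then as four simultaneous copies should show whether the per-call gemv! time itself grows with the number of processes.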
Does anyone have ideas about this?