Running several Julia engines

I am not sure I am using the correct terms/language to describe the problem, but I will try.

I want to run several Julia processes on one PC (Monte Carlo simulations). When I do it in Matlab, I simply start several labs, and the loss in computation speed in moving from one lab to 8 labs is about 30% (approximately, naked eye, based on some tic-tocs, though no two Markov chains are identical in speed).
When I do the same with Julia, the loss in speed in moving from 1 to 8 engines is more substantial, I would say 60% or more, wiping out nearly the whole gain of moving from Matlab to Julia.
I am not using any fancy external dependencies; the code is very simple, using Julia 1.2 plus several standard packages like Random, DelimitedFiles, etc.

So the question is: maybe there are some settings (in Atom?), language features, etc. which would reduce this loss?

If relevant, the CPU is an AMD 3950X.

PS. Yes, there is a similar thread, Slow down when running several parallel julia processes which use BLAS (MWE is provided), but I understand very little there and am not sure I have the same cause, as I know nothing of the BLAS.gemv! mentioned there.

This might be due to the use of some linear algebra routines. If so, you need to lower the number of OpenBLAS threads. Each process runs OpenBLAS routines separately if you use any BLAS or LinearAlgebra method, and each of them wants to use all CPU cores via OpenBLAS threads. They then compete for computational resources, causing more context switching in the operating system, which slows down the overall computation. Try BLAS.set_num_threads(n) before the MC simulation code, where n = (# of CPU cores) / (# of Julia processes) = 16/8 = 2 (for the AMD 3950X).
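A minimal sketch of what that looks like (assuming 8 Julia processes on the 16-core 3950X; adjust n to your setup):

```julia
using LinearAlgebra

# Limit this process to 2 OpenBLAS threads so that 8 such processes
# together request at most 16 cores (16 cores / 8 processes = 2)
BLAS.set_num_threads(2)

# ... Monte Carlo simulation code goes here ...
```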

Thank you, I am investigating this. I will eventually report what I discover about how this setting affects the speed.

Were you able to solve the problem?

I am the author of the thread Slow down when running several parallel julia processes which use BLAS (MWE is provided).
In that example, each process was doing many matrix-vector products. The problem was that all the matrices combined could not fit into the cache. As a result, each time a process tried to access matrix elements, it caused a cache reload.

Eventually, I dealt with the problem simply by making the matrix shared between processes (using SharedArrays from the stdlib).
Do you start the Julia processes independently? It may be more beneficial to use the Distributed stdlib.
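For example, a rough sketch of launching all workers from a single session with Distributed (the chain body here is just a placeholder, not your actual simulation):

```julia
using Distributed
addprocs(8)                  # start 8 worker processes from one Julia session

@everywhere using Random

# Run 8 independent chains, one per worker
results = pmap(1:8) do chain_id
    rng = MersenneTwister(chain_id)   # independent seed per chain
    sum(randn(rng, 10^6))             # stand-in for the real MC computation
end
```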

Dear Grigory

It is interesting. How do you know that they do not fit into the cache? I do have matrices, of course, though I am not sure they are large enough to cause the problem. I was simply looking at the Task Manager, where it says that all Atom-related processes take, say, 20% of memory (I have 64 RAM), so I thought this was nothing. I will see if I can do shared arrays.

So I do not think I resolved the issue. I used that command for LinearAlgebra, but I am not sure it changed anything. I simply ended up running 4 MATLAB labs and 4 Julia processes on this machine, because if I increase the number of Julia processes it slows down substantially, and if I experiment (killing MATLAB once again and starting 8 Julias with different settings) I waste hours. 4 Julia processes are slower than one (given the 4 to 8 MATLAB labs running in the background, which is hardly noticeable with 64 RAM), but still a huge gain relative to MATLAB. Once the Julia chains are close to finishing, I swap the chains, so overall the whole thing is faster. It is the only big job of this kind that I have (with a computationally intensive and slowly converging object inside), so I only need it once (so far). But if I need to do it again, I will really need to do something about this.

Do you mean 64 GB?

I am certainly not a specialist, but I will try to explain what I understand about the subject of caches.
Processor caches are a separate entity from RAM.
The cache is much faster than RAM, but its size is much smaller. When the processor accesses some point in memory which is not currently in the cache, it loads the whole neighbourhood of that memory point into the cache.

In reality, there are several levels of caches. However, the important thing is that the last-level cache (the last before RAM) is shared between all the cores of the processor. Your AMD 3950X should have 64 MB of last-level cache. (If you use Linux, you can check the output of the lscpu command.)

You could estimate the total size of all the matrices from the different processes and compare it with the size of the cache.
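For example (the matrix size below is made up, just to show the arithmetic):

```julia
# One 2000x2000 Float64 matrix takes 2000 * 2000 * 8 bytes ≈ 30.5 MiB
A = randn(2000, 2000)
sizeof(A) / 2^20        # ≈ 30.5 MiB for a single matrix
8 * sizeof(A) / 2^20    # ≈ 244 MiB across 8 processes, far beyond a 64 MB cache
```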

If you have some matrices which are common to all the processes, you should probably make them SharedArrays. Also, check the docs on Parallel Computing, especially the part on multi-core or distributed processing.
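A rough sketch of sharing one common matrix across workers (the size and the per-chain computation are placeholders):

```julia
using Distributed, SharedArrays
addprocs(8)
@everywhere using SharedArrays

# One copy of the common matrix lives in shared memory on this machine;
# every worker reads the same data instead of holding its own copy
A = SharedArray{Float64}(2000, 2000)
A .= randn(2000, 2000)

totals = pmap(1:8) do chain_id
    sum(A[:, chain_id])   # placeholder for the per-chain computation
end
```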

Sorry, I am blond and clearly dyslexic. Yes, 64 GB. Thanks a lot for the shared arrays suggestion; I will see if I can indeed do something here.