Parallel Good Practice

Hi,

I have quite a naive question regarding parallelism in Julia.

Let’s say I have a machine with 4 CPUs. I want to fully use my machine and I can easily use 4 workers in parallel for my problem. The manual says:
" Let’s try this out. Starting with julia -p n provides n worker processes on the local machine. Generally it makes sense for n to equal the number of CPU threads (logical cores) on the machine"

So, if I follow the manual, I would start Julia with julia -p 4.
I would have access to 4 workers (workers 2 to 5).

@sync for p in workers()
  @async @fetchfrom p my_function(...)
end

But then, what happens to process 1? Does it stay idle while workers 2 to 5 are working?

Would it be better to start Julia with julia -p 3, and to use all four processes?

@sync for p = 1:nprocs()
  @async @fetchfrom p my_function(...)
end

What are the pros of using the first option over the second one?

Thanks a lot for your help.

2 Likes

I don’t think there is a correct answer to this. It depends! :smile:
If you are not at all memory constrained, then it may make sense to request 4 CPUs (or even more!). I don’t know the details well enough, but my impression is that one logical core can execute code for more than one worker process. So if process 1 only has to do a negligible amount of work, it will run on one of your 4 CPUs, alongside the workload from my_function. Again: the real constraint is often memory (each process will in general allocate memory), so I often end up running far fewer processes than I have cores. You just try out what works best (and on which machine, if you have access to several).

Why do you use the @async pattern and not just a pmap?
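With pmap it would look something like this (just a sketch; the placeholder workload and the job count are made up):

using Distributed
addprocs(4)  # 4 worker processes

# Placeholder workload; stands in for the real my_function
@everywhere my_function(i) = sum(abs2, rand(10^6)) + i

# 16 jobs, load-balanced across the 4 workers
results = pmap(my_function, 1:16)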

hth

1 Like

Hi,

Thanks for your answer.

I am not using pmap because it resulted in weird behavior for my problem. I was starting 16 jobs on 16 workers, and some workers would take much longer than others (even though each task takes approximately the same time when run serially). When using @sync for with @async, this problem disappeared.

So overall, there is no rationale for using only “workers” to do the calculations and having process 1 act as a coordinator? It seems like many Julia functions assume this particular structure (for instance, pmap uses “workers”, but not process 1, when other processes are available).

Just as a clarification: I can perfectly well use Julia with 5 processes on a machine with 4 CPUs, without any multithreading involved?

Many thanks.

Yes.
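For instance (a quick sketch):

using Distributed
addprocs(4)   # 1 master + 4 workers = 5 processes on a 4-CPU machine
nprocs()      # 5 - the OS simply time-slices the processes across the CPUs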

Just checking, do you have 16 cores?
If you have only 4 cores (as in your example), why not use 16 jobs on 4 workers?

1 Like

Hi,

When I was using 16 jobs on 16 workers, I was working on a machine with 16 vCPUs. The “16 jobs - 16 workers” is not related to the example I gave above, in which I am considering a machine with 4 CPUs (my local machine).

2 Likes

@floswald You said that one logical core can execute code for more than one worker process.
That is indeed correct - Intel calls it Hyper-Threading. A physical CPU core appears as two logical cores to the OS, since it has two, or maybe four (Xeon Phi), sets of registers and pipelines - one for each process.

So, as you say, if one process stalls then another process may usefully run on that physical CPU core, keeping up overall efficiency.
The physical core can switch rapidly between the register sets.

I must say that HT has traditionally been seen to bring no improvement for HPC-type codes - it has advantages when processes stall, e.g. on a desktop when waiting for keypresses.
Indeed HT may be enabled or disabled in the BIOS and many HPC systems have HT disabled.

I apologise, as I am using the term process loosely here.
Also, I am not sure register is the technically correct term - I mean the program counter and stack.

@johnh, does it mean that when the manual says:
“Generally it makes sense for n to equal the number of CPU threads (logical cores) on the machine” (Parallel Computing · The Julia Language), it implicitly assumes that the user has access to a machine with hyper-threading enabled (Julia will start with n+1 processes)?

1 Like

That is a good question! Note the use of the term logical cores.
Logical cores are the hyperthreaded cores. So there may be 2 physical cores on a machine; each core has 2 hyperthreads, so there are 4 logical cores.
The OS is not aware of the difference between logical and physical cores.
However, it is rather easy to tell which is which using lscpu on Linux, or cat /proc/cpuinfo and looking for the siblings of a core. Windows - err, I don’t know.

Note - there may be more than 2 hyperthreads per physical core; it depends on the CPU architecture.

One other thing, if you are interested in hyperthreading and benchmarking: as I said above, you can turn hyperthreading on and off in the BIOS. You can do the equivalent without rebooting the machine by switching CPU cores online and offline - I can do this in Linux, and I am sure it is possible in macOS and Windows.
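You can also check this from inside Julia (a sketch; it assumes the Hwloc.jl package is installed, and Sys.CPU_THREADS is the Julia >= 0.7 name):

using Hwloc   # ] add Hwloc - wraps the hwloc library

# Logical CPUs as the OS exposes them (hyperthreads included)
println("logical cores:  ", Sys.CPU_THREADS)

# Physical cores as detected by hwloc
println("physical cores: ", Hwloc.num_physical_cores())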

For a consumer-CPU example (from “Intel Core i3, i5 and i7: Which Do You Need?” - Tech Advisor, https://www.techadvisor.co.uk › what-is-t…):

“A Core i3 with Hyper-Threading can process two threads per core, which means a total of four threads can run simultaneously. The current Core i5 range of desktop processors doesn’t have Hyper-Threading, so it can also only process four threads. i7 processors do have it, so they can process eight threads at once.”

1 Like

Hi,

I have been experimenting with a simple example quite a bit.

Overall, the “master-slave” paradigm in which process 1 is just a coordinator seems to make more sense, as illustrated by the following toy example.

In the code below, I minimize the Rosenbrock function several times in parallel.

The following code takes 4.018 s when starting Julia with 5 processes (julia -p 4):

using Distributed   # needed on Julia >= 0.7 for @everywhere, workers(), etc.
using BenchmarkTools
@everywhere using Optim

# 2-D Rosenbrock (not used below; kept for reference)
@everywhere f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2

# N-dimensional Rosenbrock (note the .- broadcast, required on Julia >= 1.0)
@everywhere function rosenbrock(x)
  return sum( 100 * ( x[2:end] - x[1:end-1].^2 ).^2 + ( x[1:end-1] .- 1 ).^2 )
end

@everywhere function my_function()

  x0 = zeros(50)
  optimize(rosenbrock, x0, Optim.Options(iterations = 1000000))

end

# Using workers
#------------------
@everywhere function using_workers()
  @sync for w in workers()
    @async @fetchfrom w my_function()
  end
end
@btime using_workers()
#-----------
# 4.018 s (609 allocations: 42.86 KiB)

When I time the following function in a Julia session with 4 processes (julia -p 3), it takes 7.479 s:

# Process 1 involved in calculations as well
#-----------------------------------------------------
@everywhere function using_processes()
  @sync for p in procs()
    @async @fetchfrom p my_function()
  end
end

@btime using_processes()
#-----------
# 7.479 s (12187276 allocations: 3.31 GiB)

The issue with using_processes() is that you cannot really “@async” process 1: the call runs on process 1 itself and, being pure computation, never yields. In the first iteration of the for loop, process 1 starts my_function(), waits until the minimization is over, and only then starts the minimizations in parallel on processes 2 to 4.
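Here is a minimal sketch of that blocking behavior (busy() is a made-up, CPU-bound function with no yield points):

function busy()
  s = 0.0
  for i in 1:10^9
    s += i
  end
  return s
end

@sync begin
  @async busy()                                         # runs to completion first
  @async println("printed only after busy() is done")   # only then does this task run
end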

To rescue the version in which process 1 both coordinates and calculates, one first has to use process 1 to send the calculations to the other processes, and then fetch the results, making sure process 1’s own job comes last:

@everywhere function using_processes_2()

  # SPAWNING
  # No @sync here: @spawnat returns a Future immediately, and
  # reverse(procs()) puts process 1 last, so the workers receive
  # their jobs before process 1 starts computing
  #---------
  results = Future[]
  for p in reverse(procs())
    push!(results, @spawnat p my_function())
  end

  # FETCHING
  # Making sure to hit process 1 at the end
  # (results is in reverse(procs()) order, so process 1 is results[end])
  #----------------------------------------
  @sync for i = 1:nprocs()
    @async fetch(results[i])
  end

end


@btime using_processes_2()
# second run
#-----------
# 4.129 s (12186861 allocations: 3.31 GiB)

2 Likes

Yes, this is what pmap does.
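A quick way to convince yourself (a sketch): the default worker pool that pmap draws from contains only the workers, never process 1.

using Distributed
addprocs(4)

# Every job reports the process it ran on; process 1 only
# schedules and collects, so id 1 never shows up in the result
pmap(x -> myid(), 1:8)   # ids drawn from 2:5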

Can you share the details?
I’m surprised and a little dismayed that pmap is not as efficient as your alternative.
Perhaps it’s a performance bug with pmap?

Running your rosenbrock example on my machine gives similar timings:

using Distributed   # needed on Julia >= 0.7
addprocs()

@everywhere using Optim

@everywhere function rosenbrock(x)
    return sum( 100 * ( x[2:end] - x[1:end-1].^2 ).^2 + ( x[1:end-1] .- 1 ).^2 )
end

@everywhere function my_function(worker::Int)
    x0 = zeros(50)
    optimize(rosenbrock, x0, Optim.Options(iterations = 1000000))
end

function using_workers()
    @sync for w in workers()
        @async @fetchfrom w my_function(w)
    end
end

function using_pmap()
    pmap(my_function, workers())
end

using BenchmarkTools
@btime using_workers()
@btime using_pmap()

1 Like

A question related to this thread: I have access to a server which has 8 Intel® Xeon® Processors E5-2697 v4. The datasheet says

# of Cores 18
# of Threads 36

My workload is “embarrassingly parallel” (MCMC; at some point the calculation breaks up into about 200 functions I could call in parallel, after which I sum the resulting log likelihoods; I plan to use pmap).

So in theory I can use 8 x 36 = 288 threads, i.e. start with

julia -p 288

I will obviously experiment and benchmark, but I have never worked with a machine like this and I am looking for a reasonable starting point.
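For concreteness, the pattern I have in mind looks roughly like this (loglik_piece and θ are made-up stand-ins for my actual model):

using Distributed
addprocs(36)   # a conservative starting point, to be tuned by benchmarking

# Made-up stand-in for one of the ~200 independent likelihood pieces
@everywhere loglik_piece(i, θ) = -0.5 * sum(abs2, θ .- i)

θ = randn(10)
total_loglik = sum(pmap(i -> loglik_piece(i, θ), 1:200))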

That is a nice machine! The rather ancient received wisdom in HPC is that hyperthreading does not do you much good. This is OF COURSE code dependent. Hyperthreading offers throughput improvements when threads stall, so that another thread can take over the compute resources.

@tamas_pass There is a fantastic tool called hwloc which will explore the layout of your server: where the CPU cores, caches and the I/O cards are located.
https://www.open-mpi.org/projects/hwloc/
This is the latest version, though you should just be able to do apt-get install hwloc hwloc-gui
Then run the lstopo utility. There are other utilities as well - I just found hwloc-ps, which looks very useful.

In HPC, process pinning is important - i.e. you use the NUMA tools to pin processes to a set of CPUs.
This means processes do not move around, which would cause cache flushes.
Also, keeping processes near each other means fewer fetches from distant memory banks.
It is going to be interesting to see how Julia behaves on a many-socket machine like this.
Can we do some comparisons of allowing the kernel scheduler to move processes around, versus having them pinned?

Here is what my laptop looks like with lstopo:

[lstopo topology diagram]

I am seeing 100% CPU utilization (in top, so I guess that is for a single CPU), so I am wondering whether to go for 8 or 8x18 processes then. I will benchmark and report back here.

I have benchmarked my PDE code on a multi-socket system with mpirun --bind-to core on OpenMPI, and it didn’t make much difference. I think the scheduler is smart enough to not move processes around too often.

Is it 36 threads per core, or rather 2 threads per core (36 threads total per CPU)?

I was assuming the former, i.e. that each of the 8 items in /proc/cpuinfo is a different CPU and that the specs I linked above apply to each individually, but I have little experience with this.