Baffling addprocs() with @everywhere


#1

Can addprocs() function only be used at the start of code?

@everywhere function samplefun(x::Real) 2.0 end

addprocs(2)   ## if here, it does not work

@everywhere M= 10

sa = SharedArray{Float64}(M)

s= @sync @parallel for i=1:M; sa[i] = samplefun(M); end#for

info( sum(sa) )

if this is so, it would be nice for the compiler to warn the user about subsequent @everywhere. this one took a while to isolate.


#2

I assume the problem was that samplefun wasn’t defined on the two procs you added?

Basically, you need to define the function on each process you want to use said function on. Running function samplefun(x::Real) 2.0 end everywhere means running it on all processes. You can’t run something on a process that doesn’t exist yet.
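A minimal sketch of the working order (Julia 0.7+/1.0 names, where these functions live in the `Distributed` standard library; on 0.6 they are in Base):

```julia
using Distributed   # needed on Julia 0.7+; omit on 0.6

addprocs(2)   # create the workers first ...

@everywhere samplefun(x::Real) = 2.0   # ... then define the function on all of them

# samplefun is now callable on every process, master and workers alike:
results = [remotecall_fetch(samplefun, p, 1.0) for p in procs()]
```

Here `results` collects one answer per process, confirming the definition reached all of them.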


#3

thank you, elrod. indeed.

I do not know how @everywhere works—just that it somehow tells Julia that the function needs to become “globally” accessible by other procs. I thought @everywhere just tagged a function for “outsourcing” later on when the processes spring into action (or added it to a global-type namespace or…). but from your explanation, it actually transfers the function to all processes that exist at the time. this explains my problem.

related question—say my master reads a function from a file or the terminal (say a string containing s= function(x) sum(sin.(x))). the master now wants to pass this string s with its contents on to its workers for parallel processing. so, the master would first addprocs() to create its workers, then somehow stick @everywhere in front of s or the function source, and/or compile the function, and/or pass it on to the workers?! does someone have a simple example for this?

and how one converts an existing plain function into an @everywhere function?!

another related question. I need to get my terminology straight. on my CPU/OS, I am comfortable. on Julia there are cores, threads, processes, and tasks (@async?), not to speak of channels, remotechannels, etc. I think Julia uses variants of the jargon I know from my OS. could someone clarify this here? I think julia calls a “process” what my CPU calls a “thread”—that is, a process in Julia is a pseudo cpu core. a task is what @async creates, but tasks always run on the current process (i.e., CPU core, usually master). etc. this could all be wrong. is there a glossary somewhere that lays this all out?

advice appreciated. /iaw


#4

Processes are different from threads. Open a task manager, and you’ll see a list of running processes. If you addprocs(8), there will be 1+8 Julia processes running, alongside your internet browser, preferred IDE, and whatever else you have running.
They each have their own, independent memory. If you want one of them to do something, you’ll have to tell it. An exception is that global variables defined in Main (ie, the module you’re in when you start the REPL) will transfer automatically if referenced in remotecalls, although that isn’t recommended.
https://docs.julialang.org/en/stable/manual/parallel-computing
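A small sketch of that separation (again Julia 0.7+/1.0 names; on 0.6 these live in Base rather than Distributed):

```julia
using Distributed

addprocs(1)

f(x) = x + 1   # defined only on the master process

# worker 2 has its own memory, so it has never seen f;
# calling it there fails with a RemoteException:
ok = try
    remotecall_fetch(f, 2, 1)
    true
catch
    false
end

@everywhere g(x) = x + 1     # define on every current process
remotecall_fetch(g, 2, 1)    # now works, returns 2
```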

Processes normally do 1 thing at a time (running at 100% of a CPU core), so making more processes is an easy way around that.
Another strategy is to start Julia with more than 1 thread. Setting export JULIA_NUM_THREADS=8 before launching will start Julia with 8 threads. Julia will then consist of only a single process, and all threads will share the same memory. Define a function, and they’re all aware of it.
Running things on more than one thread (eg, split up a for loop via @threads) will let them each work.
Threading tends to have less overhead, and thus to be faster.
The difference you’d see in the process monitor is that when you have 8 processes/threads working full blast with addprocs, you’d see 8 processes each running at 100%, while with threading you’d see a single process running at 800%.

So, given that each process is a semi-separate child of the parent/master process, just think that @everywhere expression runs the expression on all available processes (everywhere), as if you had a bunch of different Julia sessions open, and ran that line in each of them.
If you afterward decide to add 2 more Julia sessions to your roster, they’ll be “fresh”, brand new processes. You’ll have to go back and run things on them; they’re not going to automatically know about stuff you ran on other processes before they were born.
Of course, adding processes is a lot more convenient than starting a bunch of separate sessions if you want to run similar things on each, or transfer data between them.

If you want to include a file that has a bunch of Julia functions everywhere, just @everywhere include("path/to/filename.jl").
Or, @everywhere using ModuleWithFunctions if you have them in a module on your path.
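For the earlier question about a function that only exists as a string at runtime, one possible sketch (Julia 1.0 spellings: Meta.parse/include_string(Main, …) were plain parse/include_string on 0.6, and I am not certain the $ interpolation in @everywhere exists before 0.7):

```julia
using Distributed
addprocs(2)

s = "h(x) = sum(sin.(x))"   # function source arriving as a string

# evaluate the source on every process, master included:
@everywhere include_string(Main, $s)

# h is now defined everywhere and usable from remote calls:
remotecall_fetch(h, 2, [0.0, 0.0])   # returns 0.0
```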


#5

thanks, elrod.

so, if I have an i7 with 4 cores (8 hardware threads), and I am running on just one computer, will Julia threads take advantage of all 8 of the i7’s hardware threads, too?

I did not even know that one OS process can run many threads under linux/macos. I have only ever seen multitasking via separate OS processes.

/iaw


#6

Yes.
If you launch Julia from a terminal, just

$ export JULIA_NUM_THREADS=8
$ julia -O3

The export is only good for that terminal tab, so you’d have to launch Julia from the same place.

Juno/Atom does this automatically, just check your Julia settings.
No support in VSCode yet.
I think if you launch IJulia from a Julia session with 8 threads, you’ll have that many in the IJulia notebook too, but I didn’t test.

To use more than one thread, it’s just

Threads.@threads for i in 1:n 
    Stuff
end  

But be careful of the closure bug (you can get around it via let blocks, or by accessing things through refs / fields of a mutable struct; worst case scenario, make those refs or mutable structs global constants), and of false sharing.
I recall some discussions where cache sizes were a problem, but hopefully you don’t have to worry about anything so low level.
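A sketch of the let-block workaround on a threaded loop (hypothetical function name; the idea is that the closure which @threads builds captures fresh, never-reassigned locals):

```julia
using Base.Threads

function sum_squares(v::Vector{Float64})
    acc = zeros(nthreads())   # one accumulator slot per thread
    # `let` re-binds acc and v so the closure created by @threads
    # captures locals that are never reassigned (sidesteps issue #15276):
    let acc = acc, v = v
        @threads for i in 1:length(v)
            acc[threadid()] += v[i]^2
        end
    end
    sum(acc)
end

sum_squares([1.0, 2.0, 3.0])   # 14.0
```

Note that accumulating into adjacent slots of `acc` is itself a mild example of the false sharing mentioned above; padding the accumulator array would avoid it.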

You should get around 4x speed, maybe slightly more, from using 8 threads. 4 will probably get you just under 4x.


#7

sorry, elrod. I am still struggling with @everywhere use.

for m = 1:10
    addprocs(1)   ## everywhere's must come *after*
    @everywhere M = m
    @everywhere f(inme) = inme^2
    ## run my stuff
end

I want to turn existing objects into @everywhere. for example, m. (could also be another function g, instead.)

ERROR: LoadError: On worker 2:
UndefVarError: m not defined

can this be done? or should I think of the number of processes more as a static, unchangeable feature, just like the number of threads? (it is surprising to me that there is an addprocs() but no addthreads().)

regards,

/iaw


#8

Why do you want to add procs in a for loop?

julia> addprocs(6);

julia> @everywhere for m = 1:6
           global M = m
       end

julia> @everywhere return_m() = M

julia> remotecall_fetch(return_m, 2)
6

julia> remote_do(() -> (global M = 4), 4)

julia> remotecall_fetch(return_m, 4)
4

It is a bad idea to use global variables like that. This is just an example.
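For the loop in #7 specifically, the failure is that m is a master-local variable the worker has never seen. A sketch using the $ interpolation that @everywhere supports on 0.7+ (on 0.6 the spelling was roughly @eval @everywhere M = $m):

```julia
using Distributed
addprocs(1)

for m in 1:3
    @everywhere M = $m   # ships the master's current value of m to all processes
end

remotecall_fetch(() -> M, 2)   # 3 after the loop
```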


#9

writing benchmarks that compare performance with different number of processors.

and for understanding how this all works. after all, I need a modest understanding for the sake of the chapter at http://julia.cookbook.tips/doku.php?id=parallel … (PS: thanking you at the end of the chapter.)

With threads, I can get almost linear speedup, but it’s somewhat inconsistent based on the number of threads and cores. The time can even go up when mustering more cores.

alas, what is the closure bug??


#10

The closure bug: https://github.com/JuliaLang/julia/issues/15276
Sometimes type inference fails on variables caught in a closure. The @threads macro creates a few variables, and captures some in a closure. On 0.7, it automatically creates a let block which helps in most cases.
It doesn’t come up most of the time, but often enough that you should be aware of it. If you suddenly get worse performance, that’s the first place I’d look (@code_warntype, or using Traceur and then @trace, should find the resulting type instabilities).
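As a quick illustration of what such a type instability looks like to @code_warntype (a toy example, not the @threads case):

```julia
function unstable()
    x = 1            # starts as an Int
    for i in 1:3
        x = x / 2    # / returns Float64, so x is inferred Union{Int64, Float64}
    end
    x
end

unstable()   # 0.125
# @code_warntype unstable()   # highlights x::Union{Float64, Int64}
```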

ChrisRackauckas blogged about the optimal number of workers: http://www.stochasticlifestyle.com/236-2/


On a 4-core 8-thread i7 (like what you have), he actually saw the fastest speed with 15 workers.
He explains things more in depth there, but the gist is that the busier your cores, the faster it’ll crunch numbers.
Unfortunately, if the next thing they have to compute on isn’t in a cache close to the compute units, they will have to wait for the data to move there.
If you have extra worker processes, the core could just switch to one of those extras and stay busy instead of waiting for the data to arrive.
However, the more extra workers you have, (a) the more overhead, but also (b) the more data you want to cram into the cache (meaning the smaller the chunk of each worker’s problem that actually fits in the cache at any one time), so the core may end up waiting anyway, and may end up frequently switching from one worker to another.
So the optimal number is going to depend a lot on the individual problem and code being run, as well as your hardware.

EDIT:
Regarding “and Julia must be invoked with -O3”: you don’t have to start Julia with -O3. I just like to.

On the SIMD sections, LLVM should do that for you if you start Julia with -O3. In my experience, it’s also faster than (me) doing it explicitly – and a heck of a lot easier!
@code_llvm will often show you simd code.

julia> using SIMD, BenchmarkTools

julia> function vadd!{N,T}(xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}})
           @assert length(ys) == length(xs)
           @assert length(xs) % N == 0
           @inbounds for i in 1:N:length(xs)
               xv = vload(Vec{N,T}, xs, i)
               yv = vload(Vec{N,T}, ys, i)
               xv += yv
               vstore(xv, xs, i)
           end
       end
vadd! (generic function with 1 method)

julia> function lazy_vadd!(xs::Vector{T}, ys::Vector{T}) where T
           @assert length(ys) == length(xs)
           @inbounds for i in 1:length(xs)
               xs[i] += ys[i]
           end
       end
lazy_vadd! (generic function with 1 method)

julia> x = randn(800); y = randn(800);

julia> @benchmark vadd!($x, $y, Vec{4,Float64})
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.371 μs (0.00% GC)
  median time:      1.376 μs (0.00% GC)
  mean time:        1.457 μs (0.00% GC)
  maximum time:     8.108 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark lazy_vadd!($x, $y)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.192 μs (0.00% GC)
  median time:      1.214 μs (0.00% GC)
  mean time:        1.250 μs (0.00% GC)
  maximum time:     4.172 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark vadd!($x, $y, Vec{8,Float64})
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.280 μs (0.00% GC)
  median time:      1.283 μs (0.00% GC)
  mean time:        1.314 μs (0.00% GC)
  maximum time:     4.988 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

Note how I didn’t even have to add @simd to the for loop.
If you @code_llvm lazy_vadd!(x, y), you’ll get a big wall of text, which includes:

  %wide.load = load <4 x double>, <4 x double>* %37, align 8
  %38 = getelementptr double, double* %36, i64 4
  %39 = bitcast double* %38 to <4 x double>*
  %wide.load11 = load <4 x double>, <4 x double>* %39, align 8
  %40 = getelementptr double, double* %36, i64 8
  %41 = bitcast double* %40 to <4 x double>*
  %wide.load12 = load <4 x double>, <4 x double>* %41, align 8
  %42 = getelementptr double, double* %36, i64 12
  %43 = bitcast double* %42 to <4 x double>*
  %wide.load13 = load <4 x double>, <4 x double>* %43, align 8
  %44 = getelementptr double, double* %32, i64 %35
  %45 = bitcast double* %44 to <4 x double>*
  %wide.load14 = load <4 x double>, <4 x double>* %45, align 8
  %46 = getelementptr double, double* %44, i64 4
  %47 = bitcast double* %46 to <4 x double>*
  %wide.load15 = load <4 x double>, <4 x double>* %47, align 8
  %48 = getelementptr double, double* %44, i64 8
  %49 = bitcast double* %48 to <4 x double>*
  %wide.load16 = load <4 x double>, <4 x double>* %49, align 8
  %50 = getelementptr double, double* %44, i64 12
  %51 = bitcast double* %50 to <4 x double>*
  %wide.load17 = load <4 x double>, <4 x double>* %51, align 8
  %52 = fadd <4 x double> %wide.load, %wide.load14
  %53 = fadd <4 x double> %wide.load11, %wide.load15
  %54 = fadd <4 x double> %wide.load12, %wide.load16
  %55 = fadd <4 x double> %wide.load13, %wide.load17
  %56 = bitcast double* %36 to <4 x double>*
  store <4 x double> %52, <4 x double>* %56, align 8
  %57 = bitcast double* %38 to <4 x double>*
  store <4 x double> %53, <4 x double>* %57, align 8
  %58 = bitcast double* %40 to <4 x double>*
  store <4 x double> %54, <4 x double>* %58, align 8
  %59 = bitcast double* %42 to <4 x double>*
  store <4 x double> %55, <4 x double>* %59, align 8

so you can see it is in fact loading up vectors of length 4, doing SIMD operations on the entire vector, and then storing the 4x doubles.
But, without any of the work on your end to get Julia to do that! Plus, we don’t need the @assert on the vector being properly divisible, etc.


#11

I view addprocs() as an administrative command to be done occasionally in the REPL, not a normal programming command. It’s akin to starting up Julia with workers. I was baffled to see it in a for loop also.


#12

thx again. I think I get the same kind of result that Chris got in my PMaps for the highly parallelizable setup. More processes than physical cores and threads actually worked better. I looked for his code on his blog, but did not find it. I am guessing that he is probably using pmap, too.

I am more confused by the medium-parallelism thread results. If I use 7 or 15 threads, it takes 0.14. If I use 9, 12, or 23, it takes 0.50 or worse. Not just a little, but slower by a factor of 3-4. I think a good recommendation coming out of my little benchmarking exercise is to be conservative with the number of threads (not processes).

SIMD

I copied your lazy_vadd example. Xeon W-2140B CPU @ 3.20GHz, Julia 0.6.2, macOS.

I get the same speed (about 145 ns) regardless of whether I use -O3 or not, and regardless of whether I am using SIMD or not. I looked at the reams of code, and I see

  %43 = getelementptr double, double* %41, i64 2
  %44 = bitcast double* %43 to <2 x double>*
  %wide.load11 = load <2 x double>, <2 x double>* %44, align 8
  %45 = fadd <2 x double> %wide.load, %wide.load10
  %46 = fadd <2 x double> %wide.load9, %wide.load11
  %47 = bitcast double* %37 to <2 x double>*
  store <2 x double> %45, <2 x double>* %47, align 8

I do not see the 4 x version. obviously, with 145 ns I have little to complain about, but I need to tell the readers whether julia is using the SIMD instructions or not; and/or whether the -O3 and/or using SIMD is required or not.

any idea?

regards,

/iaw


#13

I’m not seeing any difference between -O3 and the default -O2 here. In both cases I get llvm like I posted earlier. That was with:

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: AMD FX(tm)-6300 Six-Core Processor
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Piledriver)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, bdver1)

But I get the same thing:

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40* (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT NO_LAPACK NO_LAPACKE ZEN)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, generic)

I also get something similar with:

julia> versioninfo()
Julia Version 0.7.0-DEV.4837
Commit 9ebd3a8* (2018-04-10 01:30 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, znver1)
Environment:

Across both computers and Julia versions, I however consistently see a difference between -O2 and -O3 with:

julia> using StaticArrays, BenchmarkTools

julia> a = @SMatrix randn(8,8); b = @SVector randn(8);

julia> @benchmark $a * $b #with -O2
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.526 ns (0.00% GC)
  median time:      12.848 ns (0.00% GC)
  mean time:        12.883 ns (0.00% GC)
  maximum time:     28.683 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark $a * $b #Different session with -O3
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.995 ns (0.00% GC)
  median time:      9.066 ns (0.00% GC)
  mean time:        9.076 ns (0.00% GC)
  maximum time:     19.546 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @code_llvm a * b

There’s lots of SIMD with -O3, but not -O2 – again, on both of my computers.

Oddly, if I made “a” 4x4 and b 4, the -O2 version was faster on both computers. Keeping sizes the same but switching to single precision, or changing sizes made -O3 faster.
I am not sure why this is. I also tried cranking up BenchmarkTools.DEFAULT_PARAMETERS.samples, but that didn’t change anything.

FWIW, the second processor was already running under full load (all physical cores at 100%) during those benchmarks, which isn’t ideal for reliable timings.

How fast are vadd! and lazy_vadd! ?
A generic llvm that doesn’t recognize zen processors still emits SIMD instructions, because it still recognizes the processor as capable of them. From the terminal:

$ cat /proc/cpuinfo | grep flags
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 mwaitx cpb hw_pstate retpoline retpoline_amd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca

Specifically, look for avx2 (buried about 3/4 of the way through). Googling, your processor is extremely new and should even have avx-512. While avx2 lets your CPU operate on 256 bits at a time (4 doubles), avx-512 is 512 bits, or 8 doubles! (Or 16 singles?)
Although I don’t know if this works the same way on a Mac. I’d bet it doesn’t on Windows.

For what it’s worth, I built Julia from source on both computers. But I don’t think it ought to matter. On a similar note though, if you downloaded a prebuilt binary, you could try rebuilding the Julia system image:

include(joinpath(dirname(JULIA_HOME),"share","julia","build_sysimg.jl")); build_sysimg(force=true)

If you want a better answer, maybe you could start a new thread so others see it. Lots of posters know way more than I do about all this stuff.
Before you sweat too much over this, though: the SIMD versions of StaticArrays were slower on both my computers for some reason.


#14

Yup. It was just a simple Euler-Maruyama loop for solving single SDE trajectories, pmap'd to do a few million trajectories or something like that. Nothing fancy; hyper-parallel.

Multithreading has a much lower overhead and shouldn’t need more threads than cores.


#15

Thank you for all the insightful replies to this thread.
Exposing my ignorance, what is the technique to perform NUMA process affinity/process pinning with Julia?

This rather good article discusses the very good advice regarding (a) running within a cpuset, (b) setting IRQ affinities, and (c) setting performance governors. All good things which I am familiar with from years of work in HPC.
(As an aside on memory tuning: always look at min_free_kbytes. I think of it as the ‘wiggle room’ a Linux system has when it is plumb out of memory; it is set worryingly low in many older Linuxes.)
But there is no explicit mention of numactl and process pinning: you want to pin high-performance codes so they remain on a given processor core, to avoid cache-line invalidation.

Chris Rackauckas’ article is superb also: http://www.stochasticlifestyle.com/236-2/
Note the finding about running on FIFTEEN workers, an odd number.
It echoes what I have spouted on about on other forums - the concept of a ‘donkey engine’ - which is a small, easily started engine which then turns over and starts a huge marine diesel engine. So an ideal CPU die would have a lower powered core reserved for OS tasks.
Of course, in the real world it is cheaper/easier to have all cores identical, so you then look at
https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/linuxtips.md
and use cpusets, reserving a core or two for the OS tasks.


#16

john—I am sorry, but I have never had access to a NUMA machine, so I can’t tell you anything.

elrod—thanks also for the SIMD info. I purchased a julia-computing support contract, partly to support julia, partly to help me answer the more complex questions. so I could ask the folks over there who should know the answer. alas, right now, I am trying to keep their distractions to a min, as they are preparing for the 1.0 release. this SIMD stuff is not so urgent that I cannot wait for it. so, we will get an answer, but it may take a while. I plan to post the answer on my cookbook and here (if it doesn’t slip my mind).

I am also working on improving the benchmarks for the threads and workers. I am seeing inconsistencies when I repeat runs. I know parallel is always a bit inconsistent, but what I am seeing is beyond what I should be seeing. when the time comes, I may beg you and Chris to run a program for me to see if you see similar weird results. because you have an AMD, this will be telling, too.

/iaw


#17

@iwelch You probably do have access to a NUMA machine. Think of a dual socket server.
Being Linux specific here, run the command:
numactl --hardware
You will get a list of the NUMA domains on the server and the cores and memory in each.

Going slightly off topic: with Cluster On Die snooping, even a single socket can appear to be a NUMA node.


I got myself tied in knots with that one once. I believe, though, that in the Skylake generation this option is no longer available.


#18

I am posting some benchmark-based advice in another thread in a moment.