Can the addprocs() function only be used at the start of a program?
@everywhere function samplefun(x::Real) 2.0 end
addprocs(2)   ## if here, it does not work
@everywhere M = 10
sa = SharedArray{Float64}(M)
s = @sync @parallel for i = 1:M; sa[i] = samplefun(M); end
info(sum(sa))
if this is so, it would be nice for the compiler to warn the user when @everywhere definitions precede addprocs(). this one took a while to isolate.
I assume the problem was that samplefun wasn't defined on the two procs you added?
Basically, you need to define the function on each process you want to use said function on. Running function samplefun(x::Real) 2.0 end everywhere means running it on all processes. You can't run something on a process that doesn't exist yet.
I do not know how @everywhere works, just that it somehow tells Julia that the function needs to become "globally" accessible by other procs. I thought @everywhere just tagged a function for "outsourcing" later on when the processes spring into action (or added it to a global-type namespace or ...). but from your explanation, it actually copies the definition to all processes that exist at the time. this explains my problem.
related question: say my master reads a function definition from a file or the terminal (say a string s containing "f(x) = sum(sin.(x))"). the master now wants to pass this string s with its contents on to its workers for parallel processing. so, the master would first addprocs() to create its workers, then somehow stick @everywhere in front of s or the function source, and/or compile the function, and/or pass it on to the workers?! does someone have a simple example for this?
and how does one convert an existing plain function into an @everywhere function?!
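would something like this be the right approach? (untested sketch; the names s and f are made up, and I assume addprocs() has already been called)
s = "f(x) = sum(sin.(x))"                        ## function source received as a string
for p in procs()                                 ## procs() = the master plus every existing worker
    remotecall_fetch(s -> eval(parse(s)), p, s)  ## parse and evaluate the string in Main on each process (Meta.parse on 0.7+)
end
pmap(f, [rand(10) for i in 1:4])                 ## every process, master included, can now call f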
another related question. I need to get my terminology straight. on my CPU/OS, I am comfortable. on Julia there are cores, threads, processes, and tasks (@async?), not to speak of channels, remotechannels, etc. I think Julia uses variants of the jargon I know from my OS. could someone clarify this here? I think julia calls a "process" what my CPU calls a "thread", that is, a process in Julia is a pseudo cpu core. a task is what @async creates, but tasks always run on the current process (i.e., CPU core, usually master). etc. this could all be wrong. is there a glossary somewhere that lays this all out?
Processes are different from threads. Open a task manager, and you'll see a list of running processes. If you addprocs(8), there will be 1+8 Julia processes running, alongside your internet browser, preferred IDE, and whatever else you have running.
They each have their own, independent memory. If you want one of them to do something, you'll have to tell it. An exception is that global variables defined in Main (i.e., the module you're in when you start the REPL) will transfer automatically if referenced in remotecalls, although that isn't recommended. https://docs.julialang.org/en/stable/manual/parallel-computing
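For example (a minimal sketch, assuming a worker with id 2 exists):
x = rand(3)                          ## a global in Main on the master
remotecall_fetch(() -> sum(x), 2)    ## x is serialized over to worker 2 along with the closure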
Processes normally do 1 thing at a time (running at 100% of a CPU core), so making more processes is an easy way around that.
Another strategy is to start Julia with more than 1 thread. Running export JULIA_NUM_THREADS=8 in the shell before launching makes Julia start with 8 threads. Julia will then consist of only a single process instead of 8, and all threads will share the same memory. Define a function, and they're all aware.
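You can check that it took effect from inside Julia with Threads.nthreads():
julia> Threads.nthreads()
8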
Running things on more than one thread (e.g., splitting up a for loop via @threads) lets each of them do part of the work.
Threading tends to have less overhead, and thus to be faster.
The difference youâd see in the process monitor is that when you have 8 processes/threads working full blast with addprocs, youâd see 8 processes each running at 100%, while with threading youâd see a single process running at 800%.
So, given that each process is a semi-separate child of the parent/master process, just think that @everywhere expression runs the expression on all available processes (everywhere), as if you had a bunch of different Julia sessions open, and ran that line in each of them.
If you afterward decide to add 2 more Julia sessions to your roster, they'll be "fresh", brand new processes. You'll have to go back and run things on them; they're not going to automatically know about stuff you ran on other processes before they were born.
Of course, adding processes is a lot more convenient than starting a bunch of sessions if you want to run similar things on each and transfer data between them.
If you want to include a file that has a bunch of Julia functions everywhere, just @everywhere include("path/to/filename.jl").
Or, @everywhere using ModuleWithFunctions if you have them in a module on your path.
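And for the snippet at the top of this thread, the fix is simply to create the workers before the @everywhere lines (or to re-run those lines after adding workers), roughly:
addprocs(2)                                      ## create the workers first
@everywhere function samplefun(x::Real) 2.0 end  ## now every existing process gets the definition
@everywhere M = 10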
so, if I have an i7 with 4 cores (8 hardware threads) and I am running on just one computer, will Julia threads take advantage of all 8 of the i7's hardware threads, too?
I did not even know that one OS process can run many threads under linux/macos. I have only ever seen multitasking via separate OS processes.
The export is only good for that terminal tab, so you'd have to launch Julia from the same place.
Juno/Atom does this automatically, just check your Julia settings.
No support in VSCode yet.
I think if you launch IJulia from a Julia session with 8 threads, you'll have that many in the IJulia notebook too, but I didn't test.
To use more than one thread, it's just
Threads.@threads for i in 1:n
    ## loop body; the iterations are split across the available threads
end
But be careful of the closure bug (you can get around it via let blocks, or by accessing things through refs / fields of a mutable struct; worst case, make those refs or the mutable struct global constants), and of false sharing.
I recall some discussions where cache sizes were a problem, but hopefully you donât have to worry about anything so low level.
You should get around a 4x speedup, maybe slightly more, from using 8 threads. 4 threads will probably get you just under 4x.
sorry, Elrod. I am still struggling with @everywhere use.
for m = 1:10
    addprocs(1)   ## everywhere's must come *after*
    @everywhere M = m
    @everywhere f(inme) = inme^2
    ## run my stuff
end
I want to turn existing objects into @everywhere objects. for example, m. (it could also be another function g, instead.)
ERROR: LoadError: On worker 2:
UndefVarError: m not defined
can this be done? or should I think of the number of processors more like a static, unchangeable feature, just like the number of threads? (it is surprising to me that there is an addprocs() but not an addthreads().)
julia> addprocs(6);
julia> @everywhere for m = 1:6
global M = m
end
julia> @everywhere return_m() = M
julia> remotecall_fetch(return_m, 2)
6
julia> remote_do(() -> (global M = 4), 4)
julia> remotecall_fetch(return_m, 4)
4
It is a bad idea to use global variables like that. This is just an example.
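If you really do want the master's loop variable on every worker, a sketch along the same lines (assuming the workers were already added) is to ship it as a function argument rather than hoping @everywhere will see the local m:
for m = 1:10
    for p in workers()
        remotecall_fetch(v -> (global M = v; nothing), p, m)   ## m travels to worker p as an argument
    end
    ## run the parallel work that reads M on the workers
end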
writing benchmarks that compare performance with different numbers of processors.
and for understanding how this all works. after all, I need a modest understanding for the sake of the chapter at http://julia.cookbook.tips/doku.php?id=parallel ... (PS: I am thanking you at the end of the chapter.)
With threads, I can get almost linear speedup, but it's somewhat inconsistent depending on the number of threads and cores. The run time can even go up when mustering more cores.
The closure bug: performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub
Sometimes type inference fails on variables caught in a closure. The @threads macro creates a few variables, and captures some in a closure. On 0.7, it automatically creates a let block which helps in most cases.
It doesn't come up most of the time, but often enough that you should be aware of it. If you suddenly get worse performance, that's the first place I'd look (@code_warntype, and @trace from Traceur, should find the resulting type instabilities).
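A minimal illustration of the mechanism (not the @threads case itself, just the same boxing issue and the let workaround):
function boxed()
    r = 1.0
    g() = r        ## r is captured here ...
    r = 2.0        ## ... and reassigned afterwards, so it gets stored in a Box (issue #15276)
    g()            ## this call is no longer type stable
end

function unboxed()
    r = 1.0
    r = 2.0
    let r = r      ## a fresh binding that is never reassigned, so no Box
        g() = r
        g()        ## type stable again
    end
end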
ChrisRackauckas blogged about the optimal number of workers:
On a 4-core 8-thread i7 (like what you have), he actually saw the fastest speed with 15 workers.
He explains things more in depth there, but the gist is that the busier your cores, the faster they'll crunch numbers.
Unfortunately, if the next thing they have to compute on isn't in a cache close to the compute units, they will have to wait for the data to move there.
If you have extra worker processes, the core could just switch to one of those extras and stay busy instead of waiting for the data to arrive.
However, the more extra workers you have, (a) the more overhead, and (b) the more data you want to cram into the cache, meaning the smaller the chunk of each worker's problem that actually fits in the cache at any one time, so the core may end up waiting anyway, and could end up frequently switching from one worker to another.
So the optimal number is going to depend a lot on the individual problem and code being run, as well as your hardware.
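If you want to play with this yourself, a toy version of that kind of benchmark (made-up workload, not the code from the blog post) could look like:
using BenchmarkTools
addprocs(15)                                  ## try different worker counts here
@everywhere work(n) = sum(abs2, randn(n))     ## stand-in for one cheap trajectory
@btime pmap(work, fill(10_000, 1_000));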
EDIT: and Julia must be invoked with -O3.
You donât have to start Julia with -O3. I just like to.
On the SIMD sections, LLVM should do that for you if you start Julia with -O3. In my experience, it's also faster than (me) doing it explicitly, and a heck of a lot easier! @code_llvm will often show you SIMD code.
julia> using SIMD, BenchmarkTools
julia> function vadd!{N,T}(xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}})
@assert length(ys) == length(xs)
@assert length(xs) % N == 0
@inbounds for i in 1:N:length(xs)
xv = vload(Vec{N,T}, xs, i)
yv = vload(Vec{N,T}, ys, i)
xv += yv
vstore(xv, xs, i)
end
end
vadd! (generic function with 1 method)
julia> function lazy_vadd!(xs::Vector{T}, ys::Vector{T}) where T
@assert length(ys) == length(xs)
@inbounds for i in 1:length(xs)
xs[i] += ys[i]
end
end
lazy_vadd! (generic function with 1 method)
julia> x = randn(800); y = randn(800);
julia> @benchmark vadd!($x, $y, Vec{4,Float64})
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.371 μs (0.00% GC)
median time: 1.376 μs (0.00% GC)
mean time: 1.457 μs (0.00% GC)
maximum time: 8.108 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark lazy_vadd!($x, $y)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.192 μs (0.00% GC)
median time: 1.214 μs (0.00% GC)
mean time: 1.250 μs (0.00% GC)
maximum time: 4.172 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark vadd!($x, $y, Vec{8,Float64})
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.280 μs (0.00% GC)
median time: 1.283 μs (0.00% GC)
mean time: 1.314 μs (0.00% GC)
maximum time: 4.988 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
Note how I didn't even have to add @simd to the for loop.
If you @code_llvm lazy_vadd!(x, y), you'll get a big wall of text, which includes:
%wide.load = load <4 x double>, <4 x double>* %37, align 8
%38 = getelementptr double, double* %36, i64 4
%39 = bitcast double* %38 to <4 x double>*
%wide.load11 = load <4 x double>, <4 x double>* %39, align 8
%40 = getelementptr double, double* %36, i64 8
%41 = bitcast double* %40 to <4 x double>*
%wide.load12 = load <4 x double>, <4 x double>* %41, align 8
%42 = getelementptr double, double* %36, i64 12
%43 = bitcast double* %42 to <4 x double>*
%wide.load13 = load <4 x double>, <4 x double>* %43, align 8
%44 = getelementptr double, double* %32, i64 %35
%45 = bitcast double* %44 to <4 x double>*
%wide.load14 = load <4 x double>, <4 x double>* %45, align 8
%46 = getelementptr double, double* %44, i64 4
%47 = bitcast double* %46 to <4 x double>*
%wide.load15 = load <4 x double>, <4 x double>* %47, align 8
%48 = getelementptr double, double* %44, i64 8
%49 = bitcast double* %48 to <4 x double>*
%wide.load16 = load <4 x double>, <4 x double>* %49, align 8
%50 = getelementptr double, double* %44, i64 12
%51 = bitcast double* %50 to <4 x double>*
%wide.load17 = load <4 x double>, <4 x double>* %51, align 8
%52 = fadd <4 x double> %wide.load, %wide.load14
%53 = fadd <4 x double> %wide.load11, %wide.load15
%54 = fadd <4 x double> %wide.load12, %wide.load16
%55 = fadd <4 x double> %wide.load13, %wide.load17
%56 = bitcast double* %36 to <4 x double>*
store <4 x double> %52, <4 x double>* %56, align 8
%57 = bitcast double* %38 to <4 x double>*
store <4 x double> %53, <4 x double>* %57, align 8
%58 = bitcast double* %40 to <4 x double>*
store <4 x double> %54, <4 x double>* %58, align 8
%59 = bitcast double* %42 to <4 x double>*
store <4 x double> %55, <4 x double>* %59, align 8
so you can see it is in fact loading up vectors of length 4, doing SIMD operations on the entire vector, and then storing the 4x doubles.
But without any of the work on your end to get Julia to do that! Plus, we don't need the @assert on the vector being properly divisible, etc.
I view addprocs() as an administrative command to be run occasionally in the REPL, not a normal programming command. It's akin to starting up Julia with workers. I was also baffled to see it in a for loop.
thx again. I think I get the same kind of result in my pmaps that Chris got for the highly parallelizable setup. More processes than physical cores and threads actually worked better. I looked for his code on his blog, but did not find it. I am guessing that he is probably using pmap, too.
I am more confused by the medium-parallelism thread results. If I use 7 or 15 threads, it takes 0.14. If I use 9, 12, or 23, it takes 0.50 or worse: not just a little slower, but slower by a factor of 3-4. I think a good recommendation coming out of my little benchmarking exercise is to be conservative with the number of threads (not processes).
SIMD
I copied your lazy_vadd! example. W-2140B CPU @ 3.20GHz. julia 0.6.2. macOS.
I get the same speed (about 145ns) regardless of whether I use -O3 or not, and regardless of whether I am using SIMD or not. I looked at the reams of code, and I see
%43 = getelementptr double, double* %41, i64 2
%44 = bitcast double* %43 to <2 x double>*
%wide.load11 = load <2 x double>, <2 x double>* %44, align 8
%45 = fadd <2 x double> %wide.load, %wide.load10
%46 = fadd <2 x double> %wide.load9, %wide.load11
%47 = bitcast double* %37 to <2 x double>*
store <2 x double> %45, <2 x double>* %47, align 8
I do not see the 4x version. obviously, with 145 ns I have little to complain about, but I need to tell the readers whether julia is using the SIMD instructions or not, and/or whether -O3 and/or using SIMD is required.
However, across both computers and Julia versions, I consistently see a difference between -O2 and -O3 with:
julia> using StaticArrays, BenchmarkTools
julia> a = @SMatrix randn(8,8); b = @SVector randn(8);
julia> @benchmark $a * $b #with -O2
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 12.526 ns (0.00% GC)
median time: 12.848 ns (0.00% GC)
mean time: 12.883 ns (0.00% GC)
maximum time: 28.683 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999
julia> @benchmark $a * $b #Different session with -O3
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 8.995 ns (0.00% GC)
median time: 9.066 ns (0.00% GC)
mean time: 9.076 ns (0.00% GC)
maximum time: 19.546 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999
julia> @code_llvm a * b
There's lots of SIMD with -O3, but not -O2, again on both of my computers.
Oddly, if I made a 4x4 and b length 4, the -O2 version was faster on both computers. Keeping the sizes the same but switching to single precision, or changing the sizes, made -O3 faster.
I am not sure why this is. I also tried cranking up BenchmarkTools.DEFAULT_PARAMETERS.samples, but that didn't change anything.
FWIW, the second processor was already running under full load (all physical cores at 100%) during those benchmarks, which isn't ideal for reliable timings.
How fast are vadd! and lazy_vadd!?
A generic LLVM build that doesn't recognize Zen processors still emits SIMD instructions, because it still recognizes the processor as capable of them. From the terminal (on Linux), something like cat /proc/cpuinfo will dump the CPU feature flags.
Specifically, look for avx2 (buried about 3/4 of the way through). Googling, your processor is extremely new and should even have avx-512. While avx2 lets your CPU operate on 256 bits at a time (4 doubles), avx-512 is 512 bits, or 8 doubles! (Or 16 singles.)
Although I don't know if this works the same way on a Mac. I would bet it doesn't on Windows.
For what it's worth, I built Julia from source on both computers. But I don't think it ought to matter. On a similar note though, if you downloaded a prebuilt binary, you could try rebuilding the Julia system image.
If you want a better answer, maybe you could start a new thread so others see it. Lots of posters know way more than I do about all this stuff.
Before you sweat too much over this: the SIMD versions of StaticArrays were slower on both my computers, for some reason.
Yup. It was just a simple Euler-Maruyama loop for solving single SDE trajectories, pmap'd to do a few million trajectories or something like that. Nothing fancy, just hyper-parallel.
Multithreading has a much lower overhead and shouldn't need more threads than cores.
Thank you for all the insightful replies to this thread.
Exposing my ignorance, what is the technique to perform NUMA process affinity/process pinning with Julia?
This rather good article discusses advice regarding (a) running within a cpuset, (b) setting IRQ affinities, and (c) setting performance governors. All good things which I am familiar with from years of work in HPC.
(As an aside on memory tuning: always look at min_free_kbytes. I think of it as the "wiggle room" a Linux system has when it is plumb out of memory; it is set worryingly low in many older Linux distributions.)
But there is no explicit mention of numactl and process pinning: you want to pin high-performance codes to a given processor core to avoid cache-line invalidation.
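For example (an illustrative invocation, not taken from the article), numactl --cpunodebind=0 --membind=0 julia myscript.jl launches Julia with both its CPUs and its memory allocations confined to NUMA node 0.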
Chris Rackauckas's article is also superb: http://www.stochasticlifestyle.com/236-2/
Note the finding about running on FIFTEEN workers - an odd number.
It echoes what I have spouted on about on other forums: the concept of a "donkey engine", which is a small, easily started engine that then turns over and starts a huge marine diesel engine. So an ideal CPU die would have a lower-powered core reserved for OS tasks.
Of course, in the real world it is cheaper/easier to have all cores identical, so you then look at https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/linuxtips.md
and use cpusets, reserving a core or two for the OS tasks.
john: I am sorry, but I have never had access to a NUMA machine, so I can't tell you anything.
elrod: thanks also for the SIMD info. I purchased a julia-computing support contract, partly to support julia, partly to help me answer the more complex questions. so I could ask the folks over there who should know the answer. alas, right now, I am trying to keep their distractions to a minimum, as they are preparing for the 1.0 release. this SIMD stuff is not so urgent that I cannot wait for it. so, we will get an answer, but it may take a while. I plan to post the answer on my cookbook and here (if it doesn't slip my mind).
I am also working on improving the benchmarks for the threads and workers. I am seeing inconsistencies when I repeat runs. I know parallel is always a bit inconsistent, but what I am seeing is beyond what I should be seeing. when the time comes, I may beg you and Chris to run a program for me to see if you see similar weird results. because you have an AMD, this will be telling, too.
@iwelch You probably do have access to a NUMA machine. Think of a dual socket server.
Being Linux specific here, run the command:
numactl --hardware
You will get a list of the NUMA domains on the server and the cores and memory in each.
Going slightly off topic: with Cluster-on-Die snooping, even a single socket can appear as more than one NUMA node.
I got myself tied in knots with that one once. I believe that in the Skylake generation, though, this option is no longer available.