# Simple Parallel Examples for Embarrassingly Simple Problems

#1

macOS, julia 0.6.2, Feb 2018, captain’s log: I have been playing around with trying to understand (the docs on) parallelism. as an early example, I just want to create, in parallel, a vector of values of

``````function logp1(x::Int64)::Float64 log(1.0+x); end#function
``````

Because I want to check the result, I later add the values up, but you can assume that I do not actually want the sum-reduce.

``````function logp1(x::Int64)::Float64 log(1.0+x); end#function
M= 1000000
global result

println("processors: ", nprocs())
println("compile and test: ", logp1( 1 ) )

@time begin
n=Vector{Float64}(M)
for i=1:M; n[i]= logp1(i); end#for
println("\nplain loop: ", sum(n))
end

@time begin
r= pmap( logp1, 1:M )
println( "\npmap: ", sum(r))
end

@time begin
a=SharedArray{Float64}(M)
s= @parallel (+) for i=1:M; a[i]=logp1(i); end#for
println("\nsharedarray reducer parallel: ", s)
end

@time begin
a=SharedArray{Float64}(M)
s= @parallel for i=1:M; a[i]=logp1(i); end#for
fetch(s)
println("\nFAILS sharedarray parallel: ", sum(a))
end
``````
1. I need some basic help: the last example is straight out of the docs, but it omits the basic hint of how the program is supposed to fetch the results!?
2. can I also set the number of processors inside a running julia session (à la R?), or is it only settable at startup, like `julia -p 6`?
3. when I try this not just with one processor, but with more (`-p 2`), my pmap complains
``````ERROR: LoadError: On worker 2:
UndefVarError: #logp1 not defined
``````

although the function `logp1` was of course nicely defined in global scope upfront.

help appreciated. corrected example based on responses is posted below.

#2

I believe you need to share your function with the workers. To do this, just define it with `@everywhere`. That is,

``````@everywhere function logp1(x::Int64)::Float64 log(1.0+x); end#function
``````

Then again, if you just want the answer, you probably don’t need parallel or a loop; you can just calculate it all at once as a vector, like `log.(1 .+ (1:1000))`.
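For reference, a minimal sketch of that vectorized approach. Note the parentheses around the range: `.+` binds more tightly than `:`, so without them the expression parses as a different range.

``````julia
# loop-free version of the computation; without the parentheses,
# `1 .+ 1:1000` would parse as `(1 .+ 1):1000`, i.e. `2:1000`
v = log.(1.0 .+ (1:1000))
``````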

#3

Since you allocated the array `a` before the loop, you don’t need to fetch anything; in this case `fetch` simply blocks the code from continuing until the loop is complete. I would just add an `@sync` directly in front of the `@parallel` instead.

Use `addprocs(N)` to add processors while Julia is running. `rmprocs(pid)` can be used to remove processors by their process id (use `procs()` to see the ids of running processors).
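Putting both answers together, a minimal sketch of the corrected workflow might look like this (Julia 0.6 syntax, matching the thread; in later versions `@parallel` was replaced by `@distributed` from the `Distributed` standard library):

``````julia
addprocs(2)                          # add two workers at runtime

@everywhere logp1(x) = log(1.0 + x)  # define logp1 on every worker

M = 1000
a = SharedArray{Float64}(M)
@sync @parallel for i = 1:M          # @sync blocks until all workers are done
    a[i] = logp1(i)
end
println("parallel sum: ", sum(a))
``````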

#4

thank you, pasha. indeed, with the `@everywhere` in front of logp1, it works. (of course, the obvious `log.(1.0 .+ (1:M))` was the way to calculate this; I just needed some function for test purposes.)

thank you, ksmcreynolds. the `@sync` before the `@parallel` solves the collection problem, and `addprocs(Int)` allows setting the number of processors dynamically. for reference, please see the corrected example below.

ksmcreynolds: one strange aspect is that `rmprocs` seems to take effect only once:

``````julia> nprocs()
6

julia> rmprocs(3)

julia> nprocs()
5
``````

and further `rmprocs` calls do not seem to have any effect.

`pmap` seems so incredibly slow that it is nearly unusable. this is not because of my example above (where, of course, the operation is small relative to the overhead), but because my “mental benchmark” is the equivalent slow R function, which typically copies the entire Unix process (yes!) and still manages

``````> t <- Sys.time(); x=mclapply( 1:1000000, function(x) log(1.0+x) );
> print(Sys.time()-t)
Time difference of 0.5312 secs
``````

in contrast, julia’s pmap took 50 seconds when multiple cores were available to the process, and 5 seconds when only one core was available (presumably because it then falls back internally on not spawning worker processes).

#5

Just FYI, `Base.log1p` exists.

#6

And it’s more accurate for small `x`.

``````julia> log(1+1e-17)
0.0

julia> log1p(1e-17)
1.0e-17
``````

#7

And there is `Base.Math.JuliaLibm.log1p`, which is a bit better.

#8

deleted in favor of mohamed’s post.

#9

Don’t time in global scope; use a function instead. And use `@btime` from BenchmarkTools.jl for more accurate benchmarking results. The above benchmarks are highly questionable: in global scope, variables are type-unstable, so you are timing the slow version of Julia. Also, if you run the function only once, you will be including the compilation time and allocations of the functions used in your code. `Threads.@threads` is also likely to beat all the above approaches in this case. Here is a refined set of benchmarks with 8 threads:

``````using BenchmarkTools

function simple_loop_sum()
    M = 1000000
    n = Vector{Float64}(M)
    for i = 1:M; n[i] = log1p(i); end#for
    return sum(n)
end

function sharedarray_parallel_sum()
    M = 1000000
    a = SharedArray{Float64}(M)
    @sync @parallel for i = 1:M; a[i] = log1p(i); end#for
    return sum(a)
end

function pmap_sum()
    M = 1000000
    r = pmap(log1p, 1:M, batch_size=ceil(Int, M/nworkers()))
    return sum(r)
end

function sharedarray_mapreduce()
    M = 1000000
    a = SharedArray{Float64}(M)
    s = @parallel (+) for i = 1:M; a[i] = log1p(i); end#for
    return s
end

function threaded_sum()
    M = 1000000
    a = Vector{Float64}(M)
    Threads.@threads for i = 1:M
        a[i] = log1p(i)
    end#for
    return sum(a)
end

println("\nplain loop: ", simple_loop_sum())
println("\nsharedarray parallel: ", sharedarray_parallel_sum())
println("\npmap: ", pmap_sum())
println("\nsharedarray reducer parallel: ", sharedarray_mapreduce())
println("\nthreaded: ", threaded_sum())

@btime simple_loop_sum()
#16.741 ms (2 allocations: 7.63 MiB)
@btime sharedarray_parallel_sum()
#8.571 ms (2384 allocations: 85.86 KiB)
@btime pmap_sum()
#4.120 s (7012363 allocations: 181.55 MiB)
@btime sharedarray_mapreduce()
#7.916 ms (1963 allocations: 122.11 KiB)
@btime threaded_sum()
#4.039 ms (3 allocations: 7.63 MiB)
``````

The `pmap` version is really absurd but I am not sure why.

#10

change the batch size

#11

Ok now significantly less absurd:

``````function pmap_sum()
M = 1000000
r = pmap(log1p, 1:M, batch_size=ceil(Int,M/nworkers()))
return sum(r)
end
@btime pmap_sum()
#4.120 s (7012363 allocations: 181.55 MiB)
``````

#12

I don’t think it should matter for this case, but try adding a caching pool.

#13

One thing I’ve been wondering about is the difference between multithreading and parallel processing. Can someone explain to me when I should use the former versus the latter, or point me to some reference? Up till now I’ve just been using threading because it seems like a simpler way of getting a performance increase. I tried doing some parallel calculations once, but that was severely lacking because I couldn’t define a `SharedArray` for my own types, and I had around 400 MB of data being used in each of the parallel processes/threads.

#14

Multiprocessing, which we’re calling parallel here, is not shared memory. It can be distributed across multiple computers (nodes of an HPC cluster). You don’t want to use a `SharedArray` unless you have to: you should limit the amount of data being shared and be careful about exchanging data. But it scales to much larger problems, since you can use thousands or millions of cores.
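To make the contrast concrete, here is a hedged side-by-side sketch (Julia 0.6 syntax, matching the thread; the function names are mine, and the threaded version requires `JULIA_NUM_THREADS` to be set before startup):

``````julia
# multithreading: one process, shared memory -- every thread writes
# directly into the same Vector
function threaded_fill(M)
    a = Vector{Float64}(M)
    Threads.@threads for i = 1:M
        a[i] = log1p(i)
    end
    return sum(a)
end

# multiprocessing: separate worker processes -- memory is shared only
# because we explicitly opted in with a SharedArray
function distributed_fill(M)
    a = SharedArray{Float64}(M)
    @sync @parallel for i = 1:M
        a[i] = log1p(i)
    end
    return sum(a)
end
``````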

#15

Right, that was kind of what I was thinking too, thanks for the clarification! Basically, whenever I can partition my data such that each piece is a standalone portion of the final problem, parallel processing would be a good idea, whereas if all parts of the divisible problem need the same/full set of data, threading would be a better idea. Is that right?

#16

Kind of. It’s more like, if you’re on a single shared memory machine (i.e. one computer) you should probably use multithreading. Anything else needs multiprocessing. (Though there can be some extra complications)

#17

If that’s what you mean, it doesn’t really help.

``````function pmap_sum()
M = 1000000
r = pmap(CachingPool(workers()),log1p, 1:M, batch_size=ceil(Int,M/nworkers()))
return sum(r)
end
@btime pmap_sum()
#4.379 s (7014529 allocations: 181.59 MiB)
``````

#19

there is something else I do not understand.

why is mohamed’s `simple_loop_sum()` slower than the other, parallel versions, specifically `sharedarray_parallel_sum()`, even with one processor (`nprocs() == 1`)? (I also confirmed this on my own computer.)

#20

I cannot reproduce what you are saying. These are the timings for 1 processor:

``````julia> @btime simple_loop_sum();
12.584 ms (2 allocations: 7.63 MiB)

julia> @btime sharedarray_parallel_sum();
14.997 ms (209 allocations: 10.00 KiB)

julia> @btime pmap_sum();
2.154 s (7999569 allocations: 152.59 MiB)

julia> @btime sharedarray_mapreduce();
15.281 ms (205 allocations: 9.72 KiB)
``````