Simple Parallel Examples for Embarrassingly Simple Problems

macOS, julia 0.6.2, Feb 2018, captain’s log: I have been playing around trying to understand (the docs on) parallelism. as an early example, I just want to create, in parallel, a list of the values of

function logp1(x::Int64)::Float64 log(1.0+x); end#function

Because I want to check the result, I add the values up at the end, but assume that I do not really need the sum-reduce.

function logp1(x::Int64)::Float64 log(1.0+x); end#function
M= 1000000
global result

println("processors: ", nprocs())
println("compile and test: ", logp1( 1 ) )

@time begin
    n=Vector{Float64}(M)
    for i=1:M; n[i]= logp1(i); end#for
    println("\nplain loop: ", sum(n))
end

@time begin
    r= pmap( logp1, 1:M )
    println( "\npmap: ", sum(r))
end

@time begin
    a=SharedArray{Float64}(M)
    s= @parallel (+) for i=1:M; a[i]=logp1(i); end#for
    println("\nsharedarray reducer parallel: ", s)
end

@time begin
    a=SharedArray{Float64}(M)
    s= @parallel for i=1:M; a[i]=logp1(i); end#for
    fetch(s)
    println("\nFAILS sharedarray parallel: ", sum(a))
end
  1. I need some basic help: the last example is right out of the docs, but it omits the basic hint of how the program is supposed to fetch the results!?
  2. can I also set the number of processors inside a running julia (à la R?), or is it only settable on startup, like julia -p 6?
  3. when I try this not just with one processor, but with more (-p 2), my pmap complains
ERROR: LoadError: On worker 2:
UndefVarError: #logp1 not defined

although the function logp1 was of course nicely defined in global scope upfront.

help appreciated. a corrected example based on the responses is posted below.

I believe you need to share your function with the workers. To do this just define it with @everywhere . That is,

@everywhere function logp1(x::Int64)::Float64 log(1.0+x); end#function

Then again, if you just want the answer, you probably don’t need parallel or a loop - you can just calculate it at once as a vector, like log.(1 .+ (1:1000))

Since you allocated the array a before the loop, you don’t need to fetch anything. In this case it simply blocks the code from continuing until the loop is complete. I would just add an @sync directly in front of the @parallel instead.
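
A minimal sketch of that fix applied to the failing example above (assuming logp1 has already been defined on the workers with @everywhere):

a = SharedArray{Float64}(M)
# @sync blocks here until every worker has finished writing its part of a
@sync @parallel for i=1:M; a[i]=logp1(i); end#for
println("sharedarray parallel: ", sum(a))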

Use addprocs(N) to add worker processes while Julia is running. rmprocs(pid) can be used to remove a worker by its process id (use procs() to see the ids of the running processes).
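
A quick sketch of that workflow at the REPL (the worker count of 2 is just illustrative):

addprocs(2)          # add two local worker processes
procs()              # ids of all running processes (the master is always id 1)
nworkers()           # number of worker processes
rmprocs(workers())   # remove all workers again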

thank you, pasha. indeed, with the @everywhere in front of logp1, it works. (of course, the obvious log.(1.0 .+ (1:M)) was the way to calculate this; I just needed some function for test purposes.)

thank you, ksmcreynolds. the @sync before the @parallel solves the collection problem. and addprocs(n) allows changing the number of worker processes dynamically. for reference, please see the corrected example below.

ksmcreynolds: one strange aspect is that

julia> nprocs()
6

julia> rmprocs(3)
Task (done) @0x0000000126edb850

julia> nprocs()
5

and further changes do not seem to matter.

pmap seems so incredibly slow that it is nearly unusable. this is not because of my example above (where, of course, the operation is tiny relative to the overhead), but because my “mental benchmark” is the equivalent slow R function, which typically copies the entire Unix process (yes!) and still manages

> t <- Sys.time(); x=mclapply( 1:1000000, function(x) log(1.0+x) );
> print(Sys.time()-t)
Time difference of 0.5312 secs

in contrast, julia took 50 seconds on pmap when multiple worker processes were available, and 5 seconds when only 1 was available (presumably because pmap then falls back to not distributing the work at all).

Just FYI, Base.log1p exists.

And it’s more accurate for small x.

julia> log(1+1e-17)
0.0

julia> log1p(1e-17)
1.0e-17

And there is Base.Math.JuliaLibm.log1p, which is a bit better.

deleted in favor of mohamed’s post.

Don’t time in global scope, use a function instead. And use @btime from BenchmarkTools.jl for more accurate benchmarking results. The above benchmarks are highly questionable: in global scope, variables are type-unstable, so you are timing the slow version of Julia. Also, if you run the function only once, you are including the compilation time and the allocations of the functions used in your code. Threads.@threads is also likely to beat all the above approaches in this case. Here is a refined set of benchmarks with 8 threads:

using BenchmarkTools

function simple_loop_sum()
    M = 1000000
    n = Vector{Float64}(M)
    for i=1:M; n[i] = log1p(i); end#for
    return sum(n)
end

function sharedarray_parallel_sum()
    M = 1000000
    a = SharedArray{Float64}(M)
    s = @sync @parallel for i=1:M; a[i]=log1p(i); end#for
    return sum(a)
end

function pmap_sum()
    M = 1000000
    r = pmap(log1p, 1:M, batch_size=ceil(Int,M/nworkers()))
    return sum(r)
end

function sharedarray_mapreduce()
    M = 1000000
    a=SharedArray{Float64}(M)
    s= @parallel (+) for i=1:M; a[i]=log1p(i); end#for
    return s
end

function threads_sum()
    M = 1000000
    a=Vector{Float64}(M)
    Threads.@threads for i=1:M
        a[i]=log1p(i)
    end#for
    return sum(a)
end

println("\nplain loop: ", simple_loop_sum())
println("\nsharedarray parallel: ", sharedarray_parallel_sum())
println( "\npmap: ", pmap_sum())
println("\nsharedarray reducer parallel: ", sharedarray_mapreduce())
println("\nthreads: ", threads_sum())

@btime simple_loop_sum()
#16.741 ms (2 allocations: 7.63 MiB)
@btime sharedarray_parallel_sum()
#8.571 ms (2384 allocations: 85.86 KiB)
@btime pmap_sum()
#4.120 s (7012363 allocations: 181.55 MiB)
@btime sharedarray_mapreduce()
#7.916 ms (1963 allocations: 122.11 KiB)
@btime threads_sum()
#4.039 ms (3 allocations: 7.63 MiB)

The pmap version is really absurd but I am not sure why.

change the batch size

Ok now significantly less absurd:

function pmap_sum()
	M = 1000000
	r = pmap(log1p, 1:M, batch_size=ceil(Int,M/nworkers()))
	return sum(r)
end
@btime pmap_sum()
#4.120 s (7012363 allocations: 181.55 MiB)

I don’t think it should matter for this case, but try adding a caching pool.

One thing I’ve been wondering about is the difference between multithreading and parallel processing. Can someone explain when I should use the former and when the latter, or point me to some reference? Up till now I’ve just been using threading because it seems like a simpler way of getting a performance increase. I tried doing some parallel calculations once, but that was severely lacking because I couldn’t define a SharedArray for my own types, and I had around 400 MB of data being used in each of the parallel processes/threads.

Multiprocessing, which we’re calling parallel here, is not shared memory. It can be distributed across multiple computers (nodes of an HPC). You don’t want to use a SharedArray unless you have to: you should limit the amount of data that is being shared and be careful about exchanging data. But it scales to much larger problems since you can use thousands/millions/billions of cores.
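
For example, instead of filling a SharedArray you can keep everything worker-local and only ship the reduced result back to the master; a sketch in the style of the benchmarks above (the function name is just for illustration):

function distributed_reduce_sum()
    M = 1000000
    # each worker computes a partial sum over its chunk locally;
    # only the Float64 partial results travel between processes
    return @parallel (+) for i=1:M
        log1p(i)
    end#for
end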

Right, that was kind of what I was thinking too, thanks for the clarification! Basically, whenever I can partition my data so that each piece is a standalone portion of the final problem, parallel processing would be a good idea, whereas if all parts of the divisible problem need the same/full set of data, threading would be a better idea. Is that right?

Kind of. It’s more like: if you’re on a single shared-memory machine (i.e. one computer), you should probably use multithreading. Anything else needs multiprocessing. (Though there can be some extra complications.)

If that’s what you mean, it doesn’t really help.

function pmap_sum()
   M = 1000000
   r = pmap(CachingPool(workers()),log1p, 1:M, batch_size=ceil(Int,M/nworkers()))
   return sum(r)
end
@btime pmap_sum()
#4.379 s (7014529 allocations: 181.59 MiB)

mohamed: I am changing the solution to your answer. if you see a “caching pool” improvement, please edit to add it to your previous answer. for now, can you please edit your earlier post to add the batch_size version to the plain one? regards, /iaw

there is something else I do not understand.

why is mohamed’s simple_loop_sum() slower than the other, parallel versions, specifically sharedarray_parallel_sum(), even with one process (nprocs()==1)?? (I also confirmed it on my own computer.)

I cannot reproduce what you describe. These are the timings with a single process:

julia> @btime simple_loop_sum();
  12.584 ms (2 allocations: 7.63 MiB)

julia> @btime sharedarray_parallel_sum();
  14.997 ms (209 allocations: 10.00 KiB)

julia> @btime pmap_sum();
  2.154 s (7999569 allocations: 152.59 MiB)

julia> @btime sharedarray_mapreduce();
  15.281 ms (205 allocations: 9.72 KiB)

julia> @btime threads_sum()
  12.563 ms (3 allocations: 7.63 MiB)