Simple Parallel Examples for Embarrassingly Simple Problems

macOS, julia 0.6.2, Feb 2018, captain’s log: I have been playing around trying to understand (the docs on) parallelism. as an early example, I just want to create, in parallel, a list of the values of

function logp1(x::Int64)::Float64 log(1.0+x); end#function

Because I want to check the result, I add the values up at the end, but assume that I do not really need the sum-reduce.

function logp1(x::Int64)::Float64 log(1.0+x); end#function
M= 1000000
global result

println("processors: ", nprocs())
println("compile and test: ", logp1( 1 ) )

@time begin
    n=Vector{Float64}(M)
    for i=1:M; n[i]= logp1(i); end#for
    println("\nplain loop: ", sum(n))
end

@time begin
    r= pmap( logp1, 1:M )
    println( "\npmap: ", sum(r))
end

@time begin
    a=SharedArray{Float64}(M)
    s= @parallel (+) for i=1:M; a[i]=logp1(i); end#for
    println("\nsharedarray reducer parallel: ", s)
end

@time begin
    a=SharedArray{Float64}(M)
    s= @parallel for i=1:M; a[i]=logp1(i); end#for
    fetch(s)
    println("\nFAILS sharedarray parallel: ", sum(a))
end
  1. I need some basic help: the last example is right out of the docs, but it omits the basic hint of how the program is supposed to fetch the results!?
  2. can I also set the number of processors inside a running julia (à la R?), or is it only settable on startup, like julia -p 6?
  3. when I try this not just with one processor, but with more (-p 2), my pmap complains
ERROR: LoadError: On worker 2:
UndefVarError: #logp1 not defined

although the function logp1 was of course nicely defined in global scope upfront.

help appreciated. a corrected example based on the responses is posted below.

I believe you need to share your function with the workers. To do this just define it with @everywhere . That is,

@everywhere function logp1(x::Int64)::Float64 log(1.0+x); end#function

Then again, if you just want the answer, you probably don’t need parallel or a loop - you can just calculate it at once as a vector, like log.(1 .+ (1:1000))

Since you allocated the array a before the loop, you don’t need to fetch anything. In this case it simply blocks the code from continuing until the loop is complete. I would just add an @sync directly in front of the @parallel instead.
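
A minimal sketch of that fix applied to the failing example above (assuming logp1 has already been defined on the workers with @everywhere):

a = SharedArray{Float64}(M)
# @sync blocks here until every worker has finished writing its part of a
@sync @parallel for i=1:M; a[i]=logp1(i); end#for
println("sharedarray parallel: ", sum(a))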

Use addprocs(N) to add worker processes while Julia is running. rmprocs(pid) can be used to remove a worker by its process id (use procs() to see the ids of the running processes).
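
A quick sketch of that workflow at the REPL (the worker count of 2 is just illustrative):

addprocs(2)          # add two local worker processes
procs()              # ids of all running processes (the master is always id 1)
nworkers()           # number of worker processes
rmprocs(workers())   # remove all workers again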

thank you, pasha. indeed, with the @everywhere in front of logp1, it works. (of course, the obvious log.(1.0 .+ (1:M)) was the way to calculate this; I just needed some function for test purposes.)

thank you, ksmcreynolds. the @sync before the @parallel solves the collection problem. and addprocs(n) allows changing the number of worker processes dynamically. for reference, please see the corrected example below.

ksmcreynolds: one strange aspect is that

julia> nprocs()
6

julia> rmprocs(3)
Task (done) @0x0000000126edb850

julia> nprocs()
5

and further changes do not seem to matter.

pmap seems so incredibly slow that it is nearly unusable. this is not because of my example above (where, of course, the operation is tiny relative to the overhead), but because my “mental benchmark” is the equivalent slow R function, which typically copies the entire Unix process (yes!) and still manages

> t <- Sys.time(); x=mclapply( 1:1000000, function(x) log(1.0+x) );
> print(Sys.time()-t)
Time difference of 0.5312 secs

in contrast, julia took 50 seconds on pmap when multiple worker processes were available, and 5 seconds when only 1 was available (presumably because pmap then falls back to not distributing the work at all).

Just FYI, Base.log1p exists.

And it’s more accurate for small x.

julia> log(1+1e-17)
0.0

julia> log1p(1e-17)
1.0e-17

And there is Base.Math.JuliaLibm.log1p, which is a bit better.

deleted in favor of mohamed’s post.

Don’t time in global scope, use a function instead. And use @btime from BenchmarkTools.jl for more accurate benchmarking results. The above benchmarks are highly questionable: in global scope, variables are type-unstable, so you are timing the slow version of Julia. Also, if you run the function only once, you are including the compilation time and the allocations of the functions used in your code. Threads.@threads is also likely to beat all the above approaches in this case. Here is a refined set of benchmarks with 8 threads:

using BenchmarkTools

function simple_loop_sum()
    M = 1000000
    n = Vector{Float64}(M)
    for i=1:M; n[i] = log1p(i); end#for
    return sum(n)
end

function sharedarray_parallel_sum()
    M = 1000000
    a = SharedArray{Float64}(M)
    s = @sync @parallel for i=1:M; a[i]=log1p(i); end#for
    return sum(a)
end

function pmap_sum()
    M = 1000000
    r = pmap(log1p, 1:M, batch_size=ceil(Int,M/nworkers()))
    return sum(r)
end

function sharedarray_mapreduce()
    M = 1000000
    a=SharedArray{Float64}(M)
    s= @parallel (+) for i=1:M; a[i]=log1p(i); end#for
    return s
end

function threads_sum()
    M = 1000000
    a=Vector{Float64}(M)
    Threads.@threads for i=1:M
        a[i]=log1p(i)
    end#for
    return sum(a)
end

println("\nplain loop: ", simple_loop_sum())
println("\nsharedarray parallel: ", sharedarray_parallel_sum())
println( "\npmap: ", pmap_sum())
println("\nsharedarray reducer parallel: ", sharedarray_mapreduce())
println("\nthreads: ", threads_sum())

@btime simple_loop_sum()
#16.741 ms (2 allocations: 7.63 MiB)
@btime sharedarray_parallel_sum()
#8.571 ms (2384 allocations: 85.86 KiB)
@btime pmap_sum()
#4.120 s (7012363 allocations: 181.55 MiB)
@btime sharedarray_mapreduce()
#7.916 ms (1963 allocations: 122.11 KiB)
@btime threads_sum()
#4.039 ms (3 allocations: 7.63 MiB)

The pmap version is really absurd but I am not sure why.

change the batch size

Ok now significantly less absurd:

function pmap_sum()
	M = 1000000
	r = pmap(log1p, 1:M, batch_size=ceil(Int,M/nworkers()))
	return sum(r)
end
@btime pmap_sum()
#4.120 s (7012363 allocations: 181.55 MiB)

I don’t think it should matter for this case, but try adding a caching pool.

One thing I’ve been wondering about is the difference between multithreading and parallel processing. Can someone explain when I should use the former and when the latter, or point me to some reference? Up till now I’ve just been using threading because it seems like a simpler way of getting a performance increase. I tried doing some parallel calculations once, but that was severely lacking because I couldn’t define a SharedArray for my own types, and I had around 400 MB of data being used in each of the parallel processes/threads.

Multiprocessing, which we’re calling parallel here, is not shared memory. It can be distributed across multiple computers (nodes of an HPC). You don’t want to use a SharedArray unless you have to: you should limit the amount of data that is being shared and be careful about exchanging data. But it scales to much larger problems since you can use thousands/millions/billions of cores.
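
For example, instead of filling a SharedArray you can keep everything worker-local and only ship the reduced result back to the master; a sketch in the style of the benchmarks above (the function name is just for illustration):

function distributed_reduce_sum()
    M = 1000000
    # each worker computes a partial sum over its chunk locally;
    # only the Float64 partial results travel between processes
    return @parallel (+) for i=1:M
        log1p(i)
    end#for
end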

Right, that was kind of what I was thinking too, thanks for the clarification! Basically, whenever I can partition my data so that each piece is a standalone portion of the final problem, parallel processing would be a good idea, whereas if all parts of the divisible problem need the same/full set of data, threading would be a better idea. Is that right?

Kind of. It’s more like: if you’re on a single shared-memory machine (i.e. one computer), you should probably use multithreading. Anything else needs multiprocessing. (Though there can be some extra complications.)

If that’s what you mean, it doesn’t really help.

function pmap_sum()
   M = 1000000
   r = pmap(CachingPool(workers()),log1p, 1:M, batch_size=ceil(Int,M/nworkers()))
   return sum(r)
end
@btime pmap_sum()
#4.379 s (7014529 allocations: 181.59 MiB)

mohamed: I am changing the solution to your answer. if you see a “caching pool” improvement, please edit to add it to your previous answer. for now, can you please edit your earlier post to add the batch_size version to the plain one? regards, /iaw

there is something else I do not understand.

why is mohamed’s simple_loop_sum() slower than the other, parallel versions, specifically sharedarray_parallel_sum(), even with one process (nprocs()==1)?? (I also confirmed it on my own computer.)

I cannot reproduce what you describe. These are the timings with a single process:

julia> @btime simple_loop_sum();
  12.584 ms (2 allocations: 7.63 MiB)

julia> @btime sharedarray_parallel_sum();
  14.997 ms (209 allocations: 10.00 KiB)

julia> @btime pmap_sum();
  2.154 s (7999569 allocations: 152.59 MiB)

julia> @btime sharedarray_mapreduce();
  15.281 ms (205 allocations: 9.72 KiB)

julia> @btime threads_sum()
  12.563 ms (3 allocations: 7.63 MiB)