Simple Parallel Examples for Embarrassingly Simple Problems

thank you, pasha. indeed, with the @everywhere in front of logp1, it works. (of course, the obvious log(1.0+1:M) was the way to calculate this; I just needed some function for test purposes.)

thank you, ksmcreynolds. the @sync before the @parallel solves the collection problem. and addprocs(Int) allows dynamic use of number of processors. for reference, please see corrected example below.

ksmcreynolds: one strange aspect is that it

julia> nprocs()
6

julia> rmprocs(3)
Task (done) @0x0000000126edb850

julia> nprocs()
5

and further changes do not seem to matter.

pmap seems so slow that it is nearly unusable. this is not because of my example above (where, of course, the work per call is small relative to the overhead), but because my “mental benchmark” is the equivalent slow R function, which typically copies the entire Unix process (yes!) and still manages:

> t <- Sys.time(); x=mclapply( 1:1000000, function(x) log(1.0+x) );
> print(Sys.time()-t)
Time difference of 0.5312 secs

in contrast julia took 50 seconds on pmap when multiple cores were available to processes, and 5 seconds when only 1 core was available (presumably, with the process internally falling back on not spawning processes).

Just FYI, Base.log1p exists.

And it’s more accurate for small x.

julia> log(1+1e-17)
0.0

julia> log1p(1e-17)
1.0e-17
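
A short sketch of why the two calls differ, with illustrative variable names: near 1.0, neighbouring Float64 values are about 2.2e-16 apart (`eps(1.0)`), so `1 + 1e-17` rounds to exactly `1.0` before `log` ever sees it. `log1p` avoids forming that sum.

```julia
x = 1e-17

# Near 1.0, adjacent Float64 values are eps(1.0) ≈ 2.2e-16 apart,
# so adding 1e-17 to 1.0 is rounded away entirely.
lost = (1 + x == 1.0)

naive    = log(1 + x)   # effectively log(1.0): the small term is already gone
accurate = log1p(x)     # evaluates log(1 + x) without forming 1 + x

println("eps(1.0)     = ", eps(1.0))
println("1 + x == 1.0 : ", lost)
println("log(1 + x)   = ", naive)
println("log1p(x)     = ", accurate)
```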

And there is Base.Math.JuliaLibm.log1p, which is a bit better.

deleted in favor of mohamed’s post.

Don’t time in global scope; use a function instead. And use @btime from BenchmarkTools.jl for more accurate benchmarking results. The above benchmarks are highly questionable. In global scope, variables are not type-stable, so you are timing the slow version of Julia. Also, if you run the function only once, you include the compilation time and allocations of the functions used in your code. Threads.@threads is also likely to beat all the above approaches in this case. Here is a refined set of benchmarks with 8 threads:

using BenchmarkTools

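# Note: Vector{Float64}(M) and @parallel are pre-1.0 Julia syntax.

# Baseline: fill a plain vector serially, then sum it.
function simple_loop_sum()
    M = 1000000
    n = Vector{Float64}(M)
    for i=1:M; n[i] = log1p(i); end#for
    return sum(n)
end#function

# Worker processes fill a SharedArray; @sync waits until all of them finish.
function sharedarray_parallel_sum()
    M = 1000000
    a = SharedArray{Float64}(M)
    s = @sync @parallel for i=1:M; a[i]=log1p(i); end#for
    return sum(a)
end#function

# pmap with large batches, to amortize the per-call messaging cost.
function pmap_sum()
    M = 1000000
    r = pmap(log1p, 1:M, batch_size=ceil(Int,M/nworkers()))
    return sum(r)
end#function

# @parallel with a (+) reducer: each worker sums the values of its own chunk.
function sharedarray_mapreduce()
    M = 1000000
    a = SharedArray{Float64}(M)  # assumed: `a` was left undefined in the posted code
    s= @parallel (+) for i=1:M; a[i]=log1p(i); end#for
    return s
end#function

# Threads write into one ordinary vector inside a single process.
function threads_sum()
    M = 1000000
    a = Vector{Float64}(M)
    Threads.@threads for i=1:M
        a[i] = log1p(i)
    end#for
    return sum(a)
end#function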

println("\nplain loop: ", simple_loop_sum())
println("\nsharedarray parallel: ", sharedarray_parallel_sum())
println( "\npmap: ", pmap_sum())
println("\nsharedarray reducer parallel: ", sharedarray_mapreduce())
println("\nthreads: ", threads_sum())

@btime simple_loop_sum()
#16.741 ms (2 allocations: 7.63 MiB)
@btime sharedarray_parallel_sum()
#8.571 ms (2384 allocations: 85.86 KiB)
@btime pmap_sum()
#4.120 s (7012363 allocations: 181.55 MiB)
@btime sharedarray_mapreduce()
#7.916 ms (1963 allocations: 122.11 KiB)
@btime threads_sum()
#4.039 ms (3 allocations: 7.63 MiB)
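
The point about compilation time can be seen without BenchmarkTools, as a rough sketch. `loop_sum` is a hypothetical helper written for this illustration (in Julia 1.x syntax), and `@elapsed` is only a coarse stand-in for `@btime`: the first call pays for JIT compilation, the second does not.

```julia
# Hypothetical helper for illustration; not one of the functions benchmarked above.
function loop_sum(M)
    s = 0.0
    for i in 1:M
        s += log1p(i)
    end
    return s
end

t_first  = @elapsed loop_sum(1_000_000)   # includes JIT compilation of loop_sum
t_second = @elapsed loop_sum(1_000_000)   # runs the already compiled code

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

`@btime` goes further by running the function many times and reporting the minimum, which is why it is preferred for numbers that get posted.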

The pmap version is really absurd but I am not sure why.

change the batch size

Ok now significantly less absurd:

function pmap_sum()
	M = 1000000
	# a large batch_size sends chunks of the range instead of one element per call
	r = pmap(log1p, 1:M, batch_size=ceil(Int,M/nworkers()))
	return sum(r)
end

@btime pmap_sum()
#4.120 s (7012363 allocations: 181.55 MiB)
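
For readers on Julia 1.x, here is a minimal sketch of the same batching idea. It assumes the `Distributed` standard library (where `pmap` lives since 1.0) and uses a smaller `M` than the thread so it runs quickly. With the default `batch_size=1`, every element is a separate call to a worker; `batch_size` sets an upper bound on how many elements are grouped into one call.

```julia
using Distributed   # standard library in Julia 1.x

M = 10_000

# Upper bound on the number of elements sent to a worker in one call.
batch = ceil(Int, M / nworkers())

# Same computation as in the thread, with the elements grouped into batches.
r = pmap(log1p, 1:M; batch_size=batch)

total = sum(r)
println("sum = ", total)
```

Without any `addprocs` call this runs on the master process alone, so it only checks correctness, not speed.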

I don’t think it should matter for this case, but try adding a caching pool.

One thing I’ve been wondering about is the difference between multithreading and parallel processing. Can someone explain to me when I should use the former and when the latter, or point me to some reading? Up till now I’ve just been using threading because it seems like a simpler way of getting a performance increase. I tried doing some parallel calculations once, but that fell short because I couldn’t define a SharedArray for my own types, and I had around 400 MB of data being used in each of the parallel processes/threads.

Multiprocessing, which we’re calling parallel here, is not shared memory. It can be distributed across multiple computers (nodes of an HPC). You don’t want to use a SharedArray unless you have to: you should limit the amount of data that is being shared and be careful about exchanging data. But it scales to much larger problems since you can use thousands/millions/billions of cores.

Right, that was kind of what I was thinking too, thanks for the clarification! Basically, whenever I can partition my data so that each part is a standalone portion of the final problem, parallel processing would be a good idea, whereas if all parts of the divisible problem need the same/full set of data, threading would be a better idea. Is that right?

Kind of. It’s more like, if you’re on a single shared memory machine (i.e. one computer) you should probably use multithreading. Anything else needs multiprocessing. (Though there can be some extra complications)
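
A small sketch of the distinction, written for Julia 1.x (where the thread's `@parallel` is called `@distributed` and lives in the `Distributed` standard library). The variable names and the smaller `M` are illustrative choices, not taken from the posts above.

```julia
using Distributed   # standard library in Julia 1.x

M = 100_000

# Multithreading: all threads live in one process and write into the same array,
# so nothing has to be copied or sent anywhere.
a = Vector{Float64}(undef, M)
Threads.@threads for i in 1:M
    a[i] = log1p(i)
end
threaded_total = sum(a)

# Multiprocessing: each worker process handles its own chunk of the range and
# only its partial sum is sent back to the master process.
distributed_total = @distributed (+) for i in 1:M
    log1p(i)
end

println(threaded_total, "  ", distributed_total)
```

The threaded loop needs no data movement but is limited to one machine; the distributed loop only moves the partial sums, which is what lets it spread over several machines.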

If that’s what you mean, it doesn’t really help.

function pmap_sum()
   M = 1000000
   # same as before, but dispatching through a CachingPool over all workers
   r = pmap(CachingPool(workers()),log1p, 1:M, batch_size=ceil(Int,M/nworkers()))
   return sum(r)
end

@btime pmap_sum()
#4.379 s (7014529 allocations: 181.59 MiB)

mohamed—I am changing the solution to your answer. if you see a “caching pool” improvement, please edit to add it to your previous answer. for now, can you please edit your earlier post to add the batch_size version to the plain one? regards, /iaw

there is something else I do not understand.

why is mohamed’s simple_loop_sum() slower than the other parallel versions, specifically sharedarray_parallel_sum(), even with one processor (nprocs()==1)? (I also confirmed it on my own computer.)

I cannot observe what you are saying. These are the timings for 1 processor.

julia> @btime simple_loop_sum();
  12.584 ms (2 allocations: 7.63 MiB)

julia> @btime sharedarray_parallel_sum();
  14.997 ms (209 allocations: 10.00 KiB)

julia> @btime pmap_sum();
  2.154 s (7999569 allocations: 152.59 MiB)

julia> @btime sharedarray_mapreduce();
  15.281 ms (205 allocations: 9.72 KiB)

julia> @btime threads_sum()
  12.563 ms (3 allocations: 7.63 MiB)

I stand corrected. I can no longer replicate it, either. how strange. it was not a read error. oh well, let’s just ignore. thx, m.

rmprocs(3) removes processor number 3, not three processors. Before, you have processors 1 through 6. After, you have processors 1, 2, 4, 5, and 6, which is a total of 5.
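
The id-versus-count behaviour can be reproduced with a short sketch in Julia 1.x syntax, assuming the `Distributed` standard library and a machine that can start three local workers. In 1.x `rmprocs` waits for the removal by default, whereas the session earlier in the thread got a `Task` back.

```julia
using Distributed   # standard library in Julia 1.x

addprocs(3)               # start three local worker processes
before = nprocs()         # master process plus all workers
target = workers()[2]     # pick one worker id to remove

rmprocs(target)           # the argument is a process id, not a count
after = nprocs()

println("ids left: ", procs())
println(before, " -> ", after)
```

To remove several workers at once, pass several ids, for example `rmprocs(workers())` to remove them all.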

My output is below. on my machine, threads have no effect:

| Method | 1 nproc | 2 nproc | 4 nproc | 8 nproc | comments |
|---|---|---|---|---|---|
| Non Parallel | 0.011 | 0.011 | 0.011 | 0.011 | as expected, constant |
| Shared Array | 0.014 | 0.011 | 0.008 | 0.008 | good parallelism |
| Shared Array, Mapreduce | 0.014 | 0.008 | 0.005 | 0.004 | excellent parallelism |
| Threads | 0.011 | 0.011 | 0.011 | 0.011 | no effect of parallelism |
| Pmap (Default) | 3.968 | 52.120 | 45.869 | 46.113 | from bad to worse |
| Pmap (Batch_Size) | 3.932 | 4.269 | 4.544 | 4.382 | from bad to bad |

and with a longer function

## const M = 100
## @everywhere function longerfun(x::Int64)::Float64 xs=0.0; for o=1:(500^2); xs+= log(1.0+x); end; sqrt(xs); end#function

| Method | 1 nproc | 2 nproc | 4 nproc | 8 nproc | comments |
|---|---|---|---|---|---|
| Non Parallel | 197 | 197 | 197 | 197 | as expected, constant |
| Shared Array | 197 | 102 | 55 | 31 | good parallelism |
| Shared Array, Mapreduce | 197 | 102 | 55 | 30 | good parallelism |
| Threads | 197 | 197 | 197 | 197 | no effect of parallelism |
| Pmap (Default) | 198 | 112 | 59 | 30 | good parallelism |
| Pmap (Batch_Size) | 198 | 104 | 55 | 29 | good parallelism |

which I will also keep at .

Does anyone know why threads has no parallel features?
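
One possible explanation, offered as an assumption about the setup rather than a diagnosis of that machine: the number of threads is fixed when Julia starts (through the `JULIA_NUM_THREADS` environment variable, or the `-t`/`--threads` flag in Julia 1.5 and later) and is independent of `nprocs()`. If Julia was started without it, `Threads.nthreads()` is 1 and `Threads.@threads` runs serially no matter how many worker processes are added. A quick check, in Julia 1.x syntax:

```julia
# Report how many threads this Julia session was started with.
n = Threads.nthreads()
println("threads available: ", n)

if n == 1
    println("Threads.@threads will run serially; restart Julia with JULIA_NUM_THREADS set.")
end

# Record which thread handled each iteration, to see whether the work is spread out.
ids = zeros(Int, 16)
Threads.@threads for i in 1:16
    ids[i] = Threads.threadid()
end
println("thread ids used: ", sort(unique(ids)))
```

If only a single thread id is printed, the flat Threads row in the tables above is what one would expect.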