Thank you, pasha. Indeed, with the @everywhere in front of logp1, it works. (Of course, the obvious log.(1.0 .+ (1:M)) was the way to calculate this; I just needed some function for test purposes.)
Thank you, ksmcreynolds. The @sync before the @parallel solves the collection problem, and addprocs(n) allows choosing the number of worker processes at run time. For reference, please see the corrected example below.
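Roughly, the corrected pattern looks like this (just a sketch; the logp1 body, M, and the worker count of 4 are placeholders based on the earlier discussion):

addprocs(4)                            # add worker processes; the count can be chosen at run time
@everywhere logp1(x) = log(1.0 + x)    # define the test function on every worker
M = 1000000
a = SharedArray{Float64}(M)
@sync @parallel for i = 1:M            # @sync waits until all workers have finished
    a[i] = logp1(i)
end
println(sum(a))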
pmap seems so incredibly slow that it is nearly unusable. This is not because of my example above (where, of course, the per-element operation is tiny relative to the overhead), but because my “mental benchmark” is the equivalent slow R function, which typically copies the entire Unix process (yes!) and still manages this:
> t <- Sys.time(); x=mclapply( 1:1000000, function(x) log(1.0+x) );
> print(Sys.time()-t)
Time difference of 0.5312 secs
In contrast, Julia took 50 seconds with pmap when multiple cores were available to the processes, and 5 seconds when only one core was available (presumably because pmap then falls back internally on not spawning remote work).
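The Julia side of that comparison was a naive, global-scope timing of roughly this shape (a sketch only; logp1 is the stand-in test function from above):

addprocs(4)                            # several worker processes available
@everywhere logp1(x) = log(1.0 + x)    # stand-in test function
@time x = pmap(logp1, 1:1000000)       # this kind of call took ~50 seconds for me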
Don’t time in global scope; use a function instead. And use @btime from BenchmarkTools.jl to get more accurate benchmarking results. The benchmarks above are highly questionable: in global scope, variables are type-unstable, so you are timing the slow version of Julia, and if you run the function only once you are also including the compilation time and allocations of the functions used in your code. Threads.@threads is also likely to beat all of the above approaches in this case. Here is a refined set of benchmarks with 8 threads:
using BenchmarkTools

# plain serial loop
function simple_loop_sum()
    M = 1000000
    n = Vector{Float64}(M)
    for i = 1:M
        n[i] = log1p(i)
    end
    return sum(n)
end

# @parallel loop writing into a SharedArray, followed by a serial sum
function sharedarray_parallel_sum()
    M = 1000000
    a = SharedArray{Float64}(M)
    @sync @parallel for i = 1:M
        a[i] = log1p(i)
    end
    return sum(a)
end

# pmap with one batch per worker
function pmap_sum()
    M = 1000000
    r = pmap(log1p, 1:M, batch_size=ceil(Int, M/nworkers()))
    return sum(r)
end

# @parallel with a (+) reducer; the SharedArray write is only a side effect
function sharedarray_mapreduce()
    M = 1000000
    a = SharedArray{Float64}(M)
    s = @parallel (+) for i = 1:M
        a[i] = log1p(i)
    end
    return s
end

# multithreaded loop over a plain Vector
function threads_sum()
    M = 1000000
    a = Vector{Float64}(M)
    Threads.@threads for i = 1:M
        a[i] = log1p(i)
    end
    return sum(a)
end

println("\nplain loop: ", simple_loop_sum())
println("\nsharedarray parallel: ", sharedarray_parallel_sum())
println("\npmap: ", pmap_sum())
println("\nsharedarray reducer parallel: ", sharedarray_mapreduce())
println("\nthreads: ", threads_sum())

@btime simple_loop_sum()
# 16.741 ms (2 allocations: 7.63 MiB)
@btime sharedarray_parallel_sum()
# 8.571 ms (2384 allocations: 85.86 KiB)
@btime pmap_sum()
# 4.120 s (7012363 allocations: 181.55 MiB)
@btime sharedarray_mapreduce()
# 7.916 ms (1963 allocations: 122.11 KiB)
@btime threads_sum()
# 4.039 ms (3 allocations: 7.63 MiB)
The pmap version is absurdly slow, but I am not sure why.
function pmap_sum()
    M = 1000000
    r = pmap(log1p, 1:M, batch_size=ceil(Int, M/nworkers()))
    return sum(r)
end

@btime pmap_sum()
# 4.120 s (7012363 allocations: 181.55 MiB)
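For comparison, a chunked variant that hands each worker a whole index range and sends back only one partial sum per chunk may behave quite differently (a sketch only; pmap_chunked_sum is not one of the benchmarks above and has not been timed here):

function pmap_chunked_sum()
    M = 1000000
    step = ceil(Int, M / nworkers())
    chunks = [i:min(i + step - 1, M) for i in 1:step:M]   # one index range per worker
    partials = pmap(r -> sum(log1p, r), chunks)           # each worker returns a single Float64
    return sum(partials)
end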
One thing I’ve been wondering about is the difference between multithreading and parallel processing. Can someone explain when I should use the former and when the latter, or point me to some reference? Up till now I’ve just been using threading because it seems like a simpler way of getting a performance increase. I tried doing some parallel calculations once, but that was severely lacking because I couldn’t define a SharedArray for my own types, and I had around 400 MB of data being used in each of the parallel processes/threads.
Multiprocessing, which we’re calling parallel here, is not shared memory. It can be distributed across multiple computers (nodes of an HPC). You don’t want to use a SharedArray unless you have to: you should limit the amount of data that is being shared and be careful about exchanging data. But it scales to much larger problems since you can use thousands/millions/billions of cores.
Right, that was kind of what I was thinking too, thanks for the clarification! Basically, whenever I can partition my data into standalone portions of the final problem, parallel processing would be a good idea, whereas if all parts of the divisible problem need the same/full set of data, threading would be a better idea. Is that right?
Kind of. It’s more like: if you’re on a single shared-memory machine (i.e. one computer), you should probably use multithreading. Anything else needs multiprocessing. (Though there can be some extra complications.)
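To make the distinction concrete, here is a small sketch of the two styles (threaded_sum and distributed_sum are just illustrative names, reusing the log1p toy problem from above):

# Multithreading: all threads share the process's memory, so a large data
# array can be read in place without copying.
function threaded_sum(data)
    partial = zeros(Threads.nthreads())
    Threads.@threads for i in eachindex(data)
        partial[Threads.threadid()] += log1p(data[i])   # each thread writes only its own slot
    end
    return sum(partial)
end

# Multiprocessing: workers are separate processes, so each chunk of data is
# copied to the worker that handles it (or shared explicitly via a SharedArray).
function distributed_sum(data)
    mid = div(length(data), 2)
    chunks = [data[1:mid], data[mid+1:end]]             # these slices get serialized to the workers
    partials = pmap(chunk -> sum(log1p, chunk), chunks)
    return sum(partials)
end

For example, threaded_sum(rand(10^6)) needs nothing beyond starting Julia with several threads, whereas distributed_sum(rand(10^6)) only pays off after addprocs(...) and at the cost of shipping the chunks to the workers.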
function pmap_sum()
    M = 1000000
    r = pmap(CachingPool(workers()), log1p, 1:M, batch_size=ceil(Int, M/nworkers()))
    return sum(r)
end

@btime pmap_sum()
# 4.379 s (7014529 allocations: 181.59 MiB)
mohamed: I am changing the solution to your answer. If you see a “caching pool” improvement, please edit your previous answer to add it. For now, can you please edit your earlier post to add the batch_size version next to the plain one? Regards, /iaw
Why is mohamed’s simple_loop_sum() slower than the other, parallel versions, specifically sharedarray_parallel_sum(), even with one processor (nprocs() == 1)? (I also confirmed this on my own computer.)
rmprocs(3) removes the worker with id 3, not three workers. Before, you have processes 1 through 6; after, you have processes 1, 2, 4, 5, and 6, which is a total of 5.
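A minimal sketch of that id-based behavior on a fresh session (the comments show the expected process ids):

addprocs(5)          # master is id 1, the new workers get ids 2 through 6
println(procs())     # [1, 2, 3, 4, 5, 6]
rmprocs(3)           # removes the worker whose id is 3, not three workers
println(procs())     # [1, 2, 4, 5, 6]  -- five processes remain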