@everywhere using DistributedArrays
n = parse(Int,ARGS[1])
println("One")
@time x = [sin(i) + cos(i) for i = 1:n];
println("Two")
@time y = @DArray [sin(i) + cos(i) for i = 1:n];
println("Three")
@time z = @parallel vcat for i = 1:n
    sin(i) + cos(i)
end;
println(isapprox(x, y))
println(isapprox(y, z))

More than anything else, using multiple parallel processes for a trivial computation like sin(i) + cos(i) is never going to give a speedup. The cost of sending data between processes dwarfs the cost of doing the computation itself. For distributed computing to pay off, the work done by the workers needs to be non-trivial.
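To make the overhead concrete, here is a minimal sketch (0.6-era syntax; `g` is a hypothetical stand-in for the trivial per-item work) that compares a local call against a single remote round trip:

```julia
addprocs(1)
@everywhere g(i) = sin(i) + cos(i)

g(1)                              # warm up local compilation
remotecall_fetch(g, 2, 1)         # warm up remote compilation
@time g(1)                        # local call: essentially free
@time remotecall_fetch(g, 2, 1)   # one remote round trip: messaging dominates
```

The second timing is orders of magnitude larger than the first even though both evaluate the same expression, which is exactly the per-message cost that trivial parallel loops cannot amortize.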

Thank you very much. I changed the test code, but the parallel version is still very slow.

@everywhere function acf(x)
    N = length(x)
    # zero-pad x to length 2N so circular convolution gives linear correlation
    x = [x; zeros(x)]
    d = N * ones(N) - collect(0:N-1)
    return real.(ifft(fft(x) .* conj(fft(x)))[1:N] ./ d)
end
function test_ser(dat)
    x = [acf(dat[:, i]) for i = 1:size(dat, 2)]
    return x
end
function test_par(dat)
    z = @parallel vcat for i = 1:size(dat, 2)
        acf(dat[:, i])
    end
    return z
end
m = parse(Int,ARGS[1])
n = parse(Int,ARGS[2])
dat0 = rand(10, 3)
test_ser(dat0)
test_par(dat0)
dat = rand(m, n)
print("Ser: ")
@time x = test_ser(dat)
println()
print("Par: ")
@time z = test_par(dat)
println()
print("x==z? ")
println(isapprox(x, z))

@parallel for can handle situations where each iteration is tiny, perhaps merely summing two numbers.

When I was trying out Julia's parallel computing capabilities a few weeks ago, I was also surprised by a lack of performance, both with @parallel and with @threads. For @threads, it turned out to be completely due to the closure performance issue (see Parallelizing for loop in the computation of a gradient - #7 by tkoolen). I didn't look into the details of @parallel, but does it also generate a closure?

FWIW, with Threads.@threads I do get a significant speedup on 0.7:

f1(n) = [sin(i) + cos(i) for i = 1:n]
function f2(n)
    x = Vector{Float64}(uninitialized, n)
    Threads.@threads for i = 1:n
        x[i] = sin(i) + cos(i)
    end
    x
end

@parallel statically partitions work across processes and thus inter-process communication is minimized (data is sent once at the beginning, result is received when finished).

pmap communicates with processes at each iteration.

So yes, @parallel can handle iterations with a small workload, but there is a "once-off" cost that needs to be amortized over the whole computation. Thus you might need a large number of iterations to see the pay-off.

pmap on the other hand is not a good choice for small workloads because inter-process overhead occurs for each iteration. For computationally-heavy iterations, the advantage is that pmap uses dynamic load-balancing, which is particularly useful where the load cannot be evenly partitioned statically.
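To illustrate the load-balancing difference, here is a hedged sketch (0.6-era syntax; `work` is a hypothetical function whose cost deliberately grows with its argument, so a static split is unbalanced):

```julia
addprocs(4)
# work(i) gets more expensive as i grows
@everywhere function work(i)
    s = 0.0
    for k = 1:i * 10^5
        s += sin(k)
    end
    return s
end

# pmap hands out one item at a time: an idle worker immediately
# picks up the next item, so the expensive high-i items spread out
results = pmap(work, 1:20)

# @parallel assigns each worker a fixed contiguous chunk up front,
# so whichever worker drew the high-i chunk becomes the straggler
total = @parallel (+) for i = 1:20
    work(i)
end
```

With evenly sized iterations the situation reverses: @parallel's single up-front partition wins because pmap pays a message per item.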

On my machine with 24 workers, here's the speedup of calculating the sum of f(x) = sin(x) + cos(x) for x = 1:N for varying N, using @parallel. The parallel computation is slower up to a threshold of around N = 10^5, after which the speedup becomes significant.

N       Speedup
10^1    0.00030512
10^2    0.00194421
10^3    0.0180223
10^4    0.197917
10^5    1.69647
10^6    11.9257
10^7    18.2981
10^8    21.3556

Here's the code:

addprocs(24)
@everywhere f(x) = sin(x) + cos(x)
function serial(n)
    s = 0.0
    for x = 1:n
        s += f(x)
    end
    return s
end
function parallel(n)
    @parallel (+) for x = 1:n
        f(x)
    end
end
using BenchmarkTools
N = 8
trial = Array{BenchmarkTools.Trial}(N, 2)
times = Array{Float64}(N, 2)
ratios = Array{Float64}(N)
for i = 1:N
    n = 10^i
    @assert isapprox(serial(n), parallel(n))
    trial[i, 1] = @benchmark serial($n)
    trial[i, 2] = @benchmark parallel($n)
    times[i, 1] = median(trial[i, 1].times)
    times[i, 2] = median(trial[i, 2].times)
    ratios[i] = times[i, 1] / times[i, 2]
end
println(ratios)

Thanks for the writeup. I do wonder though why the constant factor is so large for @parallel. With @threads, the constant cost is clearly much smaller.

Single core did best until 10^5, where 16 cores hit a speedup of 1.38 (versus your 1.7), while 32 was stuck at 0.64. At 10^7 the 32-core run pulled ahead, and it actually peaked at a ratio higher than the number of logical cores.

I assumed that was because your pattern looked similar to my run with addprocs(16).

using Base.Threads, BenchmarkTools
f(x) = sin(x) + cos(x)
function serial(n)
    s = 0.0
    for x = 1:n
        s += f(x)
    end
    return s
end
function threads(n)
    res_vec = Vector{Float64}(uninitialized, nthreads())
    @threads for i in 1:nthreads()
        res_vec[i] = local_sum(threadid(), n, nthreads())
    end
    sum(res_vec)
end
function local_sum(id, n, nthread)
    # each thread sums its own contiguous chunk of 1:n
    out = 0.0
    l = 1 + div(n * (id - 1), nthread)
    u = div(n * id, nthread)
    for x in l:u
        out += f(x)
    end
    out
end
N = 8
trial = Array{BenchmarkTools.Trial}(N, 2)
times = Array{Float64}(N, 2)
ratios = Array{Float64}(N)
for i = 1:N
    n = 10^i
    @assert isapprox(serial(n), threads(n))
    trial[i, 1] = @benchmark serial($n)
    trial[i, 2] = @benchmark threads($n)
    times[i, 1] = median(trial[i, 1].times)
    times[i, 2] = median(trial[i, 2].times)
    ratios[i] = times[i, 1] / times[i, 2]
end
println(ratios)

You really shouldn't have to manually split up the range with @threads though; the macro does that for you. Sure, there is this argument for splitting up manually: Parallelizing for loop in the computation of a gradient - #18 by saschatimme, but in this particular case I don't think there's any inference issue, at least on 0.7.
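Letting the macro partition the range might look something like this sketch (a hypothetical `threads_auto`; the per-thread accumulator slot avoids a data race at the cost of some false sharing, and this relies on threadid() being stable inside an iteration, which holds here since the body never yields):

```julia
using Base.Threads
function threads_auto(n)
    partial = zeros(nthreads())    # one accumulator slot per thread
    @threads for x = 1:n           # @threads splits 1:n across threads itself
        partial[threadid()] += sin(x) + cos(x)
    end
    return sum(partial)
end
```

This drops the hand-computed chunk bounds from local_sum entirely while computing the same reduction.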

The flip side is that inter-process communication is much more scalable: if you have hundreds or thousands of processors in a cluster, supercomputer, or cloud server, there is no efficient way to give them all shared memory.