 # Parallel is very slow

#1

Hi all,

Can anyone tell me why parallel is so slow?

```julia
@everywhere using DistributedArrays

n = parse(Int, ARGS[1])

println("One")
@time x = [sin(i) + cos(i) for i = 1:n];

println("Two")
@time y = @DArray [sin(i) + cos(i) for i = 1:n];

println("Three")
@time z = @parallel vcat for i = 1:n
    sin(i) + cos(i)
end;

println(isapprox(x, y))
println(isapprox(y, z))
```

#2

Timing in global scope is bad; you should put this code in a function. You are also timing compilation.
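The difference is easy to demonstrate with a minimal sketch (my own illustration, not the original poster's code): wrap the work in a function, call it once to compile, and only then time it.

```julia
# Sketch: warming up a function before timing it avoids measuring
# JIT compilation, which @time in global scope otherwise includes.
build(n) = [sin(i) + cos(i) for i = 1:n]

build(10)               # warm-up call: triggers compilation
@time x = build(10^6)   # subsequent calls time only the actual work
```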

#3

More than any of that using multiple parallel processes to do a trivial computation like `sin(i) + cos(i)` is never going to give a speedup. The cost of sending data between processes dwarfs the cost of doing the computation. For distributed computing, the work done by the workers needs to be non-trivial in order to get any benefit.
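A rough back-of-envelope model makes the point (the costs below are illustrative assumptions, not measurements): with `p` workers, per-task work `w`, and per-task communication cost `c`, the best-case speedup is about `w*p / (w + p*c)`, which is far below 1 when `w` is much smaller than `c`.

```julia
# Back-of-envelope model of parallel speedup (assumed costs, in seconds):
comm    = 50e-6    # assumed per-task inter-process overhead (~50 µs)
trivial = 0.1e-6   # sin(i) + cos(i): a small fraction of a microsecond
heavy   = 10e-3    # e.g. an FFT over a long vector

# Best-case speedup with p workers and per-task communication cost c:
speedup(w, c, p) = w * p / (w + p * c)

speedup(trivial, comm, 4)   # ≈ 0.002 -- far slower than serial
speedup(heavy, comm, 4)     # ≈ 3.9   -- close to the worker count
```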

#4

Thank you very much. I changed the test code, but the parallel version is still very slow.

```julia
@everywhere function acf(x)
    N = length(x)
    # zero-pad x
    x = [x; zeros(x)]

    d = N*ones(N) - collect(0:N-1)
    return real.(ifft(fft(x) .* conj(fft(x)))[1:N] ./ d)
end

function test_ser(dat)
    x = [acf(dat[1:end, i]) for i = 1:size(dat, 2)]
    return x
end

function test_par(dat)
    z = @parallel vcat for i = 1:size(dat, 2)
        acf(dat[1:end, i])
    end
    return z
end

m = parse(Int, ARGS[1])
n = parse(Int, ARGS[2])

dat0 = rand(10, 3)
test_ser(dat0)
test_par(dat0)

dat = rand(m, n)

print("Ser: ")
@time x = test_ser(dat)
println()

print("Par: ")
@time z = test_ser(dat)
println()

print("x==z? ")
println(isapprox(x, z))
```

#5

That doesn’t seem to match what the documentation says. See https://docs.julialang.org/en/stable/manual/parallel-computing/#Parallel-Map-and-Loops-1, last paragraph:

> `@parallel for` can handle situations where each iteration is tiny, perhaps merely summing two numbers.

When I was trying out Julia’s parallel computing capabilities a few weeks ago, I was also surprised by a lack of performance, both with `@parallel` and with `@threads`. For `@threads`, it turned out to be completely due to the closure performance issue (see Parallelizing for loop in the computation of a gradient). I didn’t look into the details of `@parallel`, but does it also generate a closure?

#6

The documentation says “can handle”; I don’t think that means “can do it faster”.

#7

Ah come on, if ‘can handle’ were really used in that weak form, it should apply equally to `pmap`.

#9

FWIW, with `Threads.@threads` I do get a significant speedup on 0.7:

```julia
f1(n) = [sin(i) + cos(i) for i = 1:n]

function f2(n)
    x = Vector{Float64}(uninitialized, n)
    Threads.@threads for i = 1:n
        x[i] = sin(i) + cos(i)
    end
    x
end
```
```
julia> using BenchmarkTools

julia> @btime f1(10000);
  230.182 μs (2 allocations: 78.20 KiB)

julia> @btime f2(10000);
  63.259 μs (3 allocations: 78.23 KiB)
```

with `export JULIA_NUM_THREADS=4`.

#10

Note that you’re running `test_ser` twice here; you’re never timing `test_par`.

#11

`@parallel` statically partitions work across processes and thus inter-process communication is minimized (data is sent once at the beginning, result is received when finished).

`pmap` communicates with processes at each iteration.

So yes `@parallel` can handle iterations with small workload, but there is a “once off” cost that needs to be amortized over the whole computation. Thus you might need a large number of iterations to see the pay-off.

`pmap` on the other hand is not a good choice for small workloads because inter-process overhead occurs for each iteration. For computationally-heavy iterations, the advantage is that `pmap` uses dynamic load-balancing, which is particularly useful where the load cannot be evenly partitioned statically.
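The load-balancing difference can be sketched as follows (my own illustration; `heavy` is a made-up uneven workload, not from the thread):

```julia
using Distributed   # needed on Julia 1.0+; on 0.6, pmap/@everywhere are in Base

# Hypothetical uneven workload: task i costs roughly O(i^2).
@everywhere heavy(i) = sum(sin, 1:i^2)

# pmap hands out one task at a time, so a worker that finishes a
# cheap early task immediately picks up an expensive later one.
result = pmap(heavy, 1:200)

# @parallel would instead split 1:200 into equal-length chunks up
# front, leaving the worker holding the last chunk with most of the work.
```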

On my machine with 24 workers, here’s the speedup of calculating the sum of `f(x) = sin(x) + cos(x)` for `x = 1:N`, for varying N, using `@parallel`. The parallel computation is slower up to a threshold of around N = 10^5, after which the speedup becomes significant.

| N | Speedup |
| --- | --- |
| 10^1 | 0.00030512 |
| 10^2 | 0.00194421 |
| 10^3 | 0.0180223 |
| 10^4 | 0.197917 |
| 10^5 | 1.69647 |
| 10^6 | 11.9257 |
| 10^7 | 18.2981 |
| 10^8 | 21.3556 |

Here’s the code:

```julia
addprocs(24)
@everywhere f(x) = sin(x) + cos(x)

function serial(n)
    s = 0.0
    for x = 1:n
        s += f(x)
    end
    return s
end

function parallel(n)
    @parallel (+) for x = 1:n
        f(x)
    end
end

using BenchmarkTools

N = 8
trial = Array{BenchmarkTools.Trial}(N, 2)
times = Array{Float64}(N, 2)
ratios = Array{Float64}(N)

for i = 1:N
    n = 10^i
    @assert isapprox(serial(n), parallel(n))
    trial[i,1] = @benchmark serial($n)
    trial[i,2] = @benchmark parallel($n)
    times[i,1] = median(trial[i,1].times)
    times[i,2] = median(trial[i,2].times)
    ratios[i] = times[i,1] / times[i,2]
end

println(ratios)
```

#12

Thanks for the writeup. I do wonder though why the constant factor is so large for `@parallel`. With `@threads`, the constant cost is clearly much smaller.

#13

Nice! I assume you have 24 physical cores, 48 logical?

```
julia> nprocs()
17

julia> println(ratios)
[0.000209611, 0.00129174, 0.0141181, 0.15285, 1.37761, 4.99001, 9.27346, 13.9483]
```
```
julia> nprocs()
33

julia> println(ratios)
[0.000225118, 0.00054258, 0.00585587, 0.0602492, 0.639873, 4.08709, 12.308, 16.3199]
```

Single core did best until 10^5, where 16 cores hit 1.38 (versus your 1.7), while 32 was stuck at 0.64. At 10^7, the 32 beat it, and actually peaked at a ratio higher than the number of logical cores.

I assumed so because your speedup pattern looked similar to my `addprocs(16)` run.

#14

Correct. I should have stated that.

#15

```julia
using Base.Threads, BenchmarkTools

f(x) = sin(x) + cos(x)

function serial(n)
    s = 0.0
    for x = 1:n
        s += f(x)
    end
    return s
end

function threads(n)
    res_vec = Vector{Float64}(uninitialized, nthreads())
    @threads for id = 1:nthreads()
        res_vec[id] = local_sum(id, n, nthreads())
    end
    sum(res_vec)
end
function local_sum(id, n, nthread)
    out = 0.0
    l = 1 + div(n * (id-1), nthread)
    u = div(n * id, nthread)
    for x ∈ l:u
        out += f(x)
    end
    out
end

N = 8
trial = Array{BenchmarkTools.Trial}(N, 2)
times = Array{Float64}(N, 2)
ratios = Array{Float64}(N)

for i = 1:N
    n = 10^i
    trial[i,1] = @benchmark serial($n)
    trial[i,2] = @benchmark threads($n)
    times[i,1] = median(trial[i,1].times)
    times[i,2] = median(trial[i,2].times)
    ratios[i] = times[i,1] / times[i,2]
end

println(ratios)
```
```
[0.0346708, 0.350999, 2.9688, 12.7841, 16.0687, 17.1916, 16.1807, 18.0311]
```

This is much better than what I got for parallelism:

```
julia> println(ratios)
[0.000225118, 0.00054258, 0.00585587, 0.0602492, 0.639873, 4.08709, 12.308, 16.3199]
```

#16

Yeah, exactly.

You really shouldn’t have to manually split up the range with `@threads` though, the macro does that for you. Sure, there is this argument for splitting up manually: Parallelizing for loop in the computation of a gradient, but in this particular case I don’t think there’s any inference issue, at least on 0.7.

#17

I split it up because there is no reduction option like for the parallel loop, and I didn’t want to mess with atomics or worry about false sharing.
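For comparison, the atomic alternative would look something like this (my own sketch; correct, but likely much slower for cheap `f`, since every iteration contends on one shared accumulator):

```julia
using Base.Threads

f(x) = sin(x) + cos(x)

# Sketch: threaded reduction via an atomic accumulator. Every
# atomic_add! serializes on `total`, so per-thread partial sums
# usually win for cheap per-iteration work.
function threads_atomic(n)
    total = Atomic{Float64}(0.0)
    @threads for x = 1:n
        atomic_add!(total, f(x))
    end
    total[]
end
```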

#18

Because `@parallel` uses inter-process communication (IPC), which is vastly more expensive than communication between shared-memory threads.

The flip side is that IPC is much more scalable, since if you have hundreds or thousands of processors in a cluster, supercomputer, or cloud server, there is no efficient way for them to implement shared memory.
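The contrast shows up even in a tiny sketch (my own illustration): threads can all write into one shared array directly, while worker processes would each need the data serialized to them and the results serialized back.

```julia
# Threads share one address space: every thread writes directly
# into `a`, with no message passing or serialization involved.
a = zeros(Threads.nthreads())
Threads.@threads for i = 1:Threads.nthreads()
    a[i] = 1.0
end
# Worker processes, by contrast, have separate address spaces, so the
# same update would require sending data over IPC and merging results.
sum(a)  # == Threads.nthreads()
```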