Parallel is very slow

parallel

#1

Hi all,

Is anyone can tell me why parallel is so slow?

@everywhere using DistributedArrays

n = parse(Int,ARGS[1]) 

println("One")
@time x =  [sin(i) + cos(i) for i = 1:n];

println("Two")
@time y = @DArray [sin(i) + cos(i) for i = 1:n];

println("Three")
@time z=@parallel vcat for i= 1:n
    sin(i) + cos(i);
end;

println(isapprox(x, y))
println(isapprox(y, z))


#2

timing in global scope is bad. You should put this code in a function. Also, you are timing the compilation time.


#3

More than any of that using multiple parallel processes to do a trivial computation like sin(i) + cos(i) is never going to give a speedup. The cost of sending data between processes dwarfs the cost of doing the computation. For distributed computing, the work done by the workers needs to be non-trivial in order to get any benefit.


#4

Thank you very much. I changed the test code, but the parallel still very slow.

@everywhere function acf(x)
    N = length(x)
    # zero pad x
    x = [x; zeros(x)]

    d = N*ones(N) - collect(0:N-1)
    return real.(ifft( fft(x) .* conj(fft(x)))[1:N] ./ d)
end

function test_ser(dat)
    x =  [acf(dat[1:end, i]) for i = 1:size(dat)[2]];
    return x
end

function test_par(dat)
    z=@parallel vcat for i = 1:size(dat)[2]
        acf(dat[1:end, i])
    end
    return z 
end

m = parse(Int,ARGS[1]) 
n = parse(Int,ARGS[2]) 

dat0 = rand(10, 3)
test_ser(dat0)
test_par(dat0)

dat = rand(m, n)
 
print("Ser: ")
@time x = test_ser(dat)
println()

print("Par: ")
@time z = test_ser(dat)
println()

print("x==z? ")
println(isapprox(x, z))


#5

That doesn’t seem to match what the documentation says. See https://docs.julialang.org/en/stable/manual/parallel-computing/#Parallel-Map-and-Loops-1, last paragraph:

@parallel for can handle situations where each iteration is tiny, perhaps merely summing two numbers.

When I was trying out Julia’s parallel computing capabilities a few weeks ago, I was also surprised by a lack of performance, both with @parallel and with @threads. For @threads, it turned out to be completely due to the closure performance issue (see Parallelizing for loop in the computation of a gradient). I didn’t look into the details of @parallel, but does it also generate a closure?


#6

The documentation says “can handle” I don’t think that says “can do it faster”


#7

Ah come on, if ‘can handle’ were really used in that weak form, it should apply equally to pmap.


#9

FWIW, with Threads.@threads I do get a significant speedup on 0.7:

f1(n) = [sin(i) + cos(i) for i = 1:n]

function f2(n)
    x = Vector{Float64}(uninitialized, n)
    Threads.@threads for i = 1 : n
        x[i] = sin(i) + cos(i)
    end
    x
end
julia> using BenchmarkTools

julia> @btime f1(10000);
  230.182 μs (2 allocations: 78.20 KiB)

julia> @btime f2(10000);
  63.259 μs (3 allocations: 78.23 KiB)

with export JULIA_NUM_THREADS=4.


#10

Note that you’re running test_ser twice here; you’re never timing test_par.


#11

@parallel statically partitions work across processes and thus inter-process communication is minimized (data is sent once at the beginning, result is received when finished).

pmap communicates with processes at each iteration.

So yes @parallel can handle iterations with small workload, but there is a “once off” cost that needs to be amortized over the whole computation. Thus you might need a large number of iterations to see the pay-off.

pmap on the other hand is not a good choice for small workloads because inter-process overhead occurs for each iteration. For computationally-heavy iterations, the advantage is that pmap uses dynamic load-balancing, which is particularly useful where the load cannot be evenly partitioned statically.

On my machine with 24 workers, here’s the speedup of calculating the sum of f(x) = sin(x) + cos(x) for x = 1:N for varying N, using @parallel. The parallel computation is slower up to a threshold of around n=10^5, after which the speedup becomes significant.

N Speedup
10^1 0.00030512
10^2 0.00194421
10^3 0.0180223
10^4 0.197917
10^5 1.69647
10^6 11.9257
10^7 18.2981
10^8 21.3556

Here’s the code:

addprocs(24)
@everywhere f(x) = sin(x) + cos(x)

function serial(n)
    s = 0.0
    for x = 1:n
        s += f(x)
    end
    return s
end

function parallel(n)
    @parallel (+) for x = 1:n
        f(x)
    end
end

using BenchmarkTools

N = 8
trial = Array{BenchmarkTools.Trial}(N,2)
times = Array{Float64}(N,2)
ratios = Array{Float64}(N)

for i = 1:N
    n = 10^i
    @assert isapprox(serial(n), parallel(n))
    trial[i,1] = @benchmark serial($n)
    trial[i,2] = @benchmark parallel($n)
    times[i,1] = median(trial[i,1].times)
    times[i,2] = median(trial[i,2].times)
    ratios[i] = times[i,1] / times[i,2]
end

println(ratios)

#12

Thanks for the writeup. I do wonder though why the constant factor is so large for @parallel. With @threads, the constant cost is clearly much smaller.


#13

Nice! I assume you have 24 physical cores, 48 logical?

With a 16-core threadripper (32 threads):

julia> nprocs()
17

julia> println(ratios)
[0.000209611, 0.00129174, 0.0141181, 0.15285, 1.37761, 4.99001, 9.27346, 13.9483]
julia> nprocs()
33

julia> println(ratios)
[0.000225118, 0.00054258, 0.00585587, 0.0602492, 0.639873, 4.08709, 12.308, 16.3199]

Single core did best until 10^5, where 16 cores hit 1.38 (versus your 1.7), while 32 was stuck at 0.64. At 10^7, the 32 beat it, and actually peaked at a ratio higher than the number of logical cores.

My assumption was because your pattern looked similar to my addprocs(16).


#14

Correct. I should have stated that.


#15

With 32 threads:

using Base.Threads, BenchmarkTools

 f(x) = sin(x) + cos(x)

function serial(n)
    s = 0.0
    for x = 1:n
        s += f(x)
    end
    return s
end

function threads(n)
    res_vec = Vector{Float64}(uninitialized, nthreads())
    @threads for i ∈ 1:nthreads()
        res_vec[i] = local_sum(threadid(), n, nthreads())
    end
    sum(res_vec)
end
function local_sum(id, n, nthread)
    out = 0.0
    l = 1 + div(n * (id-1), nthread)
    u = div(n * id, nthread)
    for x ∈ l:u
        out += f(x)
    end
    out
end

N = 8
trial = Array{BenchmarkTools.Trial}(N,2)
times = Array{Float64}(N,2)
ratios = Array{Float64}(N)

for i = 1:N
    n = 10^i
    @assert isapprox(serial(n), threads(n))
    trial[i,1] = @benchmark serial($n)
    trial[i,2] = @benchmark threads($n)
    times[i,1] = median(trial[i,1].times)
    times[i,2] = median(trial[i,2].times)
    ratios[i] = times[i,1] / times[i,2]
end

println(ratios)
[0.0346708, 0.350999, 2.9688, 12.7841, 16.0687, 17.1916, 16.1807, 18.0311]

This is much better than what I got for parallelism:

julia> println(ratios)
[0.000225118, 0.00054258, 0.00585587, 0.0602492, 0.639873, 4.08709, 12.308, 16.3199]

#16

Yeah, exactly.

You really shouldn’t have to manually split up the range with @threads though, the macro does that for you. Sure, there is this argument for splitting up manually: Parallelizing for loop in the computation of a gradient, but in this particular case I don’t think there’s any inference issue, at least on 0.7.


#17

I split it up because there is no reduction option like for the parallel loop, and I didn’t want to mess with atomics or worry about false sharing.


Innefficient paralellization? Need some help optimizing a simple dot product
#18

Because @parallel uses inter-process communication (IPC), which is vastly more expensive than communication between shared-memory threads.

The flip side is that IPC is much more scalable, since if you have hundreds or thousands of processors in a cluster, supercomputer, or cloud server, there is no efficient way for them to implement shared memory.