Parallel and distributed very slow

Hi!
I am trying to learn how to use parallel or distributed computing on the CPU, and I don't understand why it is slower than a sequential for loop. Since I really cannot figure out why, I am asking for your help, thank you.

using BenchmarkTools
using Test

ADD on CPU

N = 2^20
x = fill(1.0, N)
y = fill(2.0, N);
y .+= x
@test all(y .== 3.0)
Test Passed
function sequential_add!(y, x) #add x to y
    for i in eachindex(y,x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y,x)
        @inbounds y[i] += x[i]
    end
    return nothing
end
Threads.nthreads()
8
using Distributed
nprocs() == 1 && addprocs()

@everywhere using SharedArrays
@everywhere begin
    using Test
    using BenchmarkTools
end
y_shar = SharedArray{Float64}(N)
y_shar .= 2.0

function distributed_add!(y,x)
    @sync @distributed for i in 1:length(x) 
        @inbounds y[i] += x[i]
    end
    return nothing
end
distributed_add! (generic function with 1 method)
y_shar .= 2.0
distributed_add!(y_shar, x)
@test all(y_shar .== 3.0)
Test Passed
fill!(y,2.0)
sequential_add!(y, x)
@test all(y .== 3.0)

fill!(y,2.0)
parallel_add!(y,x)
@test all(y .== 3.0)
Test Passed
function add_cpu_bench!(y,x)
    y .+= x
    return nothing
end
add_cpu_bench! (generic function with 1 method)
fill!(y,2.0)
@btime add_cpu_bench!($y,$x)
  610.000 μs (0 allocations: 0 bytes)
fill!(y,2.0)
@btime sequential_add!($y,$x)
  611.100 μs (0 allocations: 0 bytes)
fill!(y,2.0)
@btime parallel_add!($y,$x)
  608.800 μs (40 allocations: 4.11 KiB)
y_shar .= 2.0
@btime distributed_add!($y_shar,$x)

28.836 ms (1288 allocations: 60.72 KiB)

I haven’t used SharedArrays before, but my guess is that since the individual calculations you are doing (a single add) are very small, the overhead becomes larger than the gain from running it in parallel.

As to why the other benchmarks are so similar: it seems reasonable for the dotted version (add_cpu_bench!) and the sequential one, since the dot syntax basically creates the same loop in the background (see here). Why the threaded version matches them I'm not sure; one likely explanation is that a plain elementwise add is memory-bound, so extra cores don't help once memory bandwidth is saturated. Trying it myself, I can see that the threaded version uses all cores fully while the others load only one core, but the timing is very similar for me as well.
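To see the memory-bound effect, here is a minimal sketch (my own experiment, not from this thread) where each element costs more arithmetic than a single add; with enough work per element, Threads.@threads should show a real speedup:

using BenchmarkTools

# Each element now does many floating-point operations, so the loop
# is limited by compute rather than by memory bandwidth.
function sequential_heavy!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] = sqrt(abs(sin(x[i])^3 + cos(x[i])^3))
    end
    return nothing
end

function threaded_heavy!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] = sqrt(abs(sin(x[i])^3 + cos(x[i])^3))
    end
    return nothing
end

xh = rand(2^20); yh = similar(xh)
@btime sequential_heavy!($yh, $xh)
@btime threaded_heavy!($yh, $xh)   # should scale with the number of cores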

Thank you for your quick answer.
I was expecting different benchmarks for the threaded and sequential adds because I am following this introduction: Introduction · CUDA.jl, and I have an i7-7700 with 8 threads.

I am not sure, but fill might try to be clever by using a single value for all entries.
Does the timing change when you use an array of random numbers?
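For example, a minimal sketch of that experiment:

x = rand(N)    # random inputs instead of a constant fill
y = rand(N)
@btime sequential_add!($y, $x)
@btime parallel_add!($y, $x)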

It's possible that the performance is affected by false sharing. You'll need to make sure that each thread accesses contiguous sections of the data, either by splitting the entire array into contiguous subarrays that are passed to each thread, or by working on the local part of a SharedArray on each process (the distributed for loop might require processes to send data to each other, which is expensive). A sketch of the chunking idea follows below.
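Here is a minimal sketch of the chunking approach, assuming Julia 1.3+ for Threads.@spawn (the function name chunked_add! is my own):

using Base.Threads

function chunked_add!(y, x)
    nt = nthreads()
    # Split the index range into nt contiguous blocks; each task owns
    # one block, so threads don't write to neighbouring cache lines.
    chunks = Iterators.partition(eachindex(y, x), cld(length(y), nt))
    @sync for idxs in chunks
        Threads.@spawn for i in idxs
            @inbounds y[i] += x[i]
        end
    end
    return nothing
end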

I tried removing fill and used random arrays instead; nothing changes.

Thank you. Unfortunately I have no idea how to do that, so I tried

@show procs(y_shar)

it returns

procs(y_shar) = [2, 3, 4, 5, 6, 7, 8, 9]

then, in the for loop of my function distributed_add!(), I tried

println(indexpids(y))

and it gives me this:

      From worker 3:	2
      From worker 2:	1
      From worker 5:	4
      From worker 3:	2
      From worker 2:	1
      From worker 5:	4
      From worker 3:	2
      From worker 2:	1
      From worker 3:	2
      From worker 5:	4
      From worker 2:	1
      From worker 2:	1
      From worker 3:	2
      From worker 3:	2
      From worker 3:	2
      From worker 5:	4
      From worker 5:	4
      From worker 5:	4
      From worker 5:	4
      From worker 5:	4
      From worker 2:	1
      From worker 2:	1
      From worker 2:	1
      From worker 2:	1
      From worker 2:	1
      From worker 2:	1

I have no idea what I am doing!

EDIT:
I found something interesting here: Multi-processing and Distributed Computing · The Julia Language

The advection_shared!(q, u) example there looks relevant; I will check it, thank you.
I am a bit disappointed with Distributed, I thought it would be easier ^^".
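Following that manual example, here is a hedged sketch of what a local-part version of the add could look like (the names add_local_chunk! and shared_add! are my own, modeled on the manual's advection_shared! pattern; note that both arrays must be SharedArrays so every worker can see them):

using Distributed, SharedArrays
nprocs() == 1 && addprocs()
@everywhere using SharedArrays

# Runs on one worker: touch only this worker's local slice of y.
@everywhere function add_local_chunk!(y::SharedArray, x::SharedArray)
    for i in localindices(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

# Driver: ask every worker to process its own chunk, then wait for all.
function shared_add!(y::SharedArray, x::SharedArray)
    @sync for p in procs(y)
        @async remotecall_wait(add_local_chunk!, p, y, x)
    end
    return nothing
end

N = 2^20
x_shar = SharedArray{Float64}(N); x_shar .= 1.0
y_shar = SharedArray{Float64}(N); y_shar .= 2.0
shared_add!(y_shar, x_shar)
all(y_shar .== 3.0)   # should be true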