Parallel and distributed very slow

Hi!
I am trying to learn how to use parallel or distributed computing on the CPU, and I don't understand why it is slower than a sequential for loop. Since I really cannot figure out why, I am asking for your help, thank you.

using BenchmarkTools
using Test

ADD on CPU

N = 2^20
x = fill(1.0, N)
y = fill(2.0, N);
y .+= x
@test all(y .== 3.0)
Test Passed
function sequential_add!(y, x) #add x to y
    for i in eachindex(y,x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y,x)
        @inbounds y[i] += x[i]
    end
    return nothing
end
Threads.nthreads()
8
using Distributed
nprocs() == 1 && addprocs()

@everywhere using SharedArrays
@everywhere begin
    using Test
    using BenchmarkTools
end
y_shar = SharedArray{Float64}(N)
y_shar .= 2.0

function distributed_add!(y,x)
    @sync @distributed for i in 1:length(x) 
        @inbounds y[i] += x[i]
    end
    return nothing
end
distributed_add! (generic function with 1 method)
y_shar .= 2.0
distributed_add!(y_shar, x)
@test all(y_shar .== 3.0)
Test Passed
fill!(y,2.0)
sequential_add!(y, x)
@test all(y .== 3.0)

fill!(y,2.0)
parallel_add!(y,x)
@test all(y .== 3.0)
Test Passed
function add_cpu_bench!(y,x)
    y .+= x
    return nothing
end
add_cpu_bench! (generic function with 1 method)
fill!(y,2.0)
@btime add_cpu_bench!($y,$x)
  610.000 μs (0 allocations: 0 bytes)
fill!(y,2.0)
@btime sequential_add!($y,$x)
  611.100 μs (0 allocations: 0 bytes)
fill!(y,2.0)
@btime parallel_add!($y,$x)
  608.800 μs (40 allocations: 4.11 KiB)
y_shar .= 2.0
@btime distributed_add!($y_shar,$x)

28.836 ms (1288 allocations: 60.72 KiB)

I haven’t used SharedArrays before, but my guess is that since the individual calculations you are doing (a single add) are very small, the overhead becomes larger than the gain from running it in parallel.

As to why the other benchmarks are so similar: it seems reasonable for the dotted version (add_cpu_bench!) and the sequential one, since the dot syntax basically creates the same loop in the background (see here). Why the threaded version matches them I'm not sure; one likely explanation is that a plain elementwise add is memory-bound, so extra cores don't help once memory bandwidth is saturated. Trying it myself, I can see that the threaded version uses all cores fully while the others load only one core, but the timing is very similar for me as well.
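To see the memory-bound effect, here is a minimal sketch (my own experiment, not from this thread) where each element costs more arithmetic than a single add; with enough work per element, Threads.@threads should show a real speedup:

using BenchmarkTools

# Each element now does many floating-point operations, so the loop
# is limited by compute rather than by memory bandwidth.
function sequential_heavy!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] = sqrt(abs(sin(x[i])^3 + cos(x[i])^3))
    end
    return nothing
end

function threaded_heavy!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] = sqrt(abs(sin(x[i])^3 + cos(x[i])^3))
    end
    return nothing
end

xh = rand(2^20); yh = similar(xh)
@btime sequential_heavy!($yh, $xh)
@btime threaded_heavy!($yh, $xh)   # should scale with the number of cores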

Thank you for your quick answer.
I was expecting different benchmarks for the threaded and sequential adds because I am following this introduction: Introduction · CUDA.jl, and I have an i7-7700 with 8 threads.

I am not sure, but fill might try to be clever by using a single value for all entries.
Does the timing change when you use an array of random numbers?
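For example, a minimal sketch of that experiment:

x = rand(N)    # random inputs instead of a constant fill
y = rand(N)
@btime sequential_add!($y, $x)
@btime parallel_add!($y, $x)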

It's possible that the performance is affected by false sharing. You'll need to make sure that each thread accesses contiguous sections of the data, either by splitting the entire array into contiguous subarrays that are passed to each thread, or by working on the local part of a SharedArray on each process (the distributed for loop might require processes to send data to each other, which is expensive). A sketch of the chunking idea follows below.
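Here is a minimal sketch of the chunking approach, assuming Julia 1.3+ for Threads.@spawn (the function name chunked_add! is my own):

using Base.Threads

function chunked_add!(y, x)
    nt = nthreads()
    # Split the index range into nt contiguous blocks; each task owns
    # one block, so threads don't write to neighbouring cache lines.
    chunks = Iterators.partition(eachindex(y, x), cld(length(y), nt))
    @sync for idxs in chunks
        Threads.@spawn for i in idxs
            @inbounds y[i] += x[i]
        end
    end
    return nothing
end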

I tried removing fill and used random arrays instead; nothing changes.

Thank you. Unfortunately I have no idea how to do that, so I tried

@show procs(y_shar)

it returns

procs(y_shar) = [2, 3, 4, 5, 6, 7, 8, 9]

then, in the for loop of my function distributed_add!(), I tried

println(indexpids(y))

and it gives me this:

      From worker 3:	2
      From worker 2:	1
      From worker 5:	4
      From worker 3:	2
      From worker 2:	1
      From worker 5:	4
      From worker 3:	2
      From worker 2:	1
      From worker 3:	2
      From worker 5:	4
      From worker 2:	1
      From worker 2:	1
      From worker 3:	2
      From worker 3:	2
      From worker 3:	2
      From worker 5:	4
      From worker 5:	4
      From worker 5:	4
      From worker 5:	4
      From worker 5:	4
      From worker 2:	1
      From worker 2:	1
      From worker 2:	1
      From worker 2:	1
      From worker 2:	1
      From worker 2:	1

I have no idea what I am doing!

EDIT:
I found something interesting here: Multi-processing and Distributed Computing · The Julia Language

The advection_shared!(q, u) example there looks relevant; I will check it, thank you.
I am a bit disappointed with Distributed, I thought it would be easier ^^".
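Following that manual example, here is a hedged sketch of what a local-part version of the add could look like (the names add_local_chunk! and shared_add! are my own, modeled on the manual's advection_shared! pattern; note that both arrays must be SharedArrays so every worker can see them):

using Distributed, SharedArrays
nprocs() == 1 && addprocs()
@everywhere using SharedArrays

# Runs on one worker: touch only this worker's local slice of y.
@everywhere function add_local_chunk!(y::SharedArray, x::SharedArray)
    for i in localindices(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

# Driver: ask every worker to process its own chunk, then wait for all.
function shared_add!(y::SharedArray, x::SharedArray)
    @sync for p in procs(y)
        @async remotecall_wait(add_local_chunk!, p, y, x)
    end
    return nothing
end

N = 2^20
x_shar = SharedArray{Float64}(N); x_shar .= 1.0
y_shar = SharedArray{Float64}(N); y_shar .= 2.0
shared_add!(y_shar, x_shar)
all(y_shar .== 3.0)   # should be true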