# The sum of an array is faster in the sequential version than in the parallelized version

I have an i7 with 8 logical process but the sum of an array is faster in the sequential versione than in the parallel. Why? Am I doing something wrong? Thanks for the answers.

``````using Distributed

x = rand(1000000)
#println(x)

dim = size(x)
println(dim)

t = @elapsed begin
let
cont = 0
dim = size(x)
println(dim)
for i in 1:dim

cont = cont + x[i]

end
println(cont)
end
end
println(t)

if (nworkers() != 4)
end

t1 = @elapsed begin

cont1 = @distributed (+) for i in 1:size(x)
x[i]
end
println(cont1)

end
println(t1)

``````

Parallelizing imposes a big communications overhead, which is especially huge in this example because your array `x` is all stored on the main process and the contents `x[i]` have to be sent to the subprocesses.

1 Like

Thanks for the answer, I tried to use a SharedArray but is the same thing. Any suggestion?

``````using Distributed, SharedArrays

x = rand(1000000)
#println(x)

dim = size(x)
println(dim)

t = @elapsed begin
let
cont = 0
dim = size(x)
println(dim)
for i in 1:dim

cont = cont + x[i]

end
println(cont)
end
end
println(t)

if (nworkers() != 4)
end

x1 = SharedArray(x)

t1 = @elapsed begin

cont1 = @distributed (+) for i in 1:size(x1)
x1[i]
end
println(cont1)

end
println(t1)
``````

`sum(x)` is simd accelerated, and therefore already ~4x faster than a real serial sum.
So the first challenge is to preserve simd acceleration in your reduction. The second challenge is, that even with ~10^7 element arrays, a sum only takes around ~3.5ms - a low number that can easily be eaten up any multi threading overhead.
This sum is a bit faster (1.4x) than Base.sum:

``````function threadsum(x::AbstractArray{T}) where T
result = similar(x, N)
segment = floor(Int, length(x) / N)
accum = zero(T)
start = (id - 1) * segment + 1
@simd for i in start:(start + segment - 1)
@inbounds (accum += x[i])
end
@inbounds result[id] = accum
end
res = sum(result)
if segment * N != length(x)
res += sum(view(x, (segment * N + 1):length(x)))
end
res
end
``````
1 Like

Your variable `cont` changes type from `Int` to `Float64` and is thus not type stable. Consider using `cont = zero(eltype(x))`, similar to sdanisch’s version.

Also note that `sum` uses a pairwise-summation algorithm, which is substantially more accurate than a simple loop like this.

Summing a bunch of numbers is just a poor case for parallelization. You only want to think about parallelism when you have a computational task that is so slow it is impeding your work, and you have already made a good effort at speeding up the serial code (including going through the Julia performance tips … often the code by inexperienced Julia programmers can be sped up by a factor of 10 or more).

1 Like

The serial version is always faster. Thanks for the answer

Thanks, I make this test with only sum because I want to parallelize a cellular automata but the same algorithm is faster in serial version than the version with parallelism and I don’t understend why Putting my code review hat on, here’s a slightly cleaner way to split an array like that (less code repetition, it’s also more consistent since, as @stevengj points out, `sum` does not simply sum elements):

``````function threadsum(x::AbstractArray{T}) where T
M = length(x)