I have an i7 with 8 logical processors, but the sum of an array is faster in the sequential version than in the parallel one. Why? Am I doing something wrong? Thanks for the answers.
using Distributed

x = rand(1000000)
#println(x)
dim = size(x)[1]
println(dim)

t = @elapsed begin
    let
        cont = 0
        dim = size(x)[1]
        println(dim)
        for i in 1:dim
            cont = cont + x[i]
        end
        println(cont)
    end
end
println(t)

if nworkers() != 4
    addprocs(4)
end

t1 = @elapsed begin
    cont1 = @distributed (+) for i in 1:size(x)[1]
        x[i]
    end
    println(cont1)
end
println(t1)
Parallelizing imposes a big communications overhead, which is especially huge in this example because your array x is all stored on the main process and the contents x[i] have to be sent to the subprocesses.
Also, don’t benchmark in global scope.
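For example, one way to time this more fairly is to wrap each version in a function and benchmark with interpolation (a minimal sketch, assuming the BenchmarkTools package is available; the function names are just for illustration):

using Distributed, BenchmarkTools

# Wrap each version in a function so the work is compiled and not run in global scope.
sequential_sum(x) = sum(x)

function distributed_sum(x)
    # The closure captures x, so the whole array is shipped to the workers on each call.
    @distributed (+) for i in 1:length(x)
        x[i]
    end
end

# Workers are assumed to have been added earlier with addprocs, as in the original post.
x = rand(1_000_000)
@btime sequential_sum($x)   # $x interpolates the global into the benchmark
@btime distributed_sum($x)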
Thanks for the answer. I tried to use a SharedArray, but it’s the same thing. Any suggestions?
using Distributed, SharedArrays

x = rand(1000000)
#println(x)
dim = size(x)[1]
println(dim)

t = @elapsed begin
    let
        cont = 0
        dim = size(x)[1]
        println(dim)
        for i in 1:dim
            cont = cont + x[i]
        end
        println(cont)
    end
end
println(t)

if nworkers() != 4
    addprocs(4)
end

x1 = SharedArray(x)
t1 = @elapsed begin
    cont1 = @distributed (+) for i in 1:size(x1)[1]
        x1[i]
    end
    println(cont1)
end
println(t1)
sum(x) is SIMD accelerated, and therefore already ~4x faster than a plain serial sum.
So the first challenge is to preserve SIMD acceleration in your reduction. The second challenge is that even with ~10^7-element arrays, a sum only takes around ~3.5 ms, a low number that can easily be eaten up by multithreading overhead.
This sum is a bit faster (~1.4x) than Base.sum:
function threadsum(x::AbstractArray{T}) where T
    N = Threads.nthreads()
    result = similar(x, N)
    segment = floor(Int, length(x) / N)
    Threads.@threads for id in 1:N
        accum = zero(T)
        start = (id - 1) * segment + 1
        @simd for i in start:(start + segment - 1)
            @inbounds (accum += x[i])
        end
        @inbounds result[id] = accum
    end
    res = sum(result)
    if segment * N != length(x)
        res += sum(view(x, (segment * N + 1):length(x)))
    end
    res
end
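A quick way to try it (a hypothetical session, assuming Julia was started with several threads):

x = rand(10_000_000)
threadsum(x) ≈ sum(x)   # results agree up to floating-point rounding
@time threadsum(x)
@time sum(x)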
Your variable cont changes type from Int to Float64 and is thus not type stable. Consider using cont = zero(eltype(x)), similar to sdanisch’s version.
Also note that sum uses a pairwise-summation algorithm, which is substantially more accurate than a simple loop like this.
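For example, a type-stable version of the plain serial loop might look like this (just a sketch; loop_sum is an illustrative name, not from the thread):

function loop_sum(x)
    cont = zero(eltype(x))   # a Float64 zero for a Float64 array, so cont never changes type
    @inbounds @simd for i in eachindex(x)
        cont += x[i]
    end
    return cont
end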
Summing a bunch of numbers is just a poor case for parallelization. You only want to think about parallelism when you have a computational task that is so slow it is impeding your work, and you have already made a good effort at speeding up the serial code (including going through the Julia performance tips … often the code by inexperienced Julia programmers can be sped up by a factor of 10 or more).
The serial version is always faster. Thanks for the answer.
Thanks. I made this test with only a sum because I want to parallelize a cellular automaton, but the same algorithm is faster in the serial version than in the parallel one, and I don’t understand why.
Putting my code review hat on, here’s a slightly cleaner way to split an array like that (less code repetition; it’s also more consistent since, as @stevengj points out, sum does not simply sum elements):
function threadsum(x::AbstractArray{T}) where T
    M = length(x)
    N = Threads.nthreads()
    result = similar(x, N)
    @inbounds Threads.@threads for t = 1:N
        accum = zero(T)
        @simd for i = round(Int, M*(t-1)/N)+1:round(Int, M*t/N)
            accum += x[i]
        end
        result[t] = accum
    end
    sum(result)
end
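One practical note on both threaded versions: they only help if Julia was actually started with more than one thread, e.g. by setting the JULIA_NUM_THREADS environment variable before launching; otherwise Threads.nthreads() is 1 and the loop runs on a single thread.

Threads.nthreads()   # check how many threads the current session has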