The sum of an array is faster in the sequential version than in the parallelized version


#1

I have an i7 with 8 logical processors, but the sum of an array is faster in the sequential version than in the parallel one. Why? Am I doing something wrong? Thanks for the answers.

using Distributed

x = rand(1000000)
#println(x)

dim = size(x)[1]
println(dim)

t = @elapsed begin
    let
        cont = 0
        dim = size(x)[1]
        println(dim)
        for i in 1:dim

            cont = cont + x[i]

        end
        println(cont)
    end
end
println(t)

if (nworkers() != 4)
    addprocs(4)
end

t1 = @elapsed begin


    cont1 = @distributed (+) for i in 1:size(x)[1]
        x[i]
    end
    println(cont1)

end
println(t1)


#2

Parallelizing imposes a big communication overhead, which is especially large in this example because your array x is stored entirely on the main process and the contents x[i] have to be sent to the worker processes.

Also, don’t benchmark in global scope.
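For example, a minimal sketch of how you might benchmark this instead: wrap the loop in a function and use BenchmarkTools, interpolating the global `x` with `$` so it isn't treated as an untyped global (`mysum` here is just an illustrative name):

```julia
using BenchmarkTools

function mysum(x)
    cont = zero(eltype(x))   # accumulator with the same type as the elements
    for i in eachindex(x)
        cont += x[i]
    end
    cont
end

x = rand(1_000_000)
@btime mysum($x)   # $x avoids the global-scope penalty in the benchmark
```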


#3

Thanks for the answer. I tried to use a SharedArray but it's the same thing. Any suggestions?


#4
using Distributed, SharedArrays

x = rand(1000000)
#println(x)

dim = size(x)[1]
println(dim)

t = @elapsed begin
    let
        cont = 0
        dim = size(x)[1]
        println(dim)
        for i in 1:dim

            cont = cont + x[i]

        end
        println(cont)
    end
end
println(t)

if (nworkers() != 4)
    addprocs(4)
end

x1 = SharedArray(x)


t1 = @elapsed begin


    cont1 = @distributed (+) for i in 1:size(x1)[1]
        x1[i]
    end
    println(cont1)

end
println(t1)

#5

sum(x) is SIMD-accelerated, and therefore already ~4x faster than a naive serial sum.
So the first challenge is to preserve SIMD acceleration in your reduction. The second challenge is that even with ~10^7-element arrays, a sum only takes around ~3.5 ms - a small amount of time that can easily be eaten up by multithreading overhead.
This sum is a bit faster (1.4x) than Base.sum:

function threadsum(x::AbstractArray{T}) where T
    N = Threads.nthreads()
    result = similar(x, N)
    segment = floor(Int, length(x) / N) 
    Threads.@threads for id in 1:N
        accum = zero(T)
        start = (id - 1) * segment + 1
        @simd for i in start:(start + segment - 1)
            @inbounds (accum += x[i])
        end
        @inbounds result[id] = accum
    end
    res = sum(result)
    if segment * N != length(x)
        res += sum(view(x, (segment * N + 1):length(x)))
    end
    res
end

#6

Your variable cont changes type from Int to Float64 and is thus not type-stable. Consider using cont = zero(eltype(x)), similar to sdanisch's version.
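A minimal sketch of the fix (the function name is just illustrative):

```julia
function stablesum(x)
    cont = zero(eltype(x))  # e.g. 0.0 for a Float64 array: cont never changes type
    for xi in x
        cont += xi
    end
    cont
end
```

With `cont = 0`, the first `cont + x[i]` turns the accumulator into a Float64, so the compiler has to handle both types; `zero(eltype(x))` keeps it one type throughout.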


#7

Also note that sum uses a pairwise-summation algorithm, which is substantially more accurate than a simple loop like this.
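A hypothetical way to see this, using Float32 (where naive left-to-right accumulation loses precision sooner) and a Float64 sum as the reference:

```julia
function naivesum(x)
    s = zero(eltype(x))
    for xi in x
        s += xi
    end
    s
end

x = rand(Float32, 10^7)
ref = sum(Float64, x)     # higher-precision reference
abs(naivesum(x) - ref)    # error of the simple loop: typically much larger than...
abs(sum(x) - ref)         # ...the error of Base.sum's pairwise summation
```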

Summing a bunch of numbers is just a poor case for parallelization. You only want to think about parallelism when you have a computational task that is so slow it is impeding your work, and you have already made a good effort at speeding up the serial code (including going through the Julia performance tips … often the code by inexperienced Julia programmers can be sped up by a factor of 10 or more).


#8

The serial version is always faster. Thanks for the answer


#9

Thanks. I made this test with just a sum because I want to parallelize a cellular automaton, but the same algorithm is faster in the serial version than in the version with parallelism, and I don't understand why :confused:


#10

Putting my code review hat on, here’s a slightly cleaner way to split an array like that (less code repetition, it’s also more consistent since, as @stevengj points out, sum does not simply sum elements):

function threadsum(x::AbstractArray{T}) where T
    M = length(x)
    N = Threads.nthreads()
    result = similar(x, N)
    @inbounds Threads.@threads for t = 1:N
        accum = zero(T)
        @simd for i = round(Int, M*(t-1)/N)+1:round(Int, M*t/N)
            accum += x[i]
        end
        result[t] = accum
    end
    sum(result)
end