Parallel computing doesn't work in my case?

I tried both multi-threading and distributed computing, but it seems that neither improves performance; the fastest option so far is just a regular for loop.

using BenchmarkTools
using Distributed
addprocs(2 - nprocs())
@everywhere include("view_vector.jl")


size = 100
b = Vector{Matrix{Float64}}(undef, 10000)

for i = 1:length(b)
    b[i] = rand(size, size)
end

println("-----------------------")
@btime test.testFOR(b, size)
println("-------------- simd --------")
@btime test.testSIMD(b, size)
println(" ========= multi-threading =========")
@btime test.testTHREADS(b, size)
@btime test.test3FLOOP(b, size)
println("---------- Distributed  -------------")

@btime test.testDIST(b, size)

Below is the module view_vector.jl:

@everywhere module test
using BenchmarkTools
using FLoops
using SharedArrays
using Distributed


@inbounds function testFOR(b, size)
    t = zeros(size, size)
    for i in 1:length(b)
        t += b[i]
    end
end

@inbounds function testDIST(b, size)

    t = SharedArray(zeros(size, size))
    @sync @distributed for i in 1:length(b)
        t += b[i]
    end
end


@inbounds function testTHREADS(b, size)
    t = zeros(size, size)
    Threads.@threads for i in 1:length(b)
        t += b[i]
    end
end

@inbounds function test3FLOOP(b, size)
    t = zeros(size, size)
    @floop for i in 1:length(b)
        @reduce t += b[i]
    end
end

@inbounds function testSIMD(b, size)
    t = zeros(size, size)
    @simd for i in 1:10000
        t += b[i]
    end
end

end

The result is:

WARNING: replacing module test.
WARNING: replacing module test.
      From worker 2:    WARNING: replacing module test.
      From worker 2:    WARNING: replacing module test.
-----------------------
-----------------------
  204.368 ms (30003 allocations: 763.78 MiB)
-------------- simd --------
  228.540 ms (30003 allocations: 763.78 MiB)
 ========= multi-threading =========
  337.599 ms (30026 allocations: 763.78 MiB)
  332.548 ms (30026 allocations: 763.70 MiB)
---------- Distributed  -------------
  1.358 s (120641 allocations: 4.06 MiB)
Task (done) @0x0000025599f683f0

Welcome to Julia!

A good start is to optimize your single-threaded performance. In this case, that means fixing your allocations problem: it doesn't seem like you should need 30,000 additional allocations just to add up these already-allocated arrays, does it? Poor memory use is a big blocker of parallel performance. This should fix it:

function testFOR_dot(b, size)
    t = zeros(size, size)
    for i in eachindex(b)
        t .+= b[i]   # in-place broadcast: no temporary matrix allocated per iteration
    end
    return t
end
julia> @btime testFOR($b, $size);
  81.634 ms (30003 allocations: 763.78 MiB)

julia> @btime testFOR_dot($b, $size);
  40.489 ms (3 allocations: 78.20 KiB)

Granted, that only cost about 2x performance in this case. Still, now that you're allocating roughly 10,000x less memory, you might see more benefit from parallelization. In any case, note that this calculation (a loop of loops performing a simple elementwise sum of arrays) is unlikely to see fantastic speedups from parallelization, because it will ultimately be limited by memory bandwidth.

Note that @simd should not be expected to do anything here; this isn't the sort of loop @simd can help with. SIMD instructions are already used inside the individual t += b[i] (or t .+= b[i]) statements, where they give a significant benefit.
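
For illustration, here is a minimal sketch (my own, not part of your benchmark) of the kind of loop @simd is actually aimed at: a plain scalar reduction over the entries of a single array.

function sum_entries(A)
    s = 0.0
    @simd for i in eachindex(A)   # scalar accumulation: this is what @simd can vectorize
        @inbounds s += A[i]
    end
    return s
end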

I’m not sure whether @distributed, @threads, or @floop are effective in this context or not, but they could be. On the other hand, I would worry that this use of @distributed or @threads might suffer from a race condition that could give the wrong results – you might want to check on them. I don’t know anything about @floop but I suspect the @reduce might at least let it avoid that hazard.
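
If you do want to stay with Distributed, one pattern that avoids that hazard is the reduction form of @distributed, which combines per-worker partial results instead of mutating a shared accumulator. A rough sketch (testDIST_reduce is just my name for it, and I haven't benchmarked it on your setup; shipping b to the workers may well dominate the runtime):

function testDIST_reduce(b)
    # requires `using Distributed` and workers added via addprocs, as in your script;
    # each iteration evaluates to b[i] and the (+) reducer sums the per-worker partials
    t = @distributed (+) for i in 1:length(b)
        b[i]
    end
    return t
end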

3 Likes

I can confirm that just using @threads will indeed lead to incorrect outputs due to race conditions. Using @floop and @reduce avoids these issues, while also being faster than the single-threaded version.

Edit: Modified testFLOOPS to move away from the incorrect example in the docs (though, in contrast to testTHREADS, I did not empirically observe any race-condition issues here).

@btime testFOR_dot($b, $size);  # 35.885 ms (3 allocations: 78.20 KiB)


function testFLOOPS(b, size)
    @floop for i in 1:length(b)
        @inbounds bmat = b[i]
        @reduce() do (t = zeros(size, size); bmat)
            t .+= bmat
        end
    end
    return t
end

# nthreads() == 8
@btime testFLOOPS($b, $size);  # 22.462 ms (83 allocations: 630.39 KiB)

(Using @spawn I get essentially the same timing, so this seems like a reasonable result.)

For reference, the testSPAWN implementation:
function testSPAWN(b, size)
    # split the indices into one chunk per thread; each task accumulates its own partial sum
    tasks = map(collect(Iterators.partition(eachindex(b), cld(length(b), Threads.nthreads())))) do ids
        Threads.@spawn begin
            thread_t = zeros(size, size)
            for i in ids
                thread_t .+= b[i]
            end
            return thread_t
        end
    end

    t = zeros(size, size)
    for task in tasks
        t .+= fetch(task)
    end

    return t
end

@btime testSPAWN($b, $size);  #   22.493 ms (79 allocations: 709.27 KiB)
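
If you'd rather keep @threads itself, the same idea (one accumulator per chunk, combined at the end) works there too. A rough sketch (testTHREADS_chunks is just an illustrative name, not something I benchmarked):

function testTHREADS_chunks(b, size)
    nchunks = Threads.nthreads()
    partials = [zeros(size, size) for _ in 1:nchunks]
    Threads.@threads for c in 1:nchunks
        for i in c:nchunks:length(b)   # each chunk owns a disjoint set of indices
            partials[c] .+= b[i]       # so only one task ever writes to partials[c]
        end
    end
    t = zeros(size, size)
    for p in partials
        t .+= p
    end
    return t
end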

I also already got a slight speed-up going from testFOR to testTHREADS (ignoring that the output is incorrect): 148.400 ms vs 139.990 ms. @D_W, did you start Julia with (e.g.) -t auto?
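
For reference, the thread count is fixed when Julia starts, so it has to be passed on the command line (julia -t auto, or e.g. julia -t 8); you can check what you actually got from inside the session:

julia> Threads.nthreads()   # 8 on my machine
8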

1 Like

Thank you both. I replaced “+=” with “.+=” and got a speed boost:

-----------------------
  40.618 ms (3 allocations: 78.20 KiB)  --> testFOR
-------------- simd --------
  40.271 ms (3 allocations: 78.20 KiB) --> testSIMD
 ========= multi-threading =========
  25.667 ms (25 allocations: 80.73 KiB) --> testTHREADS
  22.529 ms (42 allocations: 393.05 KiB) --> test3FLOOP

However, the result from testTHREADS is obviously incorrect. The one from test3FLOOP is close but not identical (e.g., an element from testFOR is 5015.350429849807, while the corresponding one from test3FLOOP is 5015.350429849821). Maybe this is OK since only the last two digits differ? I tried the testSPAWN and testFLOOPS versions suggested by eldee and also got close-but-not-identical results compared with testFOR.

1 Like

Yes, this difference is negligible and will be due to the different order of summation (and induced rounding differences) between the various implementations. In fact, in general I’d imagine that the simple for loop would be the least numerically stable.
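
You can see the effect in isolation at the REPL; this tiny example is unrelated to your data, but it shows that floating-point addition is not associative, so a different summation order can change the last digits:

julia> (0.1 + 0.2) + 0.3
0.6000000000000001

julia> 0.1 + (0.2 + 0.3)
0.6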

1 Like