Individual threads are superfast, but the @sync loop is really slow, despite virtually no communication

Hey there!

So I am currently trying to understand how parallelization works in Julia. As practice, I wrote this little pi estimator:

function estimate_pi(number)
    m = 0.0
    for i in 1:number
        x = rand()
        y = rand()
        if x^2 + y^2 < 1
            m += 1
        end
    end
    return 4 * m / number
end

This works quite nicely; for

@time estimate_pi(100_000_000)

I get the output

0.206207 seconds

Now I thought: hey, this should be easy to parallelize. I am running Julia with 16 threads on a 32-core CPU, so that should not be an issue. I first tried

Threads.@threads

which made it slower rather than faster. Now I am trying this version:

function estimate_pi_parallel(number)
    m = SharedArray(zeros(10))
    @time @sync for j in 1:10
        @time Threads.@spawn for i in 1:Int(number / 10)
            x = rand()
            y = rand()
            if x^2 + y^2 ≤ 1
                m[j] += 1
            end
        end
    end
    pi = 4 .* m ./ number |> sum
    return pi
end

Note that I am timing the outer loop and each inner loop independently. Furthermore, since Julia is running with 16 threads and only 10 tasks are submitted, they should all run in parallel. The output I get is the following:

estimate_pi_parallel(100_000_000)

  0.000002 seconds (6 allocations: 592 bytes)
  0.000001 seconds (5 allocations: 512 bytes)
  0.000001 seconds (5 allocations: 512 bytes)
  0.000001 seconds (5 allocations: 512 bytes)
  0.000001 seconds (5 allocations: 512 bytes)
  0.000001 seconds (5 allocations: 512 bytes)
  0.000001 seconds (5 allocations: 512 bytes)
  0.000013 seconds (5 allocations: 512 bytes)
  0.000001 seconds (6 allocations: 848 bytes)
  0.000001 seconds (5 allocations: 512 bytes)
  0.384544 seconds (52.97 k allocations: 3.669 MiB, 19.82% compilation time)

The last one is the timing for the overall @sync for loop. This is slower than the sequential version.
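As far as I can tell, the tiny inner timings are expected: Threads.@spawn hands back a Task immediately, so the inner @time measures only task creation, never the loop body. A small check (the sleep duration is arbitrary):

```julia
# @spawn returns a Task right away; the work runs concurrently on some thread.
t = @time Threads.@spawn sleep(1)   # prints a tiny time: task creation only
@time wait(t)                       # prints about one second: the sleep itself
```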

Now my question: why is this, and how can I fix it?

Thank you very much in advance!

P.S. It turns out that the parallelized version scales like the sequential one, i.e. it is always about half as fast as the sequential version, regardless of the number of iterations.

Part of the problem is m = SharedArray(zeros(10)); you want m = zeros(10). SharedArray is for distributed programming. Threads can already share memory.
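To illustrate that last point, here is a minimal sketch (my own, not from the original code) of one correct way to share a counter between threads, using Threads.Atomic. It is correct, but slow, since every increment contends on one location:

```julia
# A counter that threads may update safely: Threads.Atomic.
# Correct, but slow: every hit contends on the same memory location.
function estimate_pi_atomic(number)
    hits = Threads.Atomic{Int}(0)
    Threads.@threads for i in 1:number
        x = rand()
        y = rand()
        if x^2 + y^2 < 1
            Threads.atomic_add!(hits, 1)  # atomic increment, no data race
        end
    end
    return 4 * hits[] / number
end
```

This fixes correctness only; accumulating privately in each task and combining the results at the end is far faster.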


Thanks for that answer! But I had already tried it without SharedArray, and that was not better either.

Try to keep stuff in registers instead of committing to memory!
I.e.

function count_pi(number)
    m = 0.0
    for i in 1:number
        x = rand()
        y = rand()
        if x^2 + y^2 < 1
            m += 1
        end
    end
    return m
end

function estimate_pi_parallel(number)
    m = zeros(10)
    Threads.@threads for j in 1:10
        m[j] = count_pi(Int(number / 10))
    end
    return 4 * sum(m) / number
end

The reason for the abysmal rather than merely bad performance is false sharing: all ten counters live on the same cache line (or adjacent ones), so every increment invalidates that line for the other threads.

You could mitigate that by allocating m = zeros(8 * 10) and writing m[8*j] += 1, so each counter gets its own 64-byte cache line. But that is a complete anti-pattern: you only need to write to memory when your inner loop is finished!
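The same "write to memory only when the loop is finished" idea also works without any shared array, by spawning tasks that each return their private count. A sketch (the task count ntasks and the integer division are my choices here, and count_pi is repeated for completeness):

```julia
# count_pi as earlier in the thread, with an integer counter.
function count_pi(number)
    m = 0
    for i in 1:number
        x = rand()
        y = rand()
        if x^2 + y^2 < 1
            m += 1  # stays in a register until the loop ends
        end
    end
    return m
end

# Each task accumulates privately; the results only meet in the final sum.
function estimate_pi_spawn(number; ntasks = Threads.nthreads())
    chunk = number ÷ ntasks                                 # drops any remainder
    tasks = [Threads.@spawn count_pi(chunk) for _ in 1:ntasks]
    hits = sum(fetch.(tasks))
    return 4 * hits / (chunk * ntasks)
end
```

fetch blocks until its task is done, so summing the fetched results implicitly synchronizes all tasks.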


Thank you so much! That fixed it.