Individual threads are superfast, but the @sync loop is really slow, despite virtually no communication

Hey there!

So I am currently trying to understand how parallelization works in Julia. As practice, I wrote this little pi estimator:

function estimate_pi(number)
    m = 0.0
    for i in 1:number
        x = rand()
        y = rand()
        if x^2 + y^2 < 1
            m += 1
        end
    end
    return 4 * m / number
end

This works quite nicely; for

@time estimate_pi(100_000_000)

I get the output

0.206207 seconds

Now I thought: hey, this should be easy to parallelize. I am running Julia with 16 threads on a 32-core CPU, so that should not be an issue. I first tried

Threads.@threads

which made it slower rather than faster. Now I am trying this version:

function estimate_pi_parallel(number)
    m = SharedArray(zeros(10))
    @time @sync for j in 1:10
        @time Threads.@spawn for i in 1:Int(number / 10)
            x = rand()
            y = rand()
            if x^2 + y^2 ≤ 1
                m[j] += 1
            end
        end
    end
    pi = 4 .* m ./ number |> sum
    return pi
end

Note that I am timing the outer loop and each inner loop independently. Furthermore, since Julia is running with 16 threads and only 10 tasks are submitted, they should all run in parallel. The output I get is the following:

estimate_pi_parallel(100_000_000)

  0.000002 seconds (6 allocations: 592 bytes)
  0.000001 seconds (5 allocations: 512 bytes)
  0.000001 seconds (5 allocations: 512 bytes)
  0.000001 seconds (5 allocations: 512 bytes)
  0.000001 seconds (5 allocations: 512 bytes)
  0.000001 seconds (5 allocations: 512 bytes)
  0.000001 seconds (5 allocations: 512 bytes)
  0.000013 seconds (5 allocations: 512 bytes)
  0.000001 seconds (6 allocations: 848 bytes)
  0.000001 seconds (5 allocations: 512 bytes)
  0.384544 seconds (52.97 k allocations: 3.669 MiB, 19.82% compilation time)

The last one is the timing for the overall @sync for loop. This is slower than the sequential version.
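As far as I can tell, the tiny inner timings are expected: Threads.@spawn hands back a Task immediately, so the inner @time measures only task creation, never the loop body. A small check (the sleep duration is arbitrary):

```julia
# @spawn returns a Task right away; the work runs concurrently on some thread.
t = @time Threads.@spawn sleep(1)   # prints a tiny time: task creation only
@time wait(t)                       # prints about one second: the sleep itself
```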

Now my question: why is this, and how can I fix it?

Thank you very much in advance!

P.S. It turns out that the parallelized version scales like the sequential one, i.e. it is always about half as fast as the sequential version, regardless of the number of iterations.

Part of the problem is m = SharedArray(zeros(10)); you want m = zeros(10). SharedArray is for distributed programming. Threads can already share memory.
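To illustrate that last point, here is a minimal sketch (my own, not from the original code) of one correct way to share a counter between threads, using Threads.Atomic. It is correct, but slow, since every increment contends on one location:

```julia
# A counter that threads may update safely: Threads.Atomic.
# Correct, but slow: every hit contends on the same memory location.
function estimate_pi_atomic(number)
    hits = Threads.Atomic{Int}(0)
    Threads.@threads for i in 1:number
        x = rand()
        y = rand()
        if x^2 + y^2 < 1
            Threads.atomic_add!(hits, 1)  # atomic increment, no data race
        end
    end
    return 4 * hits[] / number
end
```

This fixes correctness only; accumulating privately in each task and combining the results at the end is far faster.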


Thanks for that answer! But I had already tried it without SharedArray, and that was not better either.

Try to keep stuff in registers instead of committing to memory!
I.e.

function count_pi(number)
    m = 0.0
    for i in 1:number
        x = rand()
        y = rand()
        if x^2 + y^2 < 1
            m += 1
        end
    end
    return m
end

function estimate_pi_parallel(number)
    m = zeros(10)
    Threads.@threads for j in 1:10
        m[j] = count_pi(Int(number / 10))
    end
    return 4 * sum(m) / number
end

The reason for the abysmal rather than merely bad performance is false sharing: all ten counters live on the same cache line (or adjacent ones), so every increment invalidates that line for the other threads.

You could mitigate that by allocating m = zeros(8 * 10) and writing m[8*j] += 1, so each counter gets its own 64-byte cache line. But that is a complete anti-pattern: you only need to write to memory when your inner loop is finished!
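The same "write to memory only when the loop is finished" idea also works without any shared array, by spawning tasks that each return their private count. A sketch (the task count ntasks and the integer division are my choices here, and count_pi is repeated for completeness):

```julia
# count_pi as earlier in the thread, with an integer counter.
function count_pi(number)
    m = 0
    for i in 1:number
        x = rand()
        y = rand()
        if x^2 + y^2 < 1
            m += 1  # stays in a register until the loop ends
        end
    end
    return m
end

# Each task accumulates privately; the results only meet in the final sum.
function estimate_pi_spawn(number; ntasks = Threads.nthreads())
    chunk = number ÷ ntasks                                 # drops any remainder
    tasks = [Threads.@spawn count_pi(chunk) for _ in 1:ntasks]
    hits = sum(fetch.(tasks))
    return 4 * hits / (chunk * ntasks)
end
```

fetch blocks until its task is done, so summing the fetched results implicitly synchronizes all tasks.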


Thank you so much! That fixed it.