Hey there!

So I am currently trying to understand how parallelization works in Julia, as a practice I tried writing this nice pi estimator:

```
function estimate_pi(number)
m=0.
for i=1:number
x=rand()
y=rand()
if x^2+y^2 < 1
m+=1
end
end
pi = 4 * m/number
return pi
end
```

This works quite nicely, for

```
@time estimate_pi(100_000_000)
```

I get the output

```
0.206207 seconds
```

Now I thought: Hey this should be easily parallelized. I am running Julia with 16 threads on a 32 core CPU. So this should not be an issue. I first tried

```
Threads.@threads
```

which made it rather slower than faster, now I am trying this version

```
function estimate_pi_parallel(number)
m=SharedArray(zeros(10))
@time @sync for j=1:10
@time Threads.@spawn for i=1:Int(number/10)
x=rand()
y=rand()
if x^2+y^2 ≤ 1
m[j]+=1
end
end
end
pi = 4 .* m ./ number |> sum
return pi
end
```

Note that I am testing the outer loop and the inner loop independently. Furthermore since I have 16 threads running in Julia and there are only 10 submitted all should run in parallel. The output I get is the following:

```
estimate_pi_parallel(100_000_000)
0.000002 seconds (6 allocations: 592 bytes)
0.000001 seconds (5 allocations: 512 bytes)
0.000001 seconds (5 allocations: 512 bytes)
0.000001 seconds (5 allocations: 512 bytes)
0.000001 seconds (5 allocations: 512 bytes)
0.000001 seconds (5 allocations: 512 bytes)
0.000001 seconds (5 allocations: 512 bytes)
0.000013 seconds (5 allocations: 512 bytes)
0.000001 seconds (6 allocations: 848 bytes)
0.000001 seconds (5 allocations: 512 bytes)
0.384544 seconds (52.97 k allocations: 3.669 MiB, 19.82% compilation time)
```

The last one is the timing for the overall @sync for loop. This is slower than the sequential version.

Now my question:

Why is this and how can I fix it.

Thank you very much in advance!

PS. It turns out that the parallelized version scales as the sequential one. I.e. it is always about half as fast as the sequential one regardless of the number of iterations