# Speed up Julia code for simple Monte Carlo Pi estimation (compared to Numba)

Hello, I’m benchmarking Julia against some other languages, mainly Cython and Numba (in this case mostly Numba). I’m running a simple Monte Carlo estimate of Pi, trying to recreate Python+Numba vs. Julia, and to extend it to parallel execution.

For the non-parallel case Julia is slightly faster, not by as much as in that post, but reasonably quicker. However, threading seems to slow Julia down (so Numba can be faster in parallel). Any ideas why?
I’m still trying to wrap my head around all the different optimisations: @simd, @avx, @distributed, Threads.@threads, etc.

I get the following running Julia (with `export JULIA_NUM_THREADS = 4`):

```
3.092
115.656 ms (1 allocation: 16 bytes)
3.142268
3.1
398.524 ms (7851017 allocations: 119.80 MiB)
3.1422492
```

and running Numba (where parallel does speed it up):

```
3.196
147 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.16
71.8 ms ± 6.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

The Julia code

```julia
using BenchmarkTools

radius = 1.
diameter = 2. * radius

function estimate_pi(nMC)
    n_circle = 0
    for i in 1:nMC
        x = (rand() - 0.5) * diameter
        y = (rand() - 0.5) * diameter
        r = sqrt(x^2 + y^2)
        if r < radius
            n_circle += 1
        end
    end
    return (n_circle / nMC) * 4.
end

# threaded version (think this is the right approach)
function estimate_pi_thread(nMC)
    n_circle = 0
    Threads.@threads for i in 1:nMC
        x = (rand() - 0.5) * diameter
        y = (rand() - 0.5) * diameter
        r = sqrt(x^2 + y^2)
        if r < radius
            n_circle += 1
        end
    end
    return (n_circle / nMC) * 4.
end

nMC2 = 10000000

println(estimate_pi(1000)) # to warm up fn
@btime pi_est = estimate_pi(nMC2)

println(estimate_pi_thread(1000)) # to warm up fn
@btime pi_est = estimate_pi_thread(nMC2)
```

Python/Numba code

```python
from time import time

import numba
import numpy as np

radius = 1.
diameter = 2. * radius

@numba.njit
def estimate_pi_numba(nMC):
    n_circle = 0
    for i in range(nMC):
        x = (np.random.random() - 0.5) * diameter
        y = (np.random.random() - 0.5) * diameter
        r = np.sqrt(x ** 2 + y ** 2)
        if r < radius:
            n_circle += 1
    return 4. * n_circle / nMC

# parallel version
@numba.njit(parallel=True)
def estimate_pi_numba_prange(nMC):
    n_circle = 0
    for i in numba.prange(nMC):
        x = (np.random.random() - 0.5) * diameter
        y = (np.random.random() - 0.5) * diameter
        r = np.sqrt(x ** 2 + y ** 2)
        if r < radius:
            n_circle += 1
    return 4. * n_circle / nMC

nMC = 10000000

print(estimate_pi_numba(1000)) # to warm up fn
%timeit -n 10 pi_est = estimate_pi_numba(nMC)

print(estimate_pi_numba_prange(1000)) # to warm up fn
%timeit -n 10 pi_est = estimate_pi_numba_prange(nMC)
```

By the way, there’s no need to do this warm-up call. BenchmarkTools’ `@btime` already takes care of compilation for you.
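To see what `@btime` is taking care of, here is a minimal stdlib-only sketch (the function `f` is a made-up example, not code from this thread): the first timing of a fresh function includes JIT compilation, later calls do not, and `@btime` always reports post-compilation timings.

```julia
# f is a throwaway example function, not part of the thread's code
f(n) = sum(i^2 for i in 1:n)

@time f(10^6)   # first call: timing includes JIT compilation
@time f(10^6)   # later calls: just the run time, which is what @btime reports
```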

I’m pretty sure this is yet another example of performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub, due to the closure created by `Threads.@threads`. You can see the problem by doing:

```julia
julia> @code_warntype estimate_pi_thread(1000)
Variables
  nMC::Int64
  n_circle@_4::Core.Box
```

Using the usual `Ref` trick solves that problem:

```julia
function estimate_pi_thread(nMC)
    n_circle = Ref(0)
    Threads.@threads for i in 1:nMC
        x = (rand() - 0.5) * diameter
        y = (rand() - 0.5) * diameter
        r = sqrt(x^2 + y^2)
        if r < radius
            n_circle[] += 1
        end
    end
    return (n_circle[] / nMC) * 4.
end
```

With that change, I see the following with `julia --threads=1` (only one thread to work with):

```julia
julia> @btime estimate_pi($nMC2)
  88.248 ms (0 allocations: 0 bytes)
3.1414028

julia> @btime estimate_pi_thread($nMC2)
  137.082 ms (7 allocations: 576 bytes)
3.1417592
```

With `julia --threads=2`, I see:

```julia
julia> @btime estimate_pi_thread($nMC2)
  114.105 ms (12 allocations: 1.00 KiB)
1.7710592
```

and with `--threads=4` I see:

```julia
julia> @btime estimate_pi_thread($nMC2)
  110.060 ms (22 allocations: 1.89 KiB)
0.932612
```

so indeed there’s not much benefit from this particular manner of threading. I would guess that there’s too little work being done in each iteration for the overhead of threading to make sense.
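One way to test that hypothesis is to change the threading layout so that each thread gets one large chunk of work and its own counter, removing the per-iteration overhead. This is a sketch with made-up names (`estimate_pi_chunked` is not code from the thread), assuming the radius-0.5 circle in the unit square:

```julia
# Sketch: per-thread counters plus a strided split of the iterations,
# so each task does a large amount of work between synchronization points.
function estimate_pi_chunked(nMC)
    nt = Threads.nthreads()
    counts = zeros(Int, nt)
    Threads.@threads for t in 1:nt
        n = 0                     # thread-private accumulator
        for i in t:nt:nMC         # strided split of 1:nMC among nt threads
            x = rand() - 0.5
            y = rand() - 0.5
            n += (x^2 + y^2 < 0.25)   # inside circle of radius 0.5
        end
        counts[t] = n             # one write per thread, distinct slots
    end
    return 4 * sum(counts) / nMC
end
```

Each thread only touches its own slot of `counts`, so there is no data race, and the reduction happens once at the end.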


The threaded code is not correct, because multiple threads are concurrently incrementing `n_circle` (a data race). You should at least accumulate into a per-thread array:

```julia
julia> # threaded version (think this is the right approach)
       function estimate_pi_thread(nMC)
           n_circle = zeros(Int, Threads.nthreads())
           Threads.@threads for i in 1:nMC
               x = (rand() - 0.5) * diameter
               y = (rand() - 0.5) * diameter
               r = sqrt(x^2 + y^2)
               if r < radius
                   n_circle[Threads.threadid()] += 1
               end
           end
           return (sum(n_circle) / nMC) * 4.
       end
estimate_pi_thread (generic function with 1 method)
```

That seems to solve most of the problem:

```julia
julia> @btime estimate_pi_thread(nMC2)
  76.575 ms (23 allocations: 2.00 KiB)
3.1414848

julia> @btime estimate_pi($nMC2)
  103.041 ms (0 allocations: 0 bytes)
3.1408116
```

(I think the Numba code is not correct either.)


Good point! I didn’t even notice that the answers I was getting with the threaded code were completely wrong. @smpurkis, the fact that you got correct answers in your threaded version might suggest that you aren’t actually using multiple threads at all. Are you starting Julia with `julia --threads=<something>`?

Also, there are better parallelization models for a loop with small computations inside, like the one provided by `FLoops`:

```julia
julia> using FLoops

julia> # threaded version using FLoops
       function estimate_pi_floop(nMC)
           @floop for i in 1:nMC
               x = (rand() - 0.5) * diameter
               y = (rand() - 0.5) * diameter
               r = sqrt(x^2 + y^2)
               if r < radius
                   @reduce(n_circle += 1)
               end
           end
           return (n_circle / nMC) * 4.
       end
estimate_pi_floop (generic function with 1 method)
```
```julia
julia> @btime estimate_pi_floop($nMC2)
  24.345 ms (58 allocations: 2.80 KiB)
3.1413176
```

Aside from the threading, there are a few other tricks you can use:

• Drop the `sqrt` and compare with the radius squared (this goes for python too)
• Explicitly pass the rng to `rand`
• Drop the comparison branch
```julia
using Random: default_rng

rad2 = radius^2  # squared radius; assumes `radius` and `diameter` from above

function est_pi(N)
    n_circle = 0
    rng = default_rng()
    for i in 1:N
        x = (rand(rng) - 0.5) * diameter
        y = (rand(rng) - 0.5) * diameter
        r2 = x^2 + y^2  # look, no sqrt!
        n_circle += (r2 < rad2)  # no branch
    end
    return 4 * n_circle / N
end
```

I set `export JULIA_NUM_THREADS = 4` before running the code in the REPL. I presumed that would take effect in the REPL.

It should work. Maybe the spaces in `JULIA_NUM_THREADS = 4` are the problem? (in bash they cause an error)

you can just use:

`julia -t4`


Flip, you are totally right! Removing the spaces, it now gives a wrong estimate:

```julia
julia> @btime pi_est = estimate_pi_thread(nMC2)
  367.631 ms (7851824 allocations: 119.81 MiB)
0.899592
```

By the way, you can check whether you are using threads within the Julia session by calling `Threads.nthreads()`.


Thank you for the tip! Will take a look at the FLoops package. I am still very new to Julia; I only recently heard about LoopVectorization and tried it here using `@avx`, but it gave incorrect results. Still learning when to use which optimisation.

That will also sort of parallelize the loop, so you will have the same concurrency problem (and wrong results). Loop vectorization should be used when the operations in each iteration of the loop are independent of the others.
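To illustrate that independence requirement, here is a sketch using Base’s `@simd` (the same principle applies to `@avx`; `estimate_pi_presampled` is a made-up name, not code from the thread): if the random numbers are drawn up front, each iteration only reads `x[i]` and `y[i]`, and the loop becomes a plain reduction the compiler can vectorize.

```julia
# Pre-generate the samples so each iteration is independent of the others;
# the loop body then has no call to rand() and no cross-iteration state
# except the reduction variable.
function estimate_pi_presampled(nMC)
    x = rand(nMC)   # points in the unit square
    y = rand(nMC)
    n_circle = 0
    @inbounds @simd for i in 1:nMC
        # inside the circle of radius 0.5 centered at (0.5, 0.5)?
        n_circle += ifelse((x[i] - 0.5)^2 + (y[i] - 0.5)^2 < 0.25, 1, 0)
    end
    return 4 * n_circle / nMC
end
```

The trade-off is two temporary arrays of length `nMC`, so this spends memory to buy vectorizability.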

Thank you for these tips! I didn’t realize there were such simple optimisations for this. I mostly used the code from Python+Numba vs. Julia


I tried it like this

```julia
using LoopVectorization
using BenchmarkTools

function XYinCircle(radius, diameter)
    x = (rand() - 0.5) * diameter
    y = (rand() - 0.5) * diameter
    r = sqrt(x^2 + y^2)
    if r < radius
        return 1
    end
    return 0
end

function estimate_pi_avx(nMC)
    n_circles = Vector{Bool}(undef, nMC)
    @avx for i in 1:nMC
        n_circles[i] = XYinCircle(radius, diameter)  # independent from each other
    end
    n_circle = sum(n_circles)
    return (n_circle / nMC) * 4.
end
```

In REPL:

```julia
julia> estimate_pi_avx(100000)
4.0
```

Seems `n_circles` ends up either all 0s or all 1s when using `@avx`.

Thanks, have taken a look, it is amazing that Julia can be that fast! Have to say the optimised Julia version goes waaaay over my head XD

It can be annoying, but multithreaded `rand` is (currently) fastest if you manually hoist out the access to the thread-local global RNG:


Indeed, that speeds up both alternatives by a factor of 2. Copying the solution from here: Random number and parallel execution - #16 by greg_plowman

We have:

```julia
julia> Threads.nthreads()
4

julia> @btime estimate_pi_thread(10000000)
  35.675 ms (22 allocations: 1.98 KiB)
3.1413444

julia> @btime estimate_pi_floop(10000000)
  12.244 ms (57 allocations: 2.77 KiB)
3.1405184
```

But the code required for that is quite tedious:

Code
```julia
using Random, Future, BenchmarkTools, FLoops

function parallel_rngs(rng::MersenneTwister, n::Integer)
    step = big(10)^20
    rngs = Vector{MersenneTwister}(undef, n)
    rngs[1] = copy(rng)
    for i = 2:n
        rngs[i] = Future.randjump(rngs[i-1], step)
    end
    return rngs
end

const rng = MersenneTwister();
const rngs = parallel_rngs(rng, Threads.nthreads());

function estimate_pi_thread(nMC)
    n_circle = zeros(Int, Threads.nthreads())
    Threads.@threads for i in 1:nMC
        rng = rngs[Threads.threadid()]
        x = (rand(rng) - 0.5) * diameter
        y = (rand(rng) - 0.5) * diameter
        r = sqrt(x^2 + y^2)
        if r < radius
            n_circle[Threads.threadid()] += 1
        end
    end
    return (sum(n_circle) / nMC) * 4.
end

function estimate_pi_floop(nMC)
    @floop for i in 1:nMC
        rng = rngs[Threads.threadid()]
        x = (rand(rng) - 0.5) * diameter
        y = (rand(rng) - 0.5) * diameter
        r = sqrt(x^2 + y^2)
        if r < radius
            @reduce(n_circle += 1)
        end
    end
    return (n_circle / nMC) * 4.
end
```

Probably they aren’t, since `rand()` generates a sequence: I think you cannot use loop vectorization with a `rand` inside. Others may be more specific about that; it might also simply error out.

Look how things don’t go well:

```julia
julia> function f() # with avx
           s = 0.
           @avx for i in 1:100
               s += rand()
           end
           s/100
       end
f (generic function with 1 method)

julia> f()  # results vary a lot
0.7592638464484608

julia> f()
0.4069099370815508

julia> f()
0.8003393396344347

julia> function f() # no avx
           s = 0.
           for i in 1:100
               s += rand()
           end
           s/100
       end
f (generic function with 1 method)

julia> f() # always close to 0.5
0.5316257501143982

julia> f()
0.5244277946478424

julia> f()
0.47138753152473184

julia> f()
0.49949114253662863
```

I would like to further clarify the conclusion about the multi-threaded version. (Jupyter notebook)

```julia
using BenchmarkTools
using Random

function mcpi(N)
    rng = MersenneTwister()
    c = 0
    for _ in 1:N
        c += rand(rng)^2 + rand(rng)^2 ≤ 1
    end
    4c/N
end

println("Julia v", VERSION)
print("mcpi(10^8):")
@btime mcpi(10^8)
```

Julia v1.6.2
mcpi(10^8): 222.733 ms (12 allocations: 19.66 KiB)
3.14193032

A proper multi-threaded version of this gets a bit complicated:

```julia
using Base.Threads
using Distributed: splitrange
using Random

function pi_mcmc_julia5(N)
    a = Atomic{Int}(0)
    ranges = splitrange(1, N, nthreads())
    @threads for ran in ranges
        rng = MersenneTwister()
        c = 0
        for _ in ran
            c += rand(rng)^2 + rand(rng)^2 ≤ 1
        end
        atomic_add!(a, c)
    end
    4a[]/N
end

println("Julia v", VERSION)
print("pi_mcmc_julia5(10^8):")
@btime pi_mcmc_julia5(10^8)
```

Julia v1.6.2
pi_mcmc_julia5(10^8): 37.794 ms (207 allocations: 242.33 KiB)
3.14158332

223 ms → (12 threads) → 38 ms

The reason for the complexity is that it requires pre-processing and post-processing of the for loop in each thread.

So, a few months ago, I created the `@my_threads` macro to specify the pre-processing and post-processing in each thread.

```julia
# The following code is a modified version of
#
#
#

function _my_threadsfor(iter, lbody, prebody, postbody, schedule)
    lidx = iter.args[1]         # index
    range = iter.args[2]
    quote
        local threadsfor_fun
        let range = $(esc(range))
            function threadsfor_fun(onethread = false)
                r = range # Load into local variable
                lenr = length(r)
                # divide loop iterations among threads
                if onethread
                    tid = 1
                    len, rem = lenr, 0
                else
                    tid = Threads.threadid()
                    len, rem = divrem(lenr, Threads.nthreads())
                end
                # not enough iterations for all the threads?
                if len == 0
                    if tid > rem
                        return
                    end
                    len, rem = 1, 0
                end
                # compute this thread's iterations
                f = firstindex(r) + ((tid-1) * len)
                l = f + len - 1
                # distribute remaining iterations evenly
                if rem > 0
                    if tid <= rem
                        f = f + (tid-1)
                        l = l + tid
                    else
                        f = f + rem
                        l = l + rem
                    end
                end
                # run prebody
                $(esc(prebody))
                # run this thread's iterations
                for i = f:l
                    local $(esc(lidx)) = @inbounds r[i]
                    $(esc(lbody))
                end
                # run postbody
                $(esc(postbody))
            end
        end
        if Threads.threadid() != 1 || ccall(:jl_in_threaded_region, Cint, ()) != 0
            $(if schedule === :static
                  :(error("`@my_threads :static` can only be used from thread 1 and not nested"))
              else
                  # only use threads when called from thread 1, outside @threads
                  :(Base.invokelatest(threadsfor_fun, true))
              end)
        else
            Threads.threading_run(threadsfor_fun)
        end
        nothing
    end
end

"""
    @my_threads

A macro to parallelize a `for` loop to run with multiple threads.
It splits the iteration space among multiple tasks with `prebody` and `postbody`.
Usage:

    @my_threads begin
        prebody
    end for ...
        ...
    end begin
        postbody
    end
"""
macro my_threads(args...)
    na = length(args)
    if na == 4
        sched, prebody, ex, postbody = args
        if sched isa QuoteNode
            sched = sched.value
        elseif sched isa Symbol
            # for now only allow quoted symbols
            sched = nothing
        end
        if sched !== :static
            throw(ArgumentError("unsupported schedule argument in @my_threads"))
        end
    elseif na == 3
        sched = :default
        prebody, ex, postbody = args
    else
        throw(ArgumentError("wrong number of arguments in @my_threads"))
    end
    if !(isa(ex, Expr) && ex.head === :for)
        throw(ArgumentError("@my_threads requires a `for` loop expression"))
    end
    if !(ex.args[1] isa Expr && ex.args[1].head === :(=))
        throw(ArgumentError("nested outer loops are not currently supported by @my_threads"))
    end
    return _my_threadsfor(ex.args[1], ex.args[2], prebody, postbody, sched)
end
```

The `@my_threads` macro makes it simple to write the code of the multi-threaded version as below:

```julia
using Base.Threads
using Random

# multi-threaded version using @my_threads
function pi_mcmc_my_threads(N)
    a = Atomic{Int}(0)
    @my_threads begin
        rng = MersenneTwister()
        c = 0
    end for _ in 1:N
        c += rand(rng)^2 + rand(rng)^2 ≤ 1
    end begin
        atomic_add!(a, c)
    end
    4a[]/N
end

println("Julia v", VERSION)
```