# Generators vs loops vs broadcasting: Calculate PI via Monte Carlo Sampling

Dear all, I’m trying to solve some toy problems in Julia and benchmark them, since there are often many different ways to go. I have a strong Matlab background, so it feels a bit weird to go back to simple loops (and I would prefer not to) due to their verbosity. But if you tell me that’s the recommended way, I’ll give it a try. Here are 9 different approaches to calculate Pi via Monte Carlo sampling (see BasicProblems (ucidatascienceinitiative.github.io)), and I’d be extremely happy if you could point out the best version or propose your own way to solve the problem.

``````problem4 = (
    function(N) #     1.039 ms (15 allocations: 5.36 MiB)
        r = rand(N,2)
        cnt = count(r[:,1].^2 + r[:,2].^2 .<= 1) # inside the circle
        pi = cnt/N * 4
    end
    ,
    function(N) #     1.167 ms (15 allocations: 5.36 MiB)
        r = rand(2,N)
        cnt = count(r[1,:].^2 + r[2,:].^2 .<= 1)
        pi = cnt/N * 4
    end
    ,
    function(N) #     3.456 ms (7 allocations: 5.36 MiB)
        r = rand(2,N)
        cnt = count(norm.(eachcol(r)) .<= 1)
        pi = cnt/N * 4
    end
    ,
    function(N) #     3.381 ms (7 allocations: 5.36 MiB)
        r = rand(N,2)
        cnt = count(norm.(eachrow(r)) .<= 1)
        pi = cnt/N * 4
    end
    ,
    function(N) #     2.568 ms (9 allocations: 3.83 MiB)
        r = rand(2,N)
        cnt = count(sum(abs2.(r), dims=1) .<= 1)
        pi = cnt/N * 4
    end
    ,
    function(N) #     1.383 ms (13 allocations: 3.83 MiB)
        r = rand(N,2)
        cnt = count(sum(abs2.(r), dims=2) .<= 1)
        pi = cnt/N * 4
    end
    ,
    function(N) #     20.271 ms (200000 allocations: 18.31 MiB)
        cnt = count(sum(rand(2).^2) <= 1 for i in 1:N)
        pi = cnt/N * 4
    end
    ,
    function(N) #     9.725 ms (100000 allocations: 9.16 MiB)
        cnt = count(i -> norm(rand(2)) <= 1, 1:N) # no difference to count(norm(rand(2)) <= 1 for i in 1:N)
        pi = cnt/N * 4
    end
    ,
    function(N) #     14.534 ms (100000 allocations: 9.16 MiB)
        cnt = 0
        for i in 1:N
            if norm(rand(2)) <= 1
                cnt += 1
            end
        end
        pi = cnt/N * 4
    end
    ,
)

using Random
using BenchmarkTools
using LinearAlgebra

n = 100000
res = []
for f in problem4
    Random.seed!(0)
    push!(res, f(n))
    @btime $f($n);
end

res
``````

I’m quite unsure whether I should watch the total time or the number of allocations. Since it’s clear that preallocation will hit limits for large `N`, I’m looking for iterative solutions, but all generator- or explicit-loop-based solutions are slower (in this example).

Which one would you choose, based on performance and (imo also very important) readability?

Thanks,
Jan

You forgot to include the fastest solution, which fully removes the allocations by sticking to scalar calculations:

``````function f(N)
    cnt = 0
    for i in 1:N
        if rand()^2 + rand()^2 <= 1
            cnt += 1
        end
    end
    pi = cnt/N * 4
end

using BenchmarkTools
n = 100000
@btime $f($n) # 611.200 μs (0 allocations: 0 bytes)
``````

GitHub - JuliaSIMD/VectorizedRNG.jl (vectorized uniform and normal random samplers) could be a nice asset for optimizing this example as well.

8 Likes

If you don’t want loops, this is also not allocating, and I find it quite readable:

``````f(N) = sum(rand()^2 + rand()^2 <= 1 for _ in 1:N) / N * 4
``````

Unfortunately, this is about two times slower than the explicit loop on my machine.

2 Likes

For small N, you can build a data structure that informs the compiler of the size of what you are doing; the function will then specialize to that size, possibly becoming very fast and non-allocating:

``````# original
julia> function f(N)
           r = rand(N,2)
           cnt = count(r[:,1].^2 + r[:,2].^2 .<= 1)
           pi = cnt/N * 4
       end
f (generic function with 2 methods)

julia> @btime f(10)
  285.348 ns (8 allocations: 1.14 KiB)
4.0

# now the trick
julia> using StaticArrays

julia> struct MyInteger{N}
           i::Int
       end

julia> function f(n::MyInteger{N}) where N
           r = rand(SMatrix{N,2,Float64})
           cnt = count(r[:,1].^2 + r[:,2].^2 .<= 1)
           pi = cnt/N * 4
       end
f (generic function with 2 methods)

julia> n = MyInteger{10}(10)
MyInteger{10}(10)

julia> @btime f($n)
  43.885 ns (0 allocations: 0 bytes)
2.4
``````

For large N the good thing about writing loops is that you can parallelize them:

``````julia> using FLoops

julia> function f(N)
           @floop for i in 1:N
               if rand()^2 + rand()^2 <= 1
                   @reduce(cnt += 1)
               end
           end
           pi = cnt/N * 4
       end
f (generic function with 1 method)

julia> @btime f(100000)
  317.370 μs (111 allocations: 5.75 KiB)
3.13936
``````

Cool, so many answers, thank you very much for that!

@ChrisRackauckas, thanks for that version; it’s indeed pretty fast on my machine. However, I don’t understand why `rand()^2 + rand()^2` does not allocate memory whereas `norm(rand(2))` does. Do scalar-returning functions not need memory allocation?

I also tried the Vectorized RNG, it’s a bit faster in vectorized version:

``````function(N) #     993 µs, 3.83 MiB
    r = rand(local_rng(),N,2)
    cnt = count(sum(abs2.(r), dims=2) .<= 1)
    pi = cnt/N * 4
end
,
function(N) #     1.1 ms (13 allocations: 3.83 MiB)
    r = rand(N,2)
    cnt = count(sum(abs2.(r), dims=2) .<= 1)
    pi = cnt/N * 4
end
``````

But much faster in scalar loop version (it’s now the fastest variant):

``````function(N) #  813.600 μs (0 allocations: 0 bytes)
    cnt = 0
    for i in 1:N
        if rand()^2 + rand()^2 <= 1
            cnt += 1
        end
    end
    pi = cnt/N * 4
end
,
function(N) # 364.900 μs (0 allocations: 0 bytes)
    cnt = 0
    for i in 1:N
        if rand(local_rng())^2 + rand(local_rng())^2 <= 1
            cnt += 1
        end
    end
    pi = cnt/N * 4
end
``````

Very interesting!

@yha, your version is indeed fast (1.5 ms, 0 allocations), and it does not matter whether `sum` or `count` is used. I like that one, thanks! Btw, with VectorizedRNG it needs 900 µs, which is nice. Crazy that `count(sum(rand(2).^2) <= 1 for i in 1:N)` needs 20 ms and 18 MiB!

@lmiq, that’s very interesting, but if I understand it right, the limit for “small” N needs to be investigated for each problem, right?
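For reference, here is a sketch of the two equivalent one-liners I compared (the names `pi_sum`/`pi_count` are mine, not from the thread):

``````julia
# @yha's generator version with sum
pi_sum(N) = sum(rand()^2 + rand()^2 <= 1 for _ in 1:N) / N * 4

# the count-based variant; timings were the same for me
pi_count(N) = count(_ -> rand()^2 + rand()^2 <= 1, 1:N) / N * 4
``````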

Regarding parallelization, I fully get the point of good ol’ for loops there. Edit: the parallel loops actually make things worse:

``````function(N) # 364.900 μs (0 allocations: 0 bytes)
    cnt = 0
    for i in 1:N
        if rand(local_rng())^2 + rand(local_rng())^2 <= 1
            cnt += 1
        end
    end
    pi = cnt/N * 4
end
,
function(N) # 970.800 μs (12 allocations: 336 bytes)
    cnt = 0
    @floop for i in 1:N
        if rand(local_rng())^2 + rand(local_rng())^2 <= 1
            @reduce(cnt += 1)
        end
    end
    pi = cnt/N * 4
end
``````

Are there any rules of thumb for when to use or avoid vectorized code? It seems like vectorized/broadcasted code, or non-scalar code in generators, is not the best idea when it causes additional allocations.

Thanks for your valuable input! If there is more I can try out or optimize, please don’t hesitate to tell me. Thanks,
Jan

Actually, that is the limit where static arrays are useful, meaning about 100 elements.

About the more general question on memory allocation, here are my own notes (from a perspective of a Fortran programmer, not sure how helpful they will be): Immutable variables · JuliaNotes.jl.

What allocates there is `rand(2)`, which creates a mutable vector of length 2. By contrast, `rand(SVector{2,Float64})` creates an immutable vector of length 2, which is not allocated on the heap. Scalars are always immutable, thus they never allocate (as with `rand()`).
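To make the difference concrete, here is a minimal sketch (the function names are mine, not from the thread) contrasting the heap-allocating `rand(2)` with the stack-allocated `rand(SVector{2,Float64})`:

``````julia
using StaticArrays, LinearAlgebra, BenchmarkTools

# allocates a fresh Vector{Float64} on the heap for every sample
inside_heap(N) = count(_ -> norm(rand(2)) <= 1, 1:N)

# SVector is immutable and stays off the heap: no allocation at all
inside_static(N) = count(_ -> norm(rand(SVector{2,Float64})) <= 1, 1:N)

@btime inside_heap(10^5)    # reports ~100000 allocations
@btime inside_static(10^5)  # reports 0 allocations
``````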

Did you start Julia with `julia -t auto` (or `julia -t 4`, where `4` is the number of threads)?
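A quick way to check from within a running session:

``````julia
using Base.Threads

# returns 1 when Julia was started without the -t/--threads flag,
# in which case @floop/@threads fall back to serial execution
nthreads()
``````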

1 Like

You can write a fast multithreaded version like this.

``````using Base.Threads, Random

calc_pi(M) = (rng = MersenneTwister(); sum(rand(rng)^2 + rand(rng)^2 < 1 for i = 1:M) / M)

function calc_pi_parallel(M)
    nt = nthreads()
    ps = zeros(nt)
    @threads for t in 1:nt
        ps[t] = calc_pi(M ÷ nt)  # each thread estimates on its own chunk
    end
    4 * sum(ps) / nt  # average value
end

@btime calc_pi_parallel(100_000)
91.200 μs (138 allocations: 161.55 KiB)
3.1550400000000005
``````

Make sure that you start Julia with the `--threads=auto` flag for this to work correctly.
The serial version could also be written like this and is quite fast.

``````using Random

function calc_pi_serial(N)
    cnt = 0
    rng = MersenneTwister()
    for i in 1:N
        cnt += rand(rng)^2 + rand(rng)^2 < 1
    end
    return cnt/N * 4
end

@btime calc_pi_serial(100_000)
  288.900 μs (12 allocations: 19.66 KiB)
``````

It looks like we have suboptimal performance because of an unlucky inlining decision. Indeed, it performs quite well if we add an inline hint to the summand function:

``````julia> @inline sample(_) = rand()^2 + rand()^2 <= 1;

julia> f_sum(N) = sum(sample, 1:N) / N * 4;

julia> @btime f_sum(100000);
715.944 μs (0 allocations: 0 bytes)
``````

which is comparable to the manual `for` loop

``````julia> function f_for(N)
           cnt = 0
           for i in 1:N
               if rand()^2 + rand()^2 <= 1
                   cnt += 1
               end
           end
           pi = cnt/N * 4
       end;

julia> @btime f_for(100000);
712.774 μs (0 allocations: 0 bytes)
``````

and faster than the generator version

``````julia> f_gen(N) = sum(rand()^2 + rand()^2 <= 1 for _ in 1:N) / N * 4;

julia> @btime f_gen(100000);
1.092 ms (0 allocations: 0 bytes)
``````

I think constructing algorithms by composing well-known functions like `sum` is a great approach, since parallelization is as easy as swapping the implementation of `sum`. For example, by slapping `Folds.` in front of it:

``````Folds.sum(sample, 1:N) / N * 4
``````

which also works on GPU

``````using FoldsCUDA
Folds.sum(sample, 1:N, CUDAEx()) / N * 4
``````

and distributed machines

``````Folds.sum(sample, 1:N, DistributedEx()) / N * 4
``````
7 Likes