Generators vs loops vs broadcasting: Calculate Pi via Monte Carlo sampling

Dear all, I’m trying to solve some toy problems in Julia and benchmark them, since there are often many different ways to go about them. I have a strong Matlab background, so going back to plain loops feels a bit weird (and I’d prefer not to, because of their verbosity). But if you tell me that’s the recommended way, I’ll give it a try :slight_smile:

Anyway, here are 9 different approaches to calculating Pi via Monte Carlo sampling (see BasicProblems (ucidatascienceinitiative.github.io)), and I’d be extremely happy if you could point out the best version or propose a better one.
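For reference, the estimator all variants implement: draw N points uniformly in the unit square; the fraction landing inside the quarter disc estimates its area, π/4.

```latex
x_i, y_i \sim \mathcal{U}(0,1), \qquad
\hat{\pi} = \frac{4}{N}\sum_{i=1}^{N} \mathbf{1}\!\left[x_i^2 + y_i^2 \le 1\right]
```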

problem4 = (
    function(N) #     1.039 ms (15 allocations: 5.36 MiB)
        r = rand(N,2)
        cnt = count(r[:,1].^2 + r[:,2].^2 .<= 1) # inside the circle
        pi = cnt/N * 4
    end
    ,
    function(N) #     1.167 ms (15 allocations: 5.36 MiB)
        r = rand(2,N)
        cnt = count(r[1,:].^2 + r[2,:].^2 .<= 1)
        pi = cnt/N * 4
    end
    ,    
    function(N) #     3.456 ms (7 allocations: 5.36 MiB)
        r = rand(2,N)
        cnt = count(norm.(eachcol(r)) .<= 1)
        pi = cnt/N * 4
    end
    ,    
    function(N) #     3.381 ms (7 allocations: 5.36 MiB)
        r = rand(N,2)
        cnt = count(norm.(eachrow(r)) .<= 1)
        pi = cnt/N * 4
    end
    ,
    function(N) #     2.568 ms (9 allocations: 3.83 MiB)
        r = rand(2,N)
        cnt = count(sum(abs2.(r), dims=1) .<= 1)
        pi = cnt/N * 4
    end
    ,
    function(N) #     1.383 ms (13 allocations: 3.83 MiB)
        r = rand(N,2)
        cnt = count(sum(abs2.(r), dims=2) .<= 1)
        pi = cnt/N * 4
    end
    ,
    function(N) #     20.271 ms (200000 allocations: 18.31 MiB)
        cnt = count(sum(rand(2).^2) <= 1 for i in 1:N )
        pi = cnt/N * 4
    end
    ,
    function(N) #     9.725 ms (100000 allocations: 9.16 MiB)
        cnt = count(i->norm(rand(2)) <= 1, 1:N) # no difference to count(norm(rand(2)) <= 1 for i in 1:N )
        pi = cnt/N * 4
    end
    ,
    function(N) #     14.534 ms (100000 allocations: 9.16 MiB)
        cnt = 0;
        for i in 1:N
            if norm(rand(2)) <= 1
                cnt += 1
            end
        end
        pi = cnt/N * 4
    end
    ,
)

using Random
using BenchmarkTools
using LinearAlgebra

n = 100000
res = []
for f in problem4
    Random.seed!(0)
    push!(res, f(n))
    @btime $f($n);
end

res

I’m quite unsure whether I should watch the total time or the number of allocations. Since allocating the full sample array will clearly hit its limits for large N, I’m looking for iterative solutions, but all generator- or loop-based solutions are slower (in this example).

Which one would you choose, based on performance and (imo also very important) readability?

Thanks,
Jan

You forgot to include the fastest solution, which removes the allocations entirely by sticking to scalar calculations:

function f(N)
    cnt = 0;
    for i in 1:N
        if rand()^2 + rand()^2 <= 1
            cnt += 1
        end
    end
    pi = cnt/N * 4
end

using BenchmarkTools
n = 100000
@btime $f($n) # 611.200 μs (0 allocations: 0 bytes)

https://github.com/JuliaSIMD/VectorizedRNG.jl could be a nice asset for optimizing this example as well.

8 Likes

If you don’t want loops, this is also non-allocating, and I find it quite readable:

f(N) = sum(rand()^2 + rand()^2 <= 1 for _ in 1:N) / N * 4

Unfortunately, this is about two times slower than the explicit loop on my machine.

2 Likes

For small N, you can build a data structure that informs the compiler of the size of what you are doing, and the function will specialize to that size, possibly being very fast and non-allocating:

# original
julia> function f(N)
           r = rand(N,2)
           cnt = count(r[:,1].^2 + r[:,2].^2 .<= 1)
           pi = cnt/N * 4
       end
f (generic function with 2 methods)

julia> @btime f(10)
  285.348 ns (8 allocations: 1.14 KiB)
4.0

# now the trick
julia> using StaticArrays

julia> struct MyInteger{N}
         i::Int
       end

julia> function f(n::MyInteger{N}) where N
         r = rand(SMatrix{N,2,Float64})
         cnt = count(r[:,1].^2 + r[:,2].^2 .<= 1)
         pi = cnt/N * 4
       end
f (generic function with 2 methods)

julia> n = MyInteger{10}(10)
MyInteger{10}(10)

julia> @btime f($n)
  43.885 ns (0 allocations: 0 bytes)
2.4

For large N the good thing about writing loops is that you can parallelize them:

julia> using FLoops

julia> function f(N)
           @floop for i in 1:N
               if rand()^2 + rand()^2 <= 1
                   @reduce(cnt += 1)
               end
           end
           pi = cnt/N * 4
       end
f (generic function with 1 method)

julia> @btime f(100000)
  317.370 μs (111 allocations: 5.75 KiB)
3.13936


Cool, so many answers, thank you very much for that!

@ChrisRackauckas, thanks for that version. It’s indeed pretty fast on my machine. However, I don’t understand why rand()^2 + rand()^2 does not allocate memory whereas norm(rand(2)) does. Do scalar-returning functions not need memory allocation?

I also tried VectorizedRNG (local_rng() comes from that package); the vectorized version is a bit faster:

function(N) #     993 µs, 3.83 MiB
    r = rand(local_rng(), N, 2)
    cnt = count(sum(abs2.(r), dims=2) .<= 1)
    pi = cnt/N * 4
end
,
function(N) #     1.1 ms (13 allocations: 3.83 MiB)
    r = rand(N, 2)
    cnt = count(sum(abs2.(r), dims=2) .<= 1)
    pi = cnt/N * 4
end

But the scalar loop version is much faster (it’s now the fastest variant):

function(N) #  813.600 μs (0 allocations: 0 bytes)
    cnt = 0
    for i in 1:N
        if rand()^2 + rand()^2 <= 1
            cnt += 1
        end
    end
    pi = cnt/N * 4
end
,
function(N) # 364.900 μs (0 allocations: 0 bytes)
    cnt = 0
    for i in 1:N
        if rand(local_rng())^2 + rand(local_rng())^2 <= 1
            cnt += 1
        end
    end
    pi = cnt/N * 4
end

Very interesting!

@yha, your version is indeed fast (1.5 ms, 0 allocations), and it doesn’t matter whether sum or count is used. I like that one, thanks! Btw, with VectorizedRNG it needs 900 µs, which is nice.
Crazy that count(sum(rand(2).^2) <= 1 for i in 1:N) needs 20 ms and 18 MiB :smiley:

@lmiq, that’s very interesting, but if I understand it right, the threshold for “small” N needs to be investigated for each problem, right?

Regarding parallelization, I fully get the point of good ol’ for loops there :slight_smile:

Edit: the parallel loop makes things worse:

function(N) # 364.900 μs (0 allocations: 0 bytes)
    cnt = 0
    for i in 1:N
        if rand(local_rng())^2 + rand(local_rng())^2 <= 1
            cnt += 1
        end
    end
    pi = cnt/N * 4
end
,
function(N) # 970.800 μs (12 allocations: 336 bytes)
    cnt = 0
    @floop for i in 1:N
        if rand(local_rng())^2 + rand(local_rng())^2 <= 1
            @reduce(cnt += 1)
        end
    end
    pi = cnt/N * 4
end

Are there any rules of thumb for when to use or when to avoid vectorized code? It seems like vectorized/broadcast code, or non-scalar code in generators, is not the best idea when it causes additional allocations.

Thanks for your valuable input! If there is more I can try out or optimize please don’t hesitate to tell me :wink:

Thanks,
Jan

Actually, that is about the limit where static arrays are useful: roughly 100 elements.

About the more general question on memory allocation, here are my own notes (from the perspective of a Fortran programmer, not sure how helpful they will be): Mutability · JuliaNotes.jl.

What allocates there is rand(2), which creates a mutable vector of length 2. By contrast, rand(SVector{2,Float64}) creates an immutable vector of length 2, which is not allocated (on the heap). Scalars are always immutable, so they never allocate (as with rand()).
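To make that concrete, here’s a minimal sketch (assuming StaticArrays is installed; the function names are mine, just for illustration):

```julia
using LinearAlgebra, StaticArrays

# rand(2) heap-allocates a fresh Vector{Float64} on every call:
in_circle_heap() = norm(rand(2)) <= 1

# rand(SVector{2,Float64}) returns an immutable, fixed-size vector
# that can live on the stack, so no heap allocation is needed:
in_circle_static() = norm(rand(SVector{2,Float64})) <= 1

# Same Monte Carlo estimate, built on the non-allocating sampler:
estimate_pi(N) = 4 * count(_ -> in_circle_static(), 1:N) / N
```

After a warm-up call, @allocated in_circle_static() should report 0 bytes, while the rand(2) version reports a small allocation on every call.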

Did you start Julia with julia -t auto (or julia -t 4, where 4 is the number of threads)?

1 Like

You can write a fast multithreaded version like this.

using .Threads
using Random          # for MersenneTwister
using BenchmarkTools  # for @btime

calc_pi(M) = (rng=MersenneTwister(); sum(rand(rng)^2 + rand(rng)^2 < 1 for i=1:M) / M)

function calc_pi_parallel(M)
    nt = nthreads()
    ps = zeros(nt)
    @threads for i in 1:nt
        ps[i] += calc_pi(ceil(Int, M/nt))  # index by the loop variable, not threadid(), to stay safe if tasks migrate
    end
    4 * sum(ps) / nt  # average value
end

@btime calc_pi_parallel(100_000)
  91.200 μs (138 allocations: 161.55 KiB)
3.1550400000000005

Make sure that you start Julia with the --threads=auto flag for this to work correctly.
The serial version can also be written like this and is quite fast:

function calc_pi_serial(N) 
    cnt = 0
    rng = MersenneTwister()
    for i in 1:N
        cnt += rand(rng)^2 + rand(rng)^2 < 1
    end
    return cnt/N * 4
end

  288.900 μs (12 allocations: 19.66 KiB)

It looks like we get suboptimal performance because of an unlucky inlining decision. Indeed, it performs quite well if we add an inline hint to the summand function:

julia> @inline sample(_) = rand()^2 + rand()^2 <= 1;

julia> f_sum(N) = sum(sample, 1:N) / N * 4;

julia> @btime f_sum(100000);
  715.944 μs (0 allocations: 0 bytes)

which is comparable to the manual for loop

julia> function f_for(N)
           cnt = 0;
           for i in 1:N
               if rand()^2 + rand()^2 <= 1
                   cnt += 1
               end
           end
           pi = cnt/N * 4
       end;

julia> @btime f_for(100000);
  712.774 μs (0 allocations: 0 bytes)

and faster than the generator version

julia> f_gen(N) = sum(rand()^2 + rand()^2 <= 1 for _ in 1:N) / N * 4;

julia> @btime f_gen(100000);
  1.092 ms (0 allocations: 0 bytes)

I think constructing algorithms by composing well-known functions like sum is a great approach, since parallelizing is then as easy as swapping the implementation of sum. For example, by slapping Folds. in front of it:

Folds.sum(sample, 1:N) / N * 4

which also works on GPU

using FoldsCUDA
Folds.sum(sample, 1:N, CUDAEx()) / N * 4

and distributed machines

Folds.sum(sample, 1:N, DistributedEx()) / N * 4

7 Likes