Speed up Julia code for simple Monte Carlo Pi estimation (compared to Numba)

By the way, there’s no need to do this. BenchmarkTools @btime already takes care of this for you.

I’m pretty sure this is yet another example of https://github.com/JuliaLang/julia/issues/15276 due to the closure created by @threads. You can see the problem by doing:

julia> @code_warntype estimate_pi_thread(1000)
Variables
  #self#::Core.Const(estimate_pi_thread)
  nMC::Int64
  threadsfor_fun::var"#25#threadsfor_fun#8"{Float64, Float64, UnitRange{Int64}}
  n_circle@_4::Core.Box

Using the usual Ref trick (so the closure captures a concretely typed Ref instead of a Core.Box) solves that problem:

function estimate_pi_thread(nMC)
    radius = 1.
    diameter = 2. * radius
    n_circle = Ref(0)
    Threads.@threads for i in 1:nMC
        x = (rand() - 0.5) * diameter
        y = (rand() - 0.5) * diameter
        r = sqrt(x^2 + y^2)
        if r <= radius
            n_circle[] += 1
        end
    end
    return (n_circle[] / nMC) * 4.
end

With that change, I see the following with julia --threads=1 (only one thread to work with):

julia> @btime estimate_pi($nMC2)
  88.248 ms (0 allocations: 0 bytes)
3.1414028

julia> @btime estimate_pi_thread($nMC2)
  137.082 ms (7 allocations: 576 bytes)
3.1417592

With julia --threads=2, I see:

julia> @btime estimate_pi_thread($nMC2)
  114.105 ms (12 allocations: 1.00 KiB)
1.7710592

and with --threads=4 I see:

julia> @btime estimate_pi_thread($nMC2)
  110.060 ms (22 allocations: 1.89 KiB)
0.932612

so indeed there’s not much benefit from this particular manner of threading. I would guess that there’s too little work being done in each iteration for the overhead of threading to make sense.

4 Likes

The threaded code is not correct, because you are concurrently adding to n_circle. You should at least save the per-thread counts in an array:

julia> # threaded version (think this is the right approach)
       function estimate_pi_thread(nMC)
           radius = 1.
           diameter = 2. * radius
           n_circle = zeros(Int,Threads.nthreads())
           Threads.@threads for i in 1:nMC
               x = (rand() - 0.5) * diameter
               y = (rand() - 0.5) * diameter
               r = sqrt(x^2 + y^2)
               if r <= radius
                   n_circle[Threads.threadid()] += 1
               end
           end
           return (sum(n_circle) / nMC) * 4.
       end
estimate_pi_thread (generic function with 1 method)

That seems to solve most of the problem:

julia> @btime estimate_pi_thread(nMC2)
  76.575 ms (23 allocations: 2.00 KiB)
3.1414848

julia> @btime estimate_pi($nMC2)
  103.041 ms (0 allocations: 0 bytes)
3.1408116
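
To see the race directly, here is a minimal sketch (hypothetical racy_count name; start julia with more than one thread):

function racy_count(n)
    c = Ref(0)
    Threads.@threads for _ in 1:n
        c[] += 1  # read-modify-write on the shared Ref is not atomic, so increments get lost
    end
    return c[]
end

# with several threads, racy_count(10^7) typically returns far less than 10^7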

(I think the Numba code is not correct either)

7 Likes

Good point! I didn’t even notice that the answers I was getting with the threaded code were completely wrong. @smpurkis the fact that you got correct answers in your threaded version might suggest that you aren’t actually using multiple threads at all. Are you starting julia with julia --threads=<something> ?

Also, there are better parallelization models for a loop with small computations inside, like that of FLoops:

julia> using FLoops
       # threaded version (think this is the right approach)
       function estimate_pi_floop(nMC)
           radius = 1.
           diameter = 2. * radius
           @floop for i in 1:nMC
               x = (rand() - 0.5) * diameter
               y = (rand() - 0.5) * diameter
               r = sqrt(x^2 + y^2)
               if r <= radius
                   @reduce(n_circle += 1)
               end
           end
           return (n_circle / nMC) * 4.
       end
estimate_pi_floop (generic function with 1 method)


julia> @btime estimate_pi_floop($nMC2)
  24.345 ms (58 allocations: 2.80 KiB)
3.1413176


10 Likes

Aside from the threading, there are a few other tricks you can use:

  • Drop the sqrt and compare with the radius squared (this goes for Python too)
  • Explicitly pass the rng to rand
  • Drop the comparison branch

using Random: default_rng
function est_pi(N)
    radius = 1.0
    rad2 = radius^2
    diameter = 2 * radius
    n_circle = 0
    rng = default_rng()
    for i in 1:N
        x = (rand(rng) - 0.5) * diameter
        y = (rand(rng) - 0.5) * diameter
        r2 = x^2 + y^2  # look, no sqrt!
        n_circle += (r2 < rad2)  # no branch
    end
    return 4 * n_circle / N
end
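
A quick usage sketch (no timings shown, since they vary by machine; note the $ interpolation of the global into @btime):

julia> @btime est_pi($nMC2)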

6 Likes

I set export JULIA_NUM_THREADS = 4 before running the code in the REPL. I presumed that would take effect in the REPL.

It should work. Maybe those spaces around the = are wrong? (In bash they cause an error.)

you can just use:

julia -t4

1 Like

Flip, you are totally right! Removing the spaces now gives a wrong estimate:

julia> @btime pi_est = estimate_pi_thread(nMC2)
  367.631 ms (7851824 allocations: 119.81 MiB)
0.899592

By the way, you can check whether you are using threads within the Julia session by calling Threads.nthreads().
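
For example, after starting with julia -t4:

julia> Threads.nthreads()
4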

1 Like

Thank you for the tip! Will take a look at the FLoops package. I'm still very new to Julia; I only recently heard about LoopVectorization and tried it on this using @avx, but it gave incorrect results. Still learning when to use which optimisations.

That will also sort of parallelize the loop, so you will have the same concurrency problem (and wrong results). Loop vectorization should be used when the operations are independent across iterations of the loop.
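
For reference, here is a sketch of an @avx-friendly formulation (hypothetical estimate_pi_avx2 name; it trades the in-loop rand for two temporary arrays, so I am not claiming it is faster):

using LoopVectorization

function estimate_pi_avx2(nMC)
    # draw all the random coordinates up front, outside the vectorized loop
    x = rand(nMC)
    y = rand(nMC)
    n_circle = 0
    @avx for i in 1:nMC
        # each iteration reads only x[i] and y[i], so iterations are independent
        n_circle += ifelse((x[i] - 0.5)^2 + (y[i] - 0.5)^2 <= 0.25, 1, 0)
    end
    return 4 * n_circle / nMC
end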

Thank you for these tips! I didn’t realize there were such simple optimisations for this. I mostly used the code from Python+Numba vs. Julia

Take a look at this Twitter thread:
https://twitter.com/markkitti/status/1232520018644131840

3 Likes

I tried it like this:

using LoopVectorization
using BenchmarkTools

function XYinCircle(radius, diameter)
    x = (rand() - 0.5) * diameter
    y = (rand() - 0.5) * diameter
    r = sqrt(x^2 + y^2)
    if r <= radius
        return 1
    end
    return 0
end

function estimate_pi_avx(nMC)
    radius = 1.
    diameter = 2. * radius
    n_circles = Vector{Bool}(undef, nMC)
    @avx for i in 1:nMC
        n_circles[i] = XYinCircle(radius, diameter)  # independent from each other
    end
    n_circle = sum(n_circles)
    return (n_circle / nMC) * 4.
end

In REPL:

julia> estimate_pi_avx(100000)
4.0

It seems n_circles is either all 0s or all 1s when using @avx.

Thanks, have taken a look, it is amazing that Julia can be that fast! Have to say the optimised Julia version goes waaaay over my head XD

It can be annoying, but multithreaded rand is (currently) fastest if you manually hoist out the access to the thread-local global RNG:
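
Something like this minimal sketch (illustrative name; Random.default_rng() returns the thread/task-local RNG on recent Julia versions):

using Random

function sum_rand_hoisted(n)
    rng = Random.default_rng()  # look up the thread-local RNG once, outside the loop
    s = 0.0
    for _ in 1:n
        s += rand(rng)          # rand(rng) skips the per-call lookup that plain rand() does
    end
    return s
end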

3 Likes

Indeed, that speeds up both alternatives by a factor of 2. Copying the solution from here: Random number and parallel execution - #16 by greg_plowman

We have:

julia> Threads.nthreads()
4

julia> @btime estimate_pi_thread(10000000)
  35.675 ms (22 allocations: 1.98 KiB)
3.1413444

julia> @btime estimate_pi_floop(10000000)
  12.244 ms (57 allocations: 2.77 KiB)
3.1405184

But the code gets quite tedious for that:

Code
using Random, Future, BenchmarkTools, FLoops

function parallel_rngs(rng::MersenneTwister, n::Integer)
    step = big(10)^20
    rngs = Vector{MersenneTwister}(undef, n)
    rngs[1] = copy(rng)
    for i = 2:n
        rngs[i] = Future.randjump(rngs[i-1], step)
    end
    return rngs
end

const N = Threads.nthreads() * 10^8
const rng = MersenneTwister();
const rngs = parallel_rngs(MersenneTwister(), Threads.nthreads());

function estimate_pi_thread(nMC)
    radius = 1.
    diameter = 2. * radius 
    n_circle = zeros(Int,Threads.nthreads())
    Threads.@threads for i in 1:nMC
        rng = rngs[Threads.threadid()]
        x = (rand(rng) - 0.5) * diameter
        y = (rand(rng) - 0.5) * diameter
        r = sqrt(x^2 + y^2) 
        if r <= radius 
            n_circle[Threads.threadid()] += 1
        end    
    end    
    return (sum(n_circle) / nMC) * 4.
end 

function estimate_pi_floop(nMC)
    radius = 1.
    diameter = 2. * radius
    @floop for i in 1:nMC
        rng = rngs[Threads.threadid()]
        x = (rand(rng) - 0.5) * diameter
        y = (rand(rng) - 0.5) * diameter
        r = sqrt(x^2 + y^2)
        if r <= radius
            @reduce(n_circle += 1)
        end
    end
    return (n_circle / nMC) * 4.
end

1 Like

Probably they aren’t, since rand() generates a sequence. I think you cannot use loop vectorization with a rand inside. Others may be more specific about that, but maybe that could simply error out.

Look how things don’t go well:

julia> function f() # with avx
         s = 0.
         @avx for i in 1:100
            s += rand()
         end
         s/100
       end
f (generic function with 1 method)

julia> f()  # results vary a lot
0.7592638464484608

julia> f()
0.4069099370815508

julia> f()
0.8003393396344347

julia> function f() # no avx
         s = 0.
         for i in 1:100
            s += rand()
         end
         s/100
       end
f (generic function with 1 method)

julia> f() # always close to 0.5
0.5316257501143982

julia> f()
0.5244277946478424

julia> f()
0.47138753152473184

julia> f()
0.49949114253662863

2 Likes

I would like to further clarify the conclusion about the multi-threaded version. (Jupyter notebook)

Single-threaded version:

using BenchmarkTools
using Random

function mcpi(N)
    rng = MersenneTwister()
    c = 0
    for _ in 1:N
        c += rand(rng)^2 + rand(rng)^2 ≤ 1
    end
    4c/N
end

println("Julia v", VERSION)
print("mcpi(10^8):")
@btime mcpi(10^8)

Julia v1.6.2
mcpi(10^8): 222.733 ms (12 allocations: 19.66 KiB)
3.14193032

A proper multi-threaded version of this gets a bit complicated:

using Base.Threads
using Distributed: splitrange
using Random

function pi_mcmc_julia5(N)
    ranges = splitrange(1, N, nthreads())
    a = Atomic{Int}(0)
    @threads for ran in ranges
        rng = MersenneTwister()
        c = 0
        for _ in ran
            c += rand(rng)^2 + rand(rng)^2 ≤ 1
        end
        atomic_add!(a, c)
    end
    4a[]/N
end

println("Julia v", VERSION)
@show Threads.nthreads()
print("pi_mcmc_julia5(10^8):")
@btime pi_mcmc_julia5(10^8)

Julia v1.6.2
Threads.nthreads() = 12
pi_mcmc_julia5(10^8): 37.794 ms (207 allocations: 242.33 KiB)
3.14158332

223 ms → (12 threads) → 38 ms

The reason for the complexity is that it requires pre-processing and post-processing of the for loop in each thread.
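
For comparison, the same per-thread pre/post structure can be written by hand with Threads.@spawn (a sketch; mcpi_spawn and the chunking are illustrative, not from the original post):

using Random

function mcpi_spawn(N)
    nt = Threads.nthreads()
    chunk = cld(N, nt)
    tasks = map(1:nt) do t
        Threads.@spawn begin
            rng = MersenneTwister()   # pre-processing: one RNG per task
            c = 0
            for _ in 1:min(chunk, N - (t - 1) * chunk)
                c += rand(rng)^2 + rand(rng)^2 ≤ 1
            end
            c                         # post-processing: hand back the per-task count
        end
    end
    return 4 * sum(fetch, tasks) / N
end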

So, a few months ago, I created the @my_threads macro to specify the pre-processing and post-processing in each thread.

Definition of `@my_threads` (long)
# The following code is a modified version of
# 
#    function _threadsfor(iter, lbody, schedule) 
#    macro threads(args...)
#
# in https://github.com/JuliaLang/julia/blob/9f3265399227fbfc4f0160ec3592a9262bd3eb5f/base/threadingconstructs.jl
#
# Its license is MIT: https://julialang.org/license

using Base.Threads
using Base.Threads: threading_run

function _my_threadsfor(iter, lbody, prebody, postbody, schedule)
    lidx = iter.args[1]         # index
    range = iter.args[2]
    quote
        local threadsfor_fun
        let range = $(esc(range))
        function threadsfor_fun(onethread=false)
            r = range # Load into local variable
            lenr = length(r)
            # divide loop iterations among threads
            if onethread
                tid = 1
                len, rem = lenr, 0
            else
                tid = threadid()
                len, rem = divrem(lenr, nthreads())
            end
            # not enough iterations for all the threads?
            if len == 0
                if tid > rem
                    return
                end
                len, rem = 1, 0
            end
            # compute this thread's iterations
            f = firstindex(r) + ((tid-1) * len)
            l = f + len - 1
            # distribute remaining iterations evenly
            if rem > 0
                if tid <= rem
                    f = f + (tid-1)
                    l = l + tid
                else
                    f = f + rem
                    l = l + rem
                end
            end
            # run prebody
            $(esc(prebody))
            # run this thread's iterations
            for i = f:l
                local $(esc(lidx)) = @inbounds r[i]
                $(esc(lbody))
            end
            # run postbody
            $(esc(postbody))
        end
        end
        if threadid() != 1 || ccall(:jl_in_threaded_region, Cint, ()) != 0
            $(if schedule === :static
              :(error("`@my_threads :static` can only be used from thread 1 and not nested"))
              else
              # only use threads when called from thread 1, outside @threads
              :(Base.invokelatest(threadsfor_fun, true))
              end)
        else
            threading_run(threadsfor_fun)
        end
        nothing
    end
end

"""
    @my_threads
A macro to parallelize a `for` loop to run with multiple threads. 
It splits the iteration space among multiple tasks with `prebody` and `postbody`.
It runs those tasks on threads according to a scheduling policy.
Usage:
```julia
@my_threads [schedule] begin
    prebody
end for ...
    ...
end begin
    postbody
end

```
"""
macro my_threads(args...)
    na = length(args)
    if na == 4
        sched, prebody, ex, postbody = args
        if sched isa QuoteNode
            sched = sched.value
        elseif sched isa Symbol
            # for now only allow quoted symbols
            sched = nothing
        end
        if sched !== :static
            throw(ArgumentError("unsupported schedule argument in @my_threads"))
        end
    elseif na == 3
        sched = :default
        prebody, ex, postbody = args
    else
        throw(ArgumentError("wrong number of arguments in @my_threads"))
    end
    if !(isa(ex, Expr) && ex.head === :for)
        throw(ArgumentError("@my_threads requires a for loop expression"))
    end
    if !(ex.args[1] isa Expr && ex.args[1].head === :(=))
        throw(ArgumentError("nested outer loops are not currently supported by @my_threads"))
    end
    return _my_threadsfor(ex.args[1], ex.args[2], prebody, postbody, sched)
end

The @my_threads macro makes it simple to write the multi-threaded version, as below:

using Random

function mcpi_my_threads(N)
    a = Atomic{Int}(0)
    @my_threads begin
        rng = MersenneTwister()
        c = 0
    end for _ in 1:N
        c += rand(rng)^2 + rand(rng)^2 ≤ 1
    end begin
        atomic_add!(a, c)
    end
    4a[]/N
end

println("Julia v", VERSION)
@show Threads.nthreads()
print("mcpi_my_threads(10^8):")
@btime mcpi_my_threads(10^8)

Julia v1.6.2
Threads.nthreads() = 12
mcpi_my_threads(10^8): 36.002 ms (217 allocations: 242.23 KiB)
3.14150104

3 Likes

In my Win 10 environment, the timing results of the Monte Carlo estimation of π on the CPU are as follows:

  • CPU 1 thread: 223 ms
  • CPU 12 threads: 38 ms

I have tried the GPU CUDA version. (Jupyter notebook)

CPU 1 thread version for Float32:

using BenchmarkTools
using Random
function mcpi_f32(n)
    rng = MersenneTwister()
    4count(_ -> rand(rng, Float32)^2 + rand(rng, Float32)^2 ≤ 1, 1:n)/n
end
@show mcpi_f32(10^8)
@btime mcpi_f32(10^8);
mcpi_f32(10 ^ 8) = 3.14160556
  239.164 ms (12 allocations: 19.66 KiB)

My very simple only 1 line CUDA version for Float32:

using BenchmarkTools
using CUDA
mcpi_f32_cu(n) = 4count(x -> x^2 + rand(Float32)^2 ≤ 1, CUDA.rand(n))/n
@show mcpi_f32_cu(10^8)
@btime mcpi_f32_cu(10^8);
mcpi_f32_cu(10 ^ 8) = 3.14173908
  6.894 ms (8189 allocations: 257.55 KiB)

CPU 1 thread version for Float64:

using BenchmarkTools
using Random
function mcpi_f64(n)
    rng = MersenneTwister()
    4count(_ -> rand(rng)^2 + rand(rng)^2 ≤ 1, 1:n)/n
end
@show mcpi_f64(10^8)
@btime mcpi_f64(10^8);
mcpi_f64(10 ^ 8) = 3.14161428
  225.706 ms (12 allocations: 19.66 KiB)

My very simple only 1 line CUDA version for Float64:

using BenchmarkTools
using CUDA
mcpi_f64_cu(n) = 4count(x -> x^2 + rand(Float64)^2 ≤ 1, CUDA.rand(Float64, n))/n
@show mcpi_f64_cu(10^8)
@btime mcpi_f64_cu(10^8);
mcpi_f64_cu(10 ^ 8) = 3.1418442
  12.518 ms (15601 allocations: 489.17 KiB)
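
For comparison, a sketch of a two-array variant that avoids calling rand inside the GPU kernel (mcpi_f64_cu2 is an illustrative name; it assumes the usual CuArray broadcasting and count support):

# generate both coordinate arrays on the device, then do a broadcast + reduction
mcpi_f64_cu2(n) = (x = CUDA.rand(Float64, n); y = CUDA.rand(Float64, n); 4count(x .^ 2 .+ y .^ 2 .≤ 1) / n)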
My CUDA versioninfo
CUDA.versioninfo()
CUDA toolkit 11.4.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 462.31.0

Libraries: 
- CUBLAS: 11.5.4
- CURAND: 10.2.5
- CUFFT: 10.5.1
- CUSOLVER: 11.2.0
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+462.31
- CUDNN: 8.20.2 (for CUDA 11.4.0)
- CUTENSOR: 1.3.0 (for CUDA 11.2.0)

Toolchain:
- Julia: 1.6.2
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: GeForce GTX 1650 Ti (sm_75, 833.594 MiB / 4.000 GiB available)

Summary: n = 10^8

Previous post:

  • Float64 CPU 1 thread: 223 ms
  • Float64 CPU 12 threads: 38 ms

Only 1 line versions:

  • Float32 CPU 1 thread: 239 ms
  • Float32 GPU CUDA: 6.9 ms
  • Float64 CPU 1 thread: 226 ms
  • Float64 GPU CUDA: 12.5 ms

1 Like