Advice for improving Monte-Carlo code

@lmiq thank you.
For the record, for n = 1_500_000_000, on a 64 GB RAM Win10 laptop using 11 threads, the results for the two parallel solutions above stay close in terms of speed according to @btime:

@floop:            1.568 s (645 allocations: 36.59 KiB)
Threads.@threads:  1.576 s (58 allocations: 9.50 KiB)

When n equals 2 billion, the Julia code still runs on my laptop, but the x and y vectors fill up the RAM and the @btime results are no longer repeatable.

FYI, with FLoops.jl, FoldsCUDA.jl, and Random123.jl, you can easily move this to the GPU with a small modification:

using CUDA
using FLoops
using FoldsCUDA
using Random123

function counters(n)
    # Split the full UInt64 counter space into n evenly spaced starting
    # points, one per outer-loop iteration.
    stride = typemax(UInt64) ÷ n
    return UInt64(0):stride:typemax(UInt64) - stride
end

function meandist_floop(n; m=10_000, ex=has_cuda_gpu() ? CUDAEx() : ThreadedEx())
    @floop ex for ctr in counters(n)
        # Give each iteration its own RNG, jumped to a disjoint slice of
        # the Philox stream, so iterations are independent and thread-safe.
        rng = set_counter!(Philox2x(0), ctr)
        s = 0.0
        for _ in 1:m
            x1 = rand(rng)
            y1 = rand(rng)
            x2 = rand(rng)
            y2 = rand(rng)
            # distance between the random points (x1, x2) and (y1, y2)
            s += sqrt((x1 - y1)^2 + (x2 - y2)^2)
        end
        @reduce(d = 0.0 + s)  # combine per-iteration partial sums
    end
    return d / (n * m)
end

meandist_floop(2^10) # run on GPU if you have one
meandist_floop(2^10; ex = ThreadedEx()) # force running on CPU threads
meandist_floop(2^10; ex = DistributedEx()) # run on multiple processes/nodes
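
(As a sanity check: the four draws per inner iteration form two independent uniform points in the unit square, so every variant above should converge to the known mean distance (2 + √2 + 5·asinh(1))/15 ≈ 0.5214.)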

For an explanation of how to use counter-based RNGs on the GPU, see the Monte-Carlo π example in the FoldsCUDA documentation.

@tkf: would it be possible to shed some more light on your top-class post:

  • On a Windows 10 laptop the GPU option is not available (NVIDIA Quadro P3200), while the ThreadedEx and DistributedEx options run similarly fast (Intel Xeon E-2186, 12 logical CPUs). What is the difference between these two options?

  • Could you provide a basic explanation of the counter-based RNG logic (i.e., why the problem is decoupled into n × m loops)? The link you provided is rather hard to grasp.

Thank you in advance.

ThreadedEx uses threads (Threads.@spawn) and DistributedEx uses multiple processes (Distributed.jl). Both let us use multiple CPUs, but the latter also lets us use multiple machines. If you are using only one machine, I’d try threading first: it’s much easier to share things, and it has less overhead.

But, unfortunately, choosing threading vs multi-processing is a bit complicated in Julia and case-specific. I’m trying to explain a useful guideline here: Frequently asked questions
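
If it helps, here is a rough, untested sketch of what switching to multiple processes looks like (meandist.jl is a hypothetical file holding the definitions above):

using Distributed
addprocs(4)                        # four local workers; addprocs can also reach other machines

@everywhere using CUDA, FLoops, FoldsCUDA, Random123  # every worker needs the packages...
@everywhere include("meandist.jl")                    # ...and the function definitions

meandist_floop(2^10; ex = DistributedEx())            # loop chunks now run on the workers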

I’m not a PRNG specialist, but here’s my understanding 😅. Recall that all PRNGs are “just” clever state machines with a large but finite period (the minimum number of iterations it takes for the PRNG to come back to its initial state). What is specific about counter-based PRNGs is that you can set the state of the PRNG to an arbitrary point (“phase”) of the period with a very small computation. That’s the set_counter! function I’m calling. So, since we know the period of the PRNG and there is a way to choose where to start, we can evenly split the entire period into chunks and use each chunk locally within a thread.
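
For example, here is a small sketch of the “jump to a phase” idea with Random123.jl:

using Random123

rng1 = Philox2x(0)            # counter-based PRNG with seed 0
set_counter!(rng1, 123)       # jump straight to "phase" 123 -- an O(1) operation
x = rand(rng1)

rng2 = Philox2x(0)
set_counter!(rng2, 123)       # same seed + same counter = same point in the stream
@assert rand(rng2) == x       # so the draws match exactly

That’s why each @floop iteration above can build its own Philox2x(0) and jump to a disjoint chunk: the per-iteration streams are reproducible and, as long as the m draws fit within one stride, they don’t overlap.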

I’m introducing the inner loop because set_counter! has an overhead comparable to the actual random number generation, so it makes sense to keep using the RNG for a while after setting the counter.
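
You can check this on your own machine with something like (untested sketch):

using BenchmarkTools, Random123

rng = Philox2x(0)
@btime set_counter!($rng, 42)   # setting the counter...
@btime rand($rng)               # ...costs on the order of one draw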
