Advice for improving Monte-Carlo code

@lmiq thank you.
For the record, for n = 1_500_000_000, on a 64 GB RAM Win10 laptop using 11 threads, the results for the two parallel solutions above stay close in terms of speed according to @btime:

@floop:            1.568 s (645 allocations: 36.59 KiB)
Threads.@threads:  1.576 s (58 allocations: 9.50 KiB)

When n equals 2 billion, the Julia code still runs on my laptop, but the x and y vectors fill up the RAM and the @btime results are no longer repeatable.

FYI, with FLoops.jl, FoldsCUDA.jl, and Random123.jl, you can easily move this to the GPU with a small modification:

using CUDA
using FLoops
using FoldsCUDA
using Random123

function counters(n)
    # Split the full UInt64 counter space into n evenly spaced starting
    # points, one per outer-loop iteration.
    stride = typemax(UInt64) ÷ n
    return UInt64(0):stride:typemax(UInt64) - stride
end

function meandist_floop(n; m=10_000, ex=has_cuda_gpu() ? CUDAEx() : ThreadedEx())
    @floop ex for ctr in counters(n)
        # Give each iteration its own RNG, jumped to a disjoint slice of
        # the Philox stream, so iterations are independent and thread-safe.
        rng = set_counter!(Philox2x(0), ctr)
        s = 0.0
        for _ in 1:m
            x1 = rand(rng)
            y1 = rand(rng)
            x2 = rand(rng)
            y2 = rand(rng)
            # distance between the random points (x1, x2) and (y1, y2)
            s += sqrt((x1 - y1)^2 + (x2 - y2)^2)
        end
        @reduce(d = 0.0 + s)  # combine per-iteration partial sums
    end
    return d / (n * m)
end

meandist_floop(2^10) # run on GPU if you have one
meandist_floop(2^10; ex = ThreadedEx()) # force running on CPU threads
meandist_floop(2^10; ex = DistributedEx()) # run on multiple processes/nodes
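
(As a sanity check: the four draws per inner iteration form two independent uniform points in the unit square, so every variant above should converge to the known mean distance (2 + √2 + 5·asinh(1))/15 ≈ 0.5214.)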

For an explanation of how to use counter-based RNGs on the GPU, see the Monte-Carlo π example in the FoldsCUDA documentation.

@tkf: would it be possible to shed some more light on your top-class post:

  • On a Windows 10 laptop the GPU option is not available (NVIDIA Quadro P3200), while the ThreadedEx and DistributedEx options run similarly fast (Intel Xeon E-2186, 12 logical CPUs). What is the difference between these two options?

  • Could you provide a basic explanation of the counter-based RNG logic (i.e., why the problem is decoupled into n × m loops)? The link you provided is rather hard to grasp.

Thank you in advance.

ThreadedEx uses threads (Threads.@spawn) and DistributedEx uses multiple processes (Distributed.jl). Both let us use multiple CPUs, but the latter also lets us use multiple machines. If you are using only one machine, I’d try threading first: it’s much easier to share things, and it has less overhead.

But, unfortunately, choosing threading vs multi-processing is a bit complicated in Julia and case-specific. I’m trying to explain a useful guideline here: Frequently asked questions
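
If it helps, here is a rough, untested sketch of what switching to multiple processes looks like (meandist.jl is a hypothetical file holding the definitions above):

using Distributed
addprocs(4)                        # four local workers; addprocs can also reach other machines

@everywhere using CUDA, FLoops, FoldsCUDA, Random123  # every worker needs the packages...
@everywhere include("meandist.jl")                    # ...and the function definitions

meandist_floop(2^10; ex = DistributedEx())            # loop chunks now run on the workers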

I’m not a PRNG specialist, but here’s my understanding 😅. Recall that all PRNGs are “just” clever state machines with a large but finite period (the minimum number of iterations it takes for the PRNG to come back to its initial state). What is specific about counter-based PRNGs is that you can set the state of the PRNG to an arbitrary point (“phase”) of the period with a very small computation. That’s the set_counter! function I’m calling. So, since we know the period of the PRNG and there is a way to choose where to start, we can evenly split the entire period into chunks and use each chunk locally within a thread.
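
For example, here is a small sketch of the “jump to a phase” idea with Random123.jl:

using Random123

rng1 = Philox2x(0)            # counter-based PRNG with seed 0
set_counter!(rng1, 123)       # jump straight to "phase" 123 -- an O(1) operation
x = rand(rng1)

rng2 = Philox2x(0)
set_counter!(rng2, 123)       # same seed + same counter = same point in the stream
@assert rand(rng2) == x       # so the draws match exactly

That’s why each @floop iteration above can build its own Philox2x(0) and jump to a disjoint chunk: the per-iteration streams are reproducible and, as long as the m draws fit within one stride, they don’t overlap.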

I’m introducing the inner loop because set_counter! has an overhead comparable to the actual random number generation, so it makes sense to keep using the RNG for a while after setting the counter.
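
You can check this on your own machine with something like (untested sketch):

using BenchmarkTools, Random123

rng = Philox2x(0)
@btime set_counter!($rng, 42)   # setting the counter...
@btime rand($rng)               # ...costs on the order of one draw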
