Advice for improving Monte-Carlo code

Thanks for the reply, it’s really helpful. I didn’t even think about StaticArrays. Just so I can learn, what do you mean by “anti-patterns”?

Probably this is one “anti-pattern”: you are generating a tuple of random points. Tuples are immutable, so this can potentially be more performant than simply rand(2), which allocates a mutable vector. You could have used

random_point = (rand(),rand())

but that would also be somewhat unnatural (though, for this specific purpose, just fine). If you really want static arrays, you would want to use the StaticArrays package:

julia> using StaticArrays

julia> random_point = @SVector rand(2)
2-element SArray{Tuple{2},Float64,1,2} with indices SOneTo(2):
 0.9595201727211979
 0.06002658612324363

(but I think using static arrays is a future step here, after solving other more important issues)
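For instance, here is a minimal sketch of how the three options compare on allocations (assuming BenchmarkTools is installed; the timings themselves will vary by machine):

using BenchmarkTools, StaticArrays

@btime rand(2);            # heap-allocates a Vector on every call
@btime (rand(), rand());   # Tuple: immutable, no heap allocation
@btime @SVector rand(2);   # SVector: immutable and fixed-size, no heap allocation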

Somewhat related, for MC calculations have a look at
https://baggepinnen.github.io/MonteCarloMeasurements.jl/latest/
Awesome package which makes it easy to do MC. Your original example becomes:

julia> using MonteCarloMeasurements, Distributions

julia> n = 1000;

julia> p1 = [Particles(n, Uniform(0,1)), Particles(n, Uniform(0,1))]
2-element Array{Particles{Float64,1000},1}:
 0.5 ± 0.29
 0.5 ± 0.29

julia> p2 = [Particles(n, Uniform(0,1)), Particles(n, Uniform(0,1))];

julia> dist(p1, p2) = sqrt((p1[1]-p2[1])^2 + (p1[2]-p2[2])^2)
dist (generic function with 1 method)

julia> dist(p1,p2)
Particles{Float64,1000}
 0.523052 ± 0.25

I think it’s fast too; here’s the timing on a new laptop:

julia> @btime dist(p1,p2)
  9.568 μs (7 allocations: 47.64 KiB)
Particles{Float64,1000}
 0.523052 ± 0.25

Instead of making these points individually, you may wish to keep them as columns of a matrix, in which case this is a possible one-line solution:

julia> using Distances, Statistics

julia> @btime mean(pairwise(Euclidean(), rand(2, 10^3), dims=2))
  3.188 ms (4 allocations: 7.65 MiB)
0.5246766109743328

julia> @btime mean(colwise(Euclidean(), rand(2, 10^3), rand(2, 10^3)))
  6.929 μs (3 allocations: 39.44 KiB)
0.5127943200515588

Note that the first way is N points and N^2 distances, while the second (and the original post) are 2N points & N distances.

The N^2 way seems to give a smaller deviation; I’m surprised by the ± 0.25 estimate above:

julia> std([mean(pairwise(Euclidean(), rand(2, 10^3), dims=2)) for _ in 1:500])
0.005379150837174151

julia> std([mean(colwise(Euclidean(), rand(2, 10^3), rand(2, 10^3))) for _ in 1:500])
0.008066908056271286

Just for the sake of didactics:

The “natural” way to solve that problem is probably:

julia> function meandist(n)
         d = 0.
         for i in 1:n
           x = rand(2)
           y = rand(2)
           d += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
         end
         return d/n
       end

julia> @btime meandist(1000)
  88.329 μs (2000 allocations: 187.50 KiB)
0.5351316591605562

This is quite slow and allocates a lot, because the x and y random vectors are heap-allocated at each iteration of the loop. These allocations can be avoided by using immutable, stack-allocated containers, such as tuples:

julia> function meandist(n)
         d = 0.
         for i in 1:n
           x = (rand(),rand())
           y = (rand(),rand())
           d += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
         end
         return d/n
       end
meandist (generic function with 1 method)

julia> @btime meandist(1000)
  13.012 μs (0 allocations: 0 bytes)
0.5122565299210201

or with the StaticArrays package:

julia> using StaticArrays

julia> function meandist(n)
         d = 0.
         for i in 1:n
           x = SVector{2,Float64}(rand(),rand())
           y = SVector{2,Float64}(rand(),rand())
           d += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
         end
         return d/n
       end
meandist (generic function with 1 method)

julia> @btime meandist(1000)
  13.027 μs (0 allocations: 0 bytes)
0.5096925569126717

These options are not as fast as the one that uses the Distances package, because there the computation of the distances is loop-vectorized. We can do that “by hand” (edit: without manually using @avx the performance is the same; I guess loop vectorization happens automatically anyway):

julia> using LoopVectorization

julia> function meandist(n)
         d = 0.
         x = rand(2,n)
         y = rand(2,n)
         @avx for i in 1:n
           d += sqrt((x[1,i]-y[1,i])^2+(x[2,i]-y[2,i])^2)
         end
         return d/n
       end
meandist (generic function with 1 method)

julia> @btime meandist(1000)
  5.006 μs (2 allocations: 31.50 KiB)
0.5336431684976665

and, of course, a lot of that time is just the generation of the two matrices:

julia> function meandist(n,x,y)
         d = 0.
         @avx for i in 1:n
           d += sqrt((x[1,i]-y[1,i])^2+(x[2,i]-y[2,i])^2)
         end
         return d/n
       end
meandist (generic function with 2 methods)

julia> x = rand(2,1000); y = rand(2,1000);

julia> @btime meandist(1000,$x,$y)
  1.530 μs (0 allocations: 0 bytes)
0.5345441925830112


Edit: Without generating the list of random points inside the function, the option with static arrays is as fast as the loop-vectorized one (perhaps it gets loop-vectorized by the compiler automatically):

julia> function meandist(n,x,y)
         d = 0.
         for i in 1:n
           d += sqrt((x[i][1]-y[i][1])^2+(x[i][2]-y[i][2])^2)
         end
         return d/n
       end
meandist (generic function with 2 methods)

julia> x = rand(SVector{2,Float64},1000);

julia> y = rand(SVector{2,Float64},1000);

julia> @btime meandist(1000,$x,$y)
  1.508 μs (0 allocations: 0 bytes)
0.5239587680180073


The equivalent version without static arrays is slower, but not bad:

julia> x = [ rand(2) for i in 1:1000 ];

julia> y = [ rand(2) for i in 1:1000 ];

julia> @btime meandist(1000,$x,$y)
  3.292 μs (0 allocations: 0 bytes)
0.5311847109064113

Edit 2:

Given that allocating the vectors outside the function gives similar results for static arrays and matrices, the difference is in the time required to generate them. There are important differences there, indeed:

julia> @btime [ SVector{2,Float64}(rand(),rand()) for i in 1:1000 ];
  12.237 μs (1 allocation: 15.75 KiB)

julia> @btime rand(SVector{2,Float64},1000);
  3.195 μs (1 allocation: 15.75 KiB)

julia> @btime rand(2,1000);
  1.605 μs (1 allocation: 15.75 KiB)


That’s surprising, and surely fixable.

I think you may not be getting much from @avx here; if I replace it with @inbounds that function is equally quick. But if you permute the matrices, so that the loop iterates along the dense dimension, then it’s quicker:

julia> @btime meandist(1000,$x,$y); # as before, but on my computer
  1.704 μs (0 allocations: 0 bytes)

julia> @btime meandist0(1000,$x,$y); # version without @avx
  1.674 μs (0 allocations: 0 bytes)

julia> xt = rand(1000, 2)'; yt = rand(1000, 2)';

julia> @btime meandist(1000,$xt,$yt); # iterating along stride=1 direction
  838.082 ns (0 allocations: 0 bytes)

julia> strides(x)
(1, 2)

julia> strides(xt)
(1000, 1)

Excellent, thanks. Yes, for loop vectorization to be efficient, the larger dimension should be the column (stride-1) dimension.

According to Cohen and Courgeau (2019), it seems that the theoretical standard deviation is ~0.248 (the variance is ~0.0615).
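For a quick sanity check, both values match the known closed forms for two uniform points in the unit square (the mean is the classic “square line picking” constant; a sketch):

# Exact mean distance between two uniform random points in the unit square:
exact_mean = (2 + sqrt(2) + 5*asinh(1)) / 15   # ≈ 0.52140543
# E[d^2] = 2E[(X-Y)^2] = 2/6 = 1/3 for independent X, Y ~ U(0,1), so:
exact_var = 1/3 - exact_mean^2                 # ≈ 0.0615
sqrt(exact_var)                                # ≈ 0.2479, i.e. the ~0.248 above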

Oh right, of course there are two standard deviations: one for a single pair of points, and one for the mean over all 1000. Adapting my line above, this is the one-pair case, which is clearly what MonteCarloMeasurements was tracking:

julia> std([only(colwise(Euclidean(), rand(2, 1), rand(2, 1))) for _ in 1:500]) 
0.24930807859433118

Here is an alternative which can be multithreaded using ThreadsX.jl and OnlineStats.jl. It is also useful when N is quite large: even doing 1 billion comparisons only took a handful of seconds for me.

julia> using ThreadsX, OnlineStats, Statistics

julia> distfun(x,y) = sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
distfun (generic function with 1 method)

julia> m=ThreadsX.reduce(Moments(), distfun(rand(2),rand(2)) for _ in 1:10000000)
Moments: n=10000000 | value=[0.52133, 0.333226, 0.240583, 0.188733]

julia> mean(m)
0.5213296332974621

julia> std(m)
0.24787385140189952

Don’t forget to start Julia with multiple threads, e.g. julia -t N.
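For instance:

$ julia -t 8

julia> Threads.nthreads()
8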


It may also be useful to use other RNGs:

using BenchmarkTools
using Random
using RandomNumbers.Xorshifts
using StableRNGs

function meandist(n, rng = Random.GLOBAL_RNG)
    d = 0.

    x = rand(rng, 2, n)
    y = rand(rng, 2, n)
    for i in 1:n
        d += sqrt((x[1,i]-y[1,i])^2+(x[2,i]-y[2,i])^2)
    end
    return d/n
end

function meandist2(n, rng = Random.GLOBAL_RNG)
    d = 0.

    for i in 1:n
        x = (rand(rng),rand(rng))
        y = (rand(rng),rand(rng))
        d += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
    end
    return d/n
end

julia> @btime meandist(1000)
  5.346 μs (2 allocations: 31.50 KiB)

julia> rng = Xoroshiro128Plus()
julia> @btime meandist2(1000, $rng)
  5.154 μs (0 allocations: 0 bytes)

julia> rng = StableRNG(2020)
julia> @btime meandist2(1000, $rng)
  4.744 μs (0 allocations: 0 bytes)

So other RNGs can be faster and allow a non-allocating scheme.


Again, just for didactical purposes (my aim here is to show that those “one-liners” are not magic, they can be written in Julia, I think that they might be a show-stopper for a new user, because while using package functions is great and must be done, what differeciate Julia from scripted languages is that the performance of such functions can be achieved by the user and, thus, these functions can be customized in every detail depending on the user case):


julia> function meandist(n,x,y)
         d = 0.
         for i in 1:n
           d += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2)
         end
         return d/n
       end
meandist (generic function with 1 method)

julia> function meandist_parallel(n,x,y)
         d = zeros(Threads.nthreads())
         Threads.@threads for i in 1:n
           d[Threads.threadid()] += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2)
         end
         return sum(d)/n
       end
meandist_parallel (generic function with 1 method)


julia> n = 100_000_000
100000000

julia> x = rand(n,2); y = rand(n,2);

julia> using BenchmarkTools

julia> @btime meandist($n,$x,$y)
  212.870 ms (0 allocations: 0 bytes)
0.5213806013781946

julia> @btime meandist_parallel($n,$x,$y)
  134.889 ms (22 allocations: 3.11 KiB)
0.5213806013784907

julia> Threads.nthreads()
4

Scaling here is not great, but the idea is simple.


A multi-threaded version which uses random number generation is more involved; it was discussed here.

In a simplified version it may look like this:

function meandist3(n, rng)
    d = zeros(Threads.nthreads())
    Threads.@threads for i in 1:n
        k = Threads.threadid()
        x = (rand(rng[k]),rand(rng[k]))
        y = (rand(rng[k]),rand(rng[k]))
        d[k] += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
    end
    return sum(d)/n
end

And the benchmarks:

@btime meandist(100_000_000)
  # 1.690 s (4 allocations: 2.98 GiB)

rngs = Tuple(StableRNG.(2020 .+ collect(1:Threads.nthreads())))
@btime meandist3(100_000_000, $rngs)
  # 1.060 s (43 allocations: 6.13 KiB)

This is important, because if you have less than 30 GB of RAM it is the only way to run the following:

julia> @time meandist3(1_000_000_000, rngs)
 13.683740 seconds (44 allocations: 6.141 KiB)
0.5214066471370079

EDIT: ha, interestingly enough, MersenneTwister is suddenly the best here:

julia> rngs = Tuple(Random.MersenneTwister.(2020 .+ collect(1:Threads.nthreads())))
julia> @time meandist3(1_000_000_000, rngs)
  8.425097 seconds (48 allocations: 6.266 KiB)
0.521398839144987

I have tried to follow the great video from Prof. Edelman here to sketch a solution using FLoops.jl. This seems to run relatively fast on a Win10 laptop with 10 processors, but I do not understand the difference from the Threads approach above:

using FLoops, BenchmarkTools
function meandist_parallel(n,x,y,ncores)
    d = 0
    @floop ThreadedEx(basesize=n÷ncores) for i in 1:n
       @reduce ( d += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2) )
    end
    return d/n
end

n = 100_000_000
x = rand(n,2); y = rand(n,2);

julia> @btime meandist_parallel($n,$x,$y,10)   # ncores=10
  106.135 ms (644 allocations: 36.56 KiB)
0.5214049793107156

Thank you.

My short answer: I don’t know.

Since nobody who really knows the internals has answered, here is my long answer:

Here the option with Threads performs as well as the FLoops code you sent, just by adding an @inbounds:

julia> function meandist_parallel(n,x,y)
         d = zeros(Threads.nthreads())
         Threads.@threads for i in 1:n
           @inbounds d[Threads.threadid()] += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2)
         end
         return sum(d)/n
       end
meandist_parallel (generic function with 2 methods)

julia> @btime meandist_parallel($n,$x,$y)
  112.104 ms (22 allocations: 3.11 KiB)
0.5213806013784907

julia> function meandist_parallel(n,x,y,ncores)
           d = 0
           @floop ThreadedEx(basesize=n÷ncores) for i in 1:n
              @reduce ( d += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2) )
           end
           return d/n
       end
meandist_parallel (generic function with 2 methods)

julia> @btime meandist_parallel($n,$x,$y,Threads.nthreads())
  112.343 ms (136 allocations: 7.27 KiB)
0.5213806013784907

(and it allocates less, so in this case the first option is generally better).

My two cents: many packages have been part of the development of multi-threading and parallelization in Julia and are now becoming part of Base or getting deprecated, and that is one of those. But I might be very wrong.

1 Like

No, this is not exactly true; FLoops.jl is definitely not a deprecated package. Maybe the author of the package will come in and explain it better than I can, but FLoops.jl and its sibling ThreadsX.jl (you can read the corresponding announcements here and here) are built on the basis of Transducers.jl, and they are very flexible and powerful tools for writing high-level multithreading and parallel-processing code.

In this particular problem they do not shine, because it is a typical embarrassingly parallel problem and indeed the Threads.@threads macro is enough to solve it. But they are useful in other scenarios, for example parallel sorting, where the simple @threads macro can’t be applied; see the sketch below.
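For instance, a minimal sketch of a parallel sort (assuming ThreadsX is installed):

using ThreadsX

v = rand(10^7)
ThreadsX.sort!(v)   # sorts in parallel across the available threads
issorted(v)         # true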


@lmiq thank you.
For the record, for n = 1_500_000_000, on a 64 GB RAM Win10 laptop using 11 processors, the results for the two parallel solutions above stay close in terms of speed using @btime:

@floop:   1.568 s (645 allocations: 36.59 KiB)
Threads.@threads:  1.576 s (58 allocations: 9.50 KiB)

When n equals 2 billion, the Julia code still runs on my laptop, but x and y fill up the RAM and the @btime results seem to be non-repeatable.

FYI, with FLoops.jl, FoldsCUDA.jl, and Random123.jl, you can easily move this to GPU with a small modification:

using CUDA
using FLoops
using FoldsCUDA
using Random123

function counters(n)
    stride = typemax(UInt64) ÷ n
    return UInt64(0):stride:typemax(UInt64) - stride
end

function meandist_floop(n; m=10_000, ex=has_cuda_gpu() ? CUDAEx() : ThreadedEx())
    @floop ex for ctr in counters(n)
        rng = set_counter!(Philox2x(0), ctr)
        s = 0.0
        for _ in 1:m
            x1 = rand(rng)
            y1 = rand(rng)
            x2 = rand(rng)
            y2 = rand(rng)
            s += sqrt((x1 - y1)^2 + (x2 - y2)^2)
        end
        @reduce(d = 0.0 + s)
    end
    return d / (n * m)
end

meandist_floop(2^10) # run on GPU if you have one
meandist_floop(2^10; ex = ThreadedEx()) # forcing to run on CPU threads
meandist_floop(2^10; ex = DistributedEx()) # run on multiple processes/nodes

For an explanation on how to use counter-based RNG on GPU, see: Monte-Carlo π · FoldsCUDA


@tkf: would it be possible to shed some more light on your top-class post:

  • On a Windows 10 laptop the GPU option is not available (NVIDIA Quadro P3200), while the ThreadedEx and DistributedEx options run similarly fast (12 Intel Xeon E-2186 CPUs). What is the difference between these two options?

  • Could you provide a basic explanation of the counter-based RNG logic (i.e., why the problem is decoupled into n × m loops)? The link that you provided is rather hard to grasp.

Thank you in advance.

ThreadedEx uses threads (Threads.@spawn) and DistributedEx uses multiple processes (Distributed.jl). Both let us use multiple CPUs, but the latter lets us use multiple machines. If you are using only one machine, I’d try threading first: it’s much easier to share things and has less overhead.

But, unfortunately, choosing threading vs multi-processing is a bit complicated in Julia and case-specific. I’m trying to explain a useful guideline here: Frequently asked questions

I’m not a PRNG specialist, but here’s my understanding :sweat_smile:. Recall that all PRNGs are “just” clever state machines with a large but finite period (the minimum number of iterations it takes for the PRNG to come back to its initial state). What is specific about a counter-based PRNG is that you can set the state of the PRNG to an arbitrary point (“phase”) of the period with a very small computation. That’s the set_counter! function I’m calling. So, since we know the period of the PRNG and there is a way to choose where to start, we can evenly split the entire period into chunks and use each chunk locally within a thread.

I’m introducing the inner loop because set_counter! has an overhead comparable to the actual random number generation. So, it makes sense to use the RNG for a while after you set the counter.
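As a minimal sketch of that idea on the CPU (the chunking mirrors the counters function above; the variable names are illustrative):

using Random123

# One disjoint Philox stream per thread, spaced evenly over the counter space:
nchunks = Threads.nthreads()
chunk = typemax(UInt64) ÷ nchunks
rngs = [set_counter!(Philox2x(0), (i - 1) * chunk) for i in 1:nchunks]

# Each rngs[k] now draws from its own chunk of the period, so thread k can
# call rand(rngs[k]) without locking or sharing RNG state.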
