Advice for improving Monte-Carlo code

I’m a relative beginner with Julia and was hoping someone may be able to give me some constructive feedback on a small script I’ve written. I’m mainly looking for performance tips and ways to make my code clearer or less verbose. The goal of the script is to a simple Monte-Carlo experiment to calculate the expected distance between two random uniformly distributed points in the unit square. Many thanks.

using Random
using Statistics

#=
Calculates the expected distance between two points in the unit square
picked randomly from a uniform distribution. Exact answer:
(1/15)(2 + sqrt(2) + 5*log(1+sqrt(2)) ≈ 0.5214
=#

trials = 1000

function generate_points(trials)
    a_points = []; b_points = [];

    for i in 1:(2*trials)
        random_point = map(rand, (Float64, Float64))
        if i <= trials
            push!(a_points, random_point)
        else
            push!(b_points, random_point)
        end
    end
    return a_points, b_points
end

function d(a_points, b_points)
    dist = []

    for i in 1:trials
        x = sqrt((a_points[i][2]-b_points[i][2])^2 +
        (a_points[i][1]-b_points[i][1])^2)
        push!(dist, x)
    end
    return dist
end

a, b = generate_points(trials);
distance = d(a, b);
expected_distance = mean(distance)
  1. Avoid using push to increment your data point if you know the final size of the arrays, use, better:
a_point = Vector{Vector{Float64}}(undef,trials)

and later use

a_point[i] = rand(2)

for each element.

Anyway, in your example you are just generating a lot of random points, you could use:

a_points = [ rand(2) for i in 1:trials ]
  1. Same thing with dist, try to allocate it complete if possible, with dist = zeros(trials), or dist=Vector{Float64}(undef,trials), and then use
    dist[i] = sqrt(....)

  2. Even if you had less information a priori about the size of the vectors, it is important to provide which type of element they will contain. For example:

d = []

will perform much worse than

d = Float64[]

Finally, for that specific problem, you do not need to save the points at all (if you only want the final average distance), thus probably you do not need to generate the points and compute the distances in two separate functions and loops. Do all at once generating the points on the flight.

edit: probably this is a homework, please make that clear.

2 Likes

Thanks for the feedback. No, it’s not Homework I’m just trying to learn Julia for personal interest. Probably best not to assume without proper justification.

1 Like
julia> using Random, Statistics, StaticArrays, LinearAlgebra, BenchmarkTools
julia> generate_points(nTrials) = (rand(SVector{2, Float64}, nTrials), rand(SVector{2, Float64}, nTrials))
julia> d(a, b) = map((x,y)->norm(x-y), a, b)
julia> @btime begin
       a,b = generate_points(trials);
       distance = d(a,b);
       meandist = mean(distance);
       end;
  16.110 μs (5 allocations: 39.48 KiB)

Your original code contained a bunch of anti-patterns; primarily, you should almost never initialize an accumulator array with buf = [], because that is equivalent to buf = Vector{Any}(); second, you should use available high-quality libraries like StaticArrays; third, pushing into a Vector in a loop is bad style in cases where you know the desired size beforehand.

3 Likes

Thanks for the reply, It’s really helpful. Didn’t even think about StaticArrays . Just so I can learn, what do you mean by “anti-patterns”?

Probably this is one “anti-pattern”. You are generating a tuple of random points. Tuples are immutable, so this can potentially be more performant than simply rand(2), which generates a vector, which is mutable. Yet, you could have used

random_point = (rand(),rand())

but that would be also somewhat unnatural (yet, for this specific purpose, just fine), if you really want to use static arrays, you would want to the StaticArrays package:

julia> using StaticArrays

julia> random_point = @SVector rand(2)
2-element SArray{Tuple{2},Float64,1,2} with indices SOneTo(2):
 0.9595201727211979
 0.06002658612324363

(but I think using static arrays is a future step here, after solving other more important issues)

Somewhat related, for MC calculations have a look at
https://baggepinnen.github.io/MonteCarloMeasurements.jl/latest/
Awesome package which makes it easy to do MC. Your original example becomes:

julia> using MonteCarloMeasurements, Distributions

julia> n = 1000;

julia> p1 = [Particles(n, Uniform(0,1)), Particles(n, Uniform(0,1))]
2-element Array{Particles{Float64,1000},1}:
 0.5 ± 0.29
 0.5 ± 0.29

julia> p2 = [Particles(n, Uniform(0,1)), Particles(n, Uniform(0,1))];

julia> dist(p1, p2) = sqrt((p1[1]-p2[1])^2 + (p1[2]-p2[2])^2)
dist (generic function with 1 method)

julia> dist(p1,p2)
Particles{Float64,1000}
 0.523052 ± 0.25

I think it’s fast too, here the timing on a new laptop:

julia> @btime dist(p1,p2)
  9.568 μs (7 allocations: 47.64 KiB)
Particles{Float64,1000}
 0.523052 ± 0.25
4 Likes

Instead of making these points individually, you may wish to keep them as columns of a matrix. In which case this is a possible one-line solution:

julia> using Distances

julia> @btime mean(pairwise(Euclidean(), rand(2, 10^3), dims=2))
  3.188 ms (4 allocations: 7.65 MiB)
0.5246766109743328

julia> @btime mean(colwise(Euclidean(), rand(2, 10^3), rand(2, 10^3)))
  6.929 μs (3 allocations: 39.44 KiB)
0.5127943200515588

Note that the first way is N points and N^2 distances, while the second (and the original post) are 2N points & N distances.

The N^2 way seems to give a smaller deviation, I’m surprised by the ± 0.25 estimate above:

julia> std([mean(pairwise(Euclidean(), rand(2, 10^3), dims=2)) for _ in 1:500])
0.005379150837174151

julia> std([mean(colwise(Euclidean(), rand(2, 10^3), rand(2, 10^3))) for _ in 1:500])
0.008066908056271286
1 Like

Just for the sake of didactics:

The “natural” way to solve that problem is probably:

julia> function meandist(n)
         d = 0.
         for i in 1:n
           x = rand(2)
           y = rand(2)
           d += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
         end
         return d/n
       end

julia> @btime meandist(1000)
  88.329 μs (2000 allocations: 187.50 KiB)
0.5351316591605562

This is quite slow and allocates a lot, because of the allocations of x and y random vectors at each iteration of the loop. These allocations can be solved by using static allocations, with tuples:

julia> function meandist(n)
         d = 0.
         for i in 1:n
           x = (rand(),rand())
           y = (rand(),rand())
           d += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
         end
         return d/n
       end
meandist (generic function with 1 method)

julia> @btime meandist(1000)
  13.012 μs (0 allocations: 0 bytes)
0.5122565299210201

or with the StaticArrays package:

julia> using StaticArrays

julia> function meandist(n)
         d = 0.
         for i in 1:n
           x = SVector{2,Float64}(rand(),rand())
           y = SVector{2,Float64}(rand(),rand())
           d += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
         end
         return d/n
       end
meandist (generic function with 1 method)

julia> @btime meandist(1000)
  13.027 μs (0 allocations: 0 bytes)
0.5096925569126717

These options are not as fast as the one that uses the Distances package because there the computation of the distances is loop-vectorized. We can do that “by hand” (edit: without manually using @avx the performance is the same - I guess loop vectorization occurs anyway, automatically):

julia> using LoopVectorization

julia> function meandist(n)
         d = 0.
         x = rand(2,n)
         y = rand(2,n)
         @avx for i in 1:n
           d += sqrt((x[1,i]-y[1,i])^2+(x[2,i]-y[2,i])^2)
         end
         return d/n
       end
meandist (generic function with 1 method)

julia> @btime meandist(1000)
  5.006 μs (2 allocations: 31.50 KiB)
0.5336431684976665

and, of course, lot of that time is just the generation of the two matrices:

julia> function meandist(n,x,y)
         d = 0.
         @avx for i in 1:n
           d += sqrt((x[1,i]-y[1,i])^2+(x[2,i]-y[2,i])^2)
         end
         return d/n
       end
meandist (generic function with 2 methods)

julia> x = rand(2,1000); y = rand(2,1000);

julia> @btime meandist(1000,$x,$y)
  1.530 μs (0 allocations: 0 bytes)
0.5345441925830112


Edit: Without generating the list of random points inside the function, the option with static arrays is as fast as the loop-vectorized (perhaps it gets loop-vectorized by the compiler automatically):

julia> function meandist(n,x,y)
         d = 0.
         for i in 1:n
           d += sqrt((x[i][1]-y[i][1])^2+(x[i][2]-y[i][2])^2)
         end
         return d/n
       end
meandist (generic function with 2 methods)

julia> x = rand(SVector{2,Float64},1000)

julia> y = rand(SVector{2,Float64},1000)

julia> @btime meandist(1000,$x,$y)
  1.508 μs (0 allocations: 0 bytes)
0.5239587680180073


The equivalent version without static arrays is slower, but not bad:

julia> x = [ rand(2) for i in 1:1000 ];

julia> y = [ rand(2) for i in 1:1000 ];

julia> @btime meandist(1000,$x,$y)
  3.292 μs (0 allocations: 0 bytes)
0.5311847109064113

Edit 2:

Given that allocating the vectors outside the function gives similar results for static arrays or matrices, the difference is in the time required to generate those. There are important differences there, indeed:

julia> @btime [ SVector{2,Float64}(rand(),rand()) for i in 1:1000 ];
  12.237 μs (1 allocation: 15.75 KiB)

julia> @btime rand(SVector{2,Float64},1000);
  3.195 μs (1 allocation: 15.75 KiB)

julia> @btime rand(2,1000);
  1.605 μs (1 allocation: 15.75 KiB)

7 Likes

That’s surprising, and surely fixable.

I think you may not be getting much from @avx here, if I replace it with @inbounds that function is equally quick. But if you permute the matrices, so that it iterates along the dense dimension, then it’s quicker:

julia> @btime meandist(1000,$x,$y); # as before, but on my computer
  1.704 μs (0 allocations: 0 bytes)

julia> @btime meandist0(1000,$x,$y); # version without @avx
  1.674 μs (0 allocations: 0 bytes)

julia> xt = rand(1000, 2)'; yt = rand(1000, 2)';

julia> @btime meandist(1000,$xt,$yt); # iterating along stride=1 direction
  838.082 ns (0 allocations: 0 bytes)

julia> strides(x)
(1, 2)

julia> strides(xt)
(1000, 1)
2 Likes

Excellent, thanks. Yes, for the loop vectorization be efficient the larger dimension should be column-based.

According to Cohen and Courgeau (2019), it seems that the theoretical standard deviation is ~0.248 (variance is ~0.0615)

Oh right, of course there are two standard deviations, for one pair of points, or for all 1000. Adapting my line above, this is the one-pair case, which is clearly what MonteCarloMeasurements was tracking:

julia> std([only(colwise(Euclidean(), rand(2, 1), rand(2, 1))) for _ in 1:500]) 
0.24930807859433118
1 Like

Here is an alternative which can be multithreaded by using ThreadsX.jl and OnlineStats.jl. Also useful for when N is quite large. Even doing 1 billion comparisons only took a handful of seconds for me.

julia> distfun(x,y) = sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
distfun (generic function with 1 method)

julia> m=ThreadsX.reduce(Moments(), distfun(rand(2),rand(2)) for _ in 1:10000000)
Moments: n=10000000 | value=[0.52133, 0.333226, 0.240583, 0.188733]

julia> mean(m)
0.5213296332974621

julia> std(m)
0.24787385140189952

Don’t forget to start Julia with multiple threads, eg. julia -t N

2 Likes

It is also may be useful to use another RNGs

using BenchmarkTools
using RandomNumbers.Xorshifts
using StableRNGs

function meandist(n, rng = Random.GLOBAL_RNG)
    d = 0.

    x = rand(rng, 2, n)
    y = rand(rng, 2, n)
    for i in 1:n
        d += sqrt((x[1,i]-y[1,i])^2+(x[2,i]-y[2,i])^2)
    end
    return d/n
end

function meandist2(n, rng = Random.GLOBAL_RNG)
    d = 0.

    for i in 1:n
        x = (rand(rng),rand(rng))
        y = (rand(rng),rand(rng))
        d += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
    end
    return d/n
end

julia> @btime meandist(1000)
  5.346 μs (2 allocations: 31.50 KiB)

julia> rng = Xoroshiro128Plus()
julia> @btime meandist2(1000, $rng)
  5.154 μs (0 allocations: 0 bytes)

julia> rng = StableRNG(2020)
julia> @btime meandist2(1000, $rng)
  4.744 μs (0 allocations: 0 bytes)

So other RNGs can be faster and can use non allocating scheme.

1 Like

Again, just for didactical purposes (my aim here is to show that those “one-liners” are not magic, they can be written in Julia, I think that they might be a show-stopper for a new user, because while using package functions is great and must be done, what differeciate Julia from scripted languages is that the performance of such functions can be achieved by the user and, thus, these functions can be customized in every detail depending on the user case):


julia> function meandist(n,x,y)
         d = 0.
         for i in 1:n
           d += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2)
         end
         return d/n
       end
meandist (generic function with 1 method)

julia> function meandist_parallel(n,x,y)
         d = zeros(Threads.nthreads())
         Threads.@threads for i in 1:n
           d[Threads.threadid()] += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2)
         end
         return sum(d)/n
       end
meandist_parallel (generic function with 1 method)


julia> n = 100_000_000
100000000

julia> x = rand(n,2); y = rand(n,2);

julia> using BenchmarkTools

julia> @btime meandist($n,$x,$y)
  212.870 ms (0 allocations: 0 bytes)
0.5213806013781946

julia> @btime meandist_parallel($n,$x,$y)
  134.889 ms (22 allocations: 3.11 KiB)
0.5213806013784907

julia> Threads.nthreads()
4

Scaling here is not great, but the idea is simple.

2 Likes

Multi-threading version, which uses random number generation is more involved, it was discussed here

In a simplified version it may look like this

function meandist3(n, rng)
    d = zeros(Threads.nthreads())
    Threads.@threads for i in 1:n
        k = Threads.threadid()
        x = (rand(rng[k]),rand(rng[k]))
        y = (rand(rng[k]),rand(rng[k]))
        d[k] += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
    end
    return sum(d)/n
end

And benchmarks

@btime meandist(100_000_000)
  # 1.690 s (4 allocations: 2.98 GiB)

rngs = Tuple(StableRNG.(2020 .+ collect(1:Threads.nthreads())))
@btime meandist3(100_000_000, $rngs)
  # 1.060 s (43 allocations: 6.13 KiB)

It is important, because if you have less then 30Gb it is the only way to run

julia> @time meandist3(1_000_000_000, rngs)
 13.683740 seconds (44 allocations: 6.141 KiB)
0.5214066471370079

EDIT: ha, interestingly enough, MersenneTwister suddenly is the best here

julia> rngs = Tuple(Random.MersenneTwister.(2020 .+ collect(1:Threads.nthreads())))
julia> @time meandist3(1_000_000_000, rngs)
  8.425097 seconds (48 allocations: 6.266 KiB)
0.521398839144987
2 Likes

I have tried to follow the great video from Prof. Edelman here, to sketch a solution using FLoops.jl. This seems to work relatively fast on Win10 laptop with 10 processors, but I do not understand the difference with the Threads approach above:

using FLoops, BenchmarkTools
function meandist_parallel(n,x,y,ncores)
    d = 0
    @floop ThreadedEx(basesize=n÷ncores) for i in 1:n
       @reduce ( d += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2) )
    end
    return d/n
end

n = 100_000_000
x = rand(n,2); y = rand(n,2);

julia> @btime meandist_parallel($n,$x,$y,10)   # ncores=10
  106.135 ms (644 allocations: 36.56 KiB)
0.5214049793107156

Thank you.

My short answer: I don’t know.

Since nobody really inside this answered, my long answer:

Here the option with Threads performs as well as that FLoops code you sent just by adding a @inbounds:

julia> function meandist_parallel($n,$x,$y)
         d = zeros(Threads.nthreads())
         Threads.@threads for i in 1:n
           @inbounds d[Threads.threadid()] += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2)
         end
         return sum(d)/n
       end
meandist_parallel (generic function with 2 methods)

julia> @btime meandist_parallel($n,$x,$y)
  112.104 ms (22 allocations: 3.11 KiB)
0.5213806013784907

julia> function meandist_parallel(n,x,y,ncores)
           d = 0
           @floop ThreadedEx(basesize=n÷ncores) for i in 1:n
              @reduce ( d += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2) )
           end
           return d/n
       end
meandist_parallel (generic function with 2 methods)

julia> @btime meandist_parallel($n,$x,$y,Threads.nthreads())
  112.343 ms (136 allocations: 7.27 KiB)
0.5213806013784907

(and allocates less, thus in this case the first option is in general better).

My two-cents is that there are many packages which have been part of the development of the multi-threading and parallelization development of Julia and now are becoming part of Base or getting deprecated, and that is one of those. But I might be very wrong.

1 Like

No, this is not exactly true, FLoops.jl is definitely not a deprecated package. Maybe the author of the package would come in and explain it better than I, but originally FLoops.jl and it’s brother ThreadsX.jl (you can read corresponding announcements here and here ) are built on the basis of Transducers.jl and they are very flexible and powerful tools for writing high-level multithreading and parallel processing code.

In this particular problem, they do not shine, because this is a typical embarrassingly parallel problem and indeed @Threads.threads macro is enough to solve it. But they are useful in other scenarios, for example, parallel sorting, where simple @threads macro can’t be applied.

4 Likes