Advice for improving Monte-Carlo code

s0916314 · November 8, 2020, 10:41am

I’m a relative beginner with Julia and was hoping someone may be able to give me some constructive feedback on a small script I’ve written. I’m mainly looking for performance tips and ways to make my code clearer or less verbose. The goal of the script is to a simple Monte-Carlo experiment to calculate the expected distance between two random uniformly distributed points in the unit square. Many thanks.

using Random
using Statistics

#=
Calculates the expected distance between two points in the unit square
picked randomly from a uniform distribution. Exact answer:
(1/15)(2 + sqrt(2) + 5*log(1+sqrt(2)) ≈ 0.5214
=#

trials = 1000

function generate_points(trials)
    a_points = []; b_points = [];

    for i in 1:(2*trials)
        random_point = map(rand, (Float64, Float64))
        if i <= trials
            push!(a_points, random_point)
        else
            push!(b_points, random_point)
        end
    end
    return a_points, b_points
end

function d(a_points, b_points)
    dist = []

    for i in 1:trials
        x = sqrt((a_points[i][2]-b_points[i][2])^2 +
        (a_points[i][1]-b_points[i][1])^2)
        push!(dist, x)
    end
    return dist
end

a, b = generate_points(trials);
distance = d(a, b);
expected_distance = mean(distance)

lmiq · November 8, 2020, 11:14am

Avoid using push to increment your data point if you know the final size of the arrays, use, better:

a_point = Vector{Vector{Float64}}(undef,trials)

and later use

a_point[i] = rand(2)

for each element.

Anyway, in your example you are just generating a lot of random points, you could use:

a_points = [ rand(2) for i in 1:trials ]

Same thing with dist, try to allocate it complete if possible, with dist = zeros(trials), or dist=Vector{Float64}(undef,trials), and then use
dist[i] = sqrt(....)
Even if you had less information a priori about the size of the vectors, it is important to provide which type of element they will contain. For example:

d = []

will perform much worse than

d = Float64[]

Finally, for that specific problem, you do not need to save the points at all (if you only want the final average distance), thus probably you do not need to generate the points and compute the distances in two separate functions and loops. Do all at once generating the points on the flight.

edit: probably this is a homework, please make that clear.

s0916314 · November 8, 2020, 11:25am

Thanks for the feedback. No, it’s not Homework I’m just trying to learn Julia for personal interest. Probably best not to assume without proper justification.

foobar_lv2 · November 8, 2020, 11:29am

julia> using Random, Statistics, StaticArrays, LinearAlgebra, BenchmarkTools
julia> generate_points(nTrials) = (rand(SVector{2, Float64}, nTrials), rand(SVector{2, Float64}, nTrials))
julia> d(a, b) = map((x,y)->norm(x-y), a, b)
julia> @btime begin
       a,b = generate_points(trials);
       distance = d(a,b);
       meandist = mean(distance);
       end;
  16.110 μs (5 allocations: 39.48 KiB)

Your original code contained a bunch of anti-patterns; primarily, you should almost never initialize an accumulator array with buf = [], because that is equivalent to buf = Vector{Any}(); second, you should use available high-quality libraries like StaticArrays; third, pushing into a Vector in a loop is bad style in cases where you know the desired size beforehand.

s0916314 · November 8, 2020, 11:35am

Thanks for the reply, It’s really helpful. Didn’t even think about StaticArrays . Just so I can learn, what do you mean by “anti-patterns”?

lmiq · November 8, 2020, 11:48am

Probably this is one “anti-pattern”. You are generating a tuple of random points. Tuples are immutable, so this can potentially be more performant than simply rand(2), which generates a vector, which is mutable. Yet, you could have used

random_point = (rand(),rand())

but that would be also somewhat unnatural (yet, for this specific purpose, just fine), if you really want to use static arrays, you would want to the StaticArrays package:

julia> using StaticArrays

julia> random_point = @SVector rand(2)
2-element SArray{Tuple{2},Float64,1,2} with indices SOneTo(2):
 0.9595201727211979
 0.06002658612324363

(but I think using static arrays is a future step here, after solving other more important issues)

mauro3 · November 8, 2020, 12:33pm

Somewhat related, for MC calculations have a look at
https://baggepinnen.github.io/MonteCarloMeasurements.jl/latest/
Awesome package which makes it easy to do MC. Your original example becomes:

julia> using MonteCarloMeasurements, Distributions

julia> n = 1000;

julia> p1 = [Particles(n, Uniform(0,1)), Particles(n, Uniform(0,1))]
2-element Array{Particles{Float64,1000},1}:
 0.5 ± 0.29
 0.5 ± 0.29

julia> p2 = [Particles(n, Uniform(0,1)), Particles(n, Uniform(0,1))];

julia> dist(p1, p2) = sqrt((p1[1]-p2[1])^2 + (p1[2]-p2[2])^2)
dist (generic function with 1 method)

julia> dist(p1,p2)
Particles{Float64,1000}
 0.523052 ± 0.25

I think it’s fast too, here the timing on a new laptop:

julia> @btime dist(p1,p2)
  9.568 μs (7 allocations: 47.64 KiB)
Particles{Float64,1000}
 0.523052 ± 0.25

mcabbott · November 8, 2020, 12:54pm

Instead of making these points individually, you may wish to keep them as columns of a matrix. In which case this is a possible one-line solution:

julia> using Distances

julia> @btime mean(pairwise(Euclidean(), rand(2, 10^3), dims=2))
  3.188 ms (4 allocations: 7.65 MiB)
0.5246766109743328

julia> @btime mean(colwise(Euclidean(), rand(2, 10^3), rand(2, 10^3)))
  6.929 μs (3 allocations: 39.44 KiB)
0.5127943200515588

Note that the first way is N points and N^2 distances, while the second (and the original post) are 2N points & N distances.

The N^2 way seems to give a smaller deviation, I’m surprised by the ± 0.25 estimate above:

julia> std([mean(pairwise(Euclidean(), rand(2, 10^3), dims=2)) for _ in 1:500])
0.005379150837174151

julia> std([mean(colwise(Euclidean(), rand(2, 10^3), rand(2, 10^3))) for _ in 1:500])
0.008066908056271286

lmiq · November 8, 2020, 1:12pm

Just for the sake of didactics:

The “natural” way to solve that problem is probably:

julia> function meandist(n)
         d = 0.
         for i in 1:n
           x = rand(2)
           y = rand(2)
           d += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
         end
         return d/n
       end

julia> @btime meandist(1000)
  88.329 μs (2000 allocations: 187.50 KiB)
0.5351316591605562

This is quite slow and allocates a lot, because of the allocations of x and y random vectors at each iteration of the loop. These allocations can be solved by using static allocations, with tuples:

julia> function meandist(n)
         d = 0.
         for i in 1:n
           x = (rand(),rand())
           y = (rand(),rand())
           d += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
         end
         return d/n
       end
meandist (generic function with 1 method)

julia> @btime meandist(1000)
  13.012 μs (0 allocations: 0 bytes)
0.5122565299210201

or with the StaticArrays package:

julia> using StaticArrays

julia> function meandist(n)
         d = 0.
         for i in 1:n
           x = SVector{2,Float64}(rand(),rand())
           y = SVector{2,Float64}(rand(),rand())
           d += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
         end
         return d/n
       end
meandist (generic function with 1 method)

julia> @btime meandist(1000)
  13.027 μs (0 allocations: 0 bytes)
0.5096925569126717

These options are not as fast as the one that uses the Distances package because there the computation of the distances is loop-vectorized. We can do that “by hand” (edit: without manually using @avx the performance is the same - I guess loop vectorization occurs anyway, automatically):

julia> using LoopVectorization

julia> function meandist(n)
         d = 0.
         x = rand(2,n)
         y = rand(2,n)
         @avx for i in 1:n
           d += sqrt((x[1,i]-y[1,i])^2+(x[2,i]-y[2,i])^2)
         end
         return d/n
       end
meandist (generic function with 1 method)

julia> @btime meandist(1000)
  5.006 μs (2 allocations: 31.50 KiB)
0.5336431684976665

and, of course, lot of that time is just the generation of the two matrices:

julia> function meandist(n,x,y)
         d = 0.
         @avx for i in 1:n
           d += sqrt((x[1,i]-y[1,i])^2+(x[2,i]-y[2,i])^2)
         end
         return d/n
       end
meandist (generic function with 2 methods)

julia> x = rand(2,1000); y = rand(2,1000);

julia> @btime meandist(1000,$x,$y)
  1.530 μs (0 allocations: 0 bytes)
0.5345441925830112

Edit: Without generating the list of random points inside the function, the option with static arrays is as fast as the loop-vectorized (perhaps it gets loop-vectorized by the compiler automatically):

julia> function meandist(n,x,y)
         d = 0.
         for i in 1:n
           d += sqrt((x[i][1]-y[i][1])^2+(x[i][2]-y[i][2])^2)
         end
         return d/n
       end
meandist (generic function with 2 methods)

julia> x = rand(SVector{2,Float64},1000)

julia> y = rand(SVector{2,Float64},1000)

julia> @btime meandist(1000,$x,$y)
  1.508 μs (0 allocations: 0 bytes)
0.5239587680180073

The equivalent version without static arrays is slower, but not bad:

julia> x = [ rand(2) for i in 1:1000 ];

julia> y = [ rand(2) for i in 1:1000 ];

julia> @btime meandist(1000,$x,$y)
  3.292 μs (0 allocations: 0 bytes)
0.5311847109064113

Edit 2:

Given that allocating the vectors outside the function gives similar results for static arrays or matrices, the difference is in the time required to generate those. There are important differences there, indeed:

julia> @btime [ SVector{2,Float64}(rand(),rand()) for i in 1:1000 ];
  12.237 μs (1 allocation: 15.75 KiB)

julia> @btime rand(SVector{2,Float64},1000);
  3.195 μs (1 allocation: 15.75 KiB)

julia> @btime rand(2,1000);
  1.605 μs (1 allocation: 15.75 KiB)

mcabbott · November 8, 2020, 2:39pm

That’s surprising, and surely fixable.

I think you may not be getting much from @avx here, if I replace it with @inbounds that function is equally quick. But if you permute the matrices, so that it iterates along the dense dimension, then it’s quicker:

julia> @btime meandist(1000,$x,$y); # as before, but on my computer
  1.704 μs (0 allocations: 0 bytes)

julia> @btime meandist0(1000,$x,$y); # version without @avx
  1.674 μs (0 allocations: 0 bytes)

julia> xt = rand(1000, 2)'; yt = rand(1000, 2)';

julia> @btime meandist(1000,$xt,$yt); # iterating along stride=1 direction
  838.082 ns (0 allocations: 0 bytes)

julia> strides(x)
(1, 2)

julia> strides(xt)
(1000, 1)

lmiq · November 8, 2020, 2:56pm

Excellent, thanks. Yes, for the loop vectorization be efficient the larger dimension should be column-based.

rafael.guerra · November 8, 2020, 3:27pm

According to Cohen and Courgeau (2019), it seems that the theoretical standard deviation is ~0.248 (variance is ~0.0615)

mcabbott · November 8, 2020, 3:49pm

Oh right, of course there are two standard deviations, for one pair of points, or for all 1000. Adapting my line above, this is the one-pair case, which is clearly what MonteCarloMeasurements was tracking:

julia> std([only(colwise(Euclidean(), rand(2, 1), rand(2, 1))) for _ in 1:500]) 
0.24930807859433118

tbeason · November 8, 2020, 4:32pm

Here is an alternative which can be multithreaded by using ThreadsX.jl and OnlineStats.jl. Also useful for when N is quite large. Even doing 1 billion comparisons only took a handful of seconds for me.

julia> distfun(x,y) = sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
distfun (generic function with 1 method)

julia> m=ThreadsX.reduce(Moments(), distfun(rand(2),rand(2)) for _ in 1:10000000)
Moments: n=10000000 | value=[0.52133, 0.333226, 0.240583, 0.188733]

julia> mean(m)
0.5213296332974621

julia> std(m)
0.24787385140189952

Don’t forget to start Julia with multiple threads, eg. julia -t N

Skoffer · November 8, 2020, 4:38pm

It is also may be useful to use another RNGs

using BenchmarkTools
using RandomNumbers.Xorshifts
using StableRNGs

function meandist(n, rng = Random.GLOBAL_RNG)
    d = 0.

    x = rand(rng, 2, n)
    y = rand(rng, 2, n)
    for i in 1:n
        d += sqrt((x[1,i]-y[1,i])^2+(x[2,i]-y[2,i])^2)
    end
    return d/n
end

function meandist2(n, rng = Random.GLOBAL_RNG)
    d = 0.

    for i in 1:n
        x = (rand(rng),rand(rng))
        y = (rand(rng),rand(rng))
        d += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
    end
    return d/n
end

julia> @btime meandist(1000)
  5.346 μs (2 allocations: 31.50 KiB)

julia> rng = Xoroshiro128Plus()
julia> @btime meandist2(1000, $rng)
  5.154 μs (0 allocations: 0 bytes)

julia> rng = StableRNG(2020)
julia> @btime meandist2(1000, $rng)
  4.744 μs (0 allocations: 0 bytes)

So other RNGs can be faster and can use non allocating scheme.

lmiq · November 8, 2020, 5:53pm

Again, just for didactical purposes (my aim here is to show that those “one-liners” are not magic, they can be written in Julia, I think that they might be a show-stopper for a new user, because while using package functions is great and must be done, what differeciate Julia from scripted languages is that the performance of such functions can be achieved by the user and, thus, these functions can be customized in every detail depending on the user case):


julia> function meandist(n,x,y)
         d = 0.
         for i in 1:n
           d += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2)
         end
         return d/n
       end
meandist (generic function with 1 method)

julia> function meandist_parallel(n,x,y)
         d = zeros(Threads.nthreads())
         Threads.@threads for i in 1:n
           d[Threads.threadid()] += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2)
         end
         return sum(d)/n
       end
meandist_parallel (generic function with 1 method)


julia> n = 100_000_000
100000000

julia> x = rand(n,2); y = rand(n,2);

julia> using BenchmarkTools

julia> @btime meandist($n,$x,$y)
  212.870 ms (0 allocations: 0 bytes)
0.5213806013781946

julia> @btime meandist_parallel($n,$x,$y)
  134.889 ms (22 allocations: 3.11 KiB)
0.5213806013784907

julia> Threads.nthreads()
4

Scaling here is not great, but the idea is simple.

Skoffer · November 8, 2020, 6:21pm

Multi-threading version, which uses random number generation is more involved, it was discussed here

In a simplified version it may look like this

function meandist3(n, rng)
    d = zeros(Threads.nthreads())
    Threads.@threads for i in 1:n
        k = Threads.threadid()
        x = (rand(rng[k]),rand(rng[k]))
        y = (rand(rng[k]),rand(rng[k]))
        d[k] += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
    end
    return sum(d)/n
end

And benchmarks

@btime meandist(100_000_000)
  # 1.690 s (4 allocations: 2.98 GiB)

rngs = Tuple(StableRNG.(2020 .+ collect(1:Threads.nthreads())))
@btime meandist3(100_000_000, $rngs)
  # 1.060 s (43 allocations: 6.13 KiB)

It is important, because if you have less then 30Gb it is the only way to run

julia> @time meandist3(1_000_000_000, rngs)
 13.683740 seconds (44 allocations: 6.141 KiB)
0.5214066471370079

EDIT: ha, interestingly enough, MersenneTwister suddenly is the best here

julia> rngs = Tuple(Random.MersenneTwister.(2020 .+ collect(1:Threads.nthreads())))
julia> @time meandist3(1_000_000_000, rngs)
  8.425097 seconds (48 allocations: 6.266 KiB)
0.521398839144987

rafael.guerra · November 8, 2020, 7:04pm

I have tried to follow the great video from Prof. Edelman here, to sketch a solution using FLoops.jl. This seems to work relatively fast on Win10 laptop with 10 processors, but I do not understand the difference with the Threads approach above:

using FLoops, BenchmarkTools
function meandist_parallel(n,x,y,ncores)
    d = 0
    @floop ThreadedEx(basesize=n÷ncores) for i in 1:n
       @reduce ( d += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2) )
    end
    return d/n
end

n = 100_000_000
x = rand(n,2); y = rand(n,2);

julia> @btime meandist_parallel($n,$x,$y,10)   # ncores=10
  106.135 ms (644 allocations: 36.56 KiB)
0.5214049793107156

Thank you.

lmiq · November 9, 2020, 12:18am

My short answer: I don’t know.

Since nobody really inside this answered, my long answer:

Here the option with Threads performs as well as that FLoops code you sent just by adding a @inbounds:

julia> function meandist_parallel($n,$x,$y)
         d = zeros(Threads.nthreads())
         Threads.@threads for i in 1:n
           @inbounds d[Threads.threadid()] += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2)
         end
         return sum(d)/n
       end
meandist_parallel (generic function with 2 methods)

julia> @btime meandist_parallel($n,$x,$y)
  112.104 ms (22 allocations: 3.11 KiB)
0.5213806013784907

julia> function meandist_parallel(n,x,y,ncores)
           d = 0
           @floop ThreadedEx(basesize=n÷ncores) for i in 1:n
              @reduce ( d += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2) )
           end
           return d/n
       end
meandist_parallel (generic function with 2 methods)

julia> @btime meandist_parallel($n,$x,$y,Threads.nthreads())
  112.343 ms (136 allocations: 7.27 KiB)
0.5213806013784907

(and allocates less, thus in this case the first option is in general better).

My two-cents is that there are many packages which have been part of the development of the multi-threading and parallelization development of Julia and now are becoming part of Base or getting deprecated, and that is one of those. But I might be very wrong.

Skoffer · November 9, 2020, 8:12am

No, this is not exactly true, FLoops.jl is definitely not a deprecated package. Maybe the author of the package would come in and explain it better than I, but originally FLoops.jl and it’s brother ThreadsX.jl (you can read corresponding announcements here and here ) are built on the basis of Transducers.jl and they are very flexible and powerful tools for writing high-level multithreading and parallel processing code.

In this particular problem, they do not shine, because this is a typical embarrassingly parallel problem and indeed @Threads.threads macro is enough to solve it. But they are useful in other scenarios, for example, parallel sorting, where simple @threads macro can’t be applied.

Topic		Replies	Views
Effective simulation of putting n-1 balls in n boxes uniformly at random Performance	12	1155	May 4, 2019
Monte Carlo simulation function performance New to Julia question	4	1024	March 3, 2018
Improving performance of mapping algorithm Performance question	17	893	April 28, 2020
A simple Monte Carlo in Julia and C++ New to Julia question	15	6825	March 15, 2018
Monte Carlo particle simulations are very slow Performance	9	969	May 11, 2020

Advice for improving Monte-Carlo code

Related topics