Thanks for the reply, It’s really helpful. Didn’t even think about StaticArrays
. Just so I can learn, what do you mean by “anti-patterns”?
Probably this is one “anti-pattern”. You are generating a tuple of random points. Tuples are immutable, so this can potentially be more performant than simply rand(2)
, which generates a vector, which is mutable. Yet, you could have used
random_point = (rand(),rand())
but that would be also somewhat unnatural (yet, for this specific purpose, just fine), if you really want to use static arrays, you would want to the StaticArrays package:
julia> using StaticArrays
julia> random_point = @SVector rand(2)
2-element SArray{Tuple{2},Float64,1,2} with indices SOneTo(2):
0.9595201727211979
0.06002658612324363
(but I think using static arrays is a future step here, after solving other more important issues)
Somewhat related, for MC calculations have a look at
https://baggepinnen.github.io/MonteCarloMeasurements.jl/latest/
Awesome package which makes it easy to do MC. Your original example becomes:
julia> using MonteCarloMeasurements, Distributions
julia> n = 1000;
julia> p1 = [Particles(n, Uniform(0,1)), Particles(n, Uniform(0,1))]
2-element Array{Particles{Float64,1000},1}:
0.5 ± 0.29
0.5 ± 0.29
julia> p2 = [Particles(n, Uniform(0,1)), Particles(n, Uniform(0,1))];
julia> dist(p1, p2) = sqrt((p1[1]-p2[1])^2 + (p1[2]-p2[2])^2)
dist (generic function with 1 method)
julia> dist(p1,p2)
Particles{Float64,1000}
0.523052 ± 0.25
I think it’s fast too, here the timing on a new laptop:
julia> @btime dist(p1,p2)
9.568 μs (7 allocations: 47.64 KiB)
Particles{Float64,1000}
0.523052 ± 0.25
Instead of making these points individually, you may wish to keep them as columns of a matrix. In which case this is a possible one-line solution:
julia> using Distances
julia> @btime mean(pairwise(Euclidean(), rand(2, 10^3), dims=2))
3.188 ms (4 allocations: 7.65 MiB)
0.5246766109743328
julia> @btime mean(colwise(Euclidean(), rand(2, 10^3), rand(2, 10^3)))
6.929 μs (3 allocations: 39.44 KiB)
0.5127943200515588
Note that the first way is N points and N^2 distances, while the second (and the original post) are 2N points & N distances.
The N^2 way seems to give a smaller deviation, I’m surprised by the ± 0.25
estimate above:
julia> std([mean(pairwise(Euclidean(), rand(2, 10^3), dims=2)) for _ in 1:500])
0.005379150837174151
julia> std([mean(colwise(Euclidean(), rand(2, 10^3), rand(2, 10^3))) for _ in 1:500])
0.008066908056271286
Just for the sake of didactics:
The “natural” way to solve that problem is probably:
julia> function meandist(n)
d = 0.
for i in 1:n
x = rand(2)
y = rand(2)
d += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
end
return d/n
end
julia> @btime meandist(1000)
88.329 μs (2000 allocations: 187.50 KiB)
0.5351316591605562
This is quite slow and allocates a lot, because of the allocations of x
and y
random vectors at each iteration of the loop. These allocations can be solved by using static allocations, with tuples:
julia> function meandist(n)
d = 0.
for i in 1:n
x = (rand(),rand())
y = (rand(),rand())
d += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
end
return d/n
end
meandist (generic function with 1 method)
julia> @btime meandist(1000)
13.012 μs (0 allocations: 0 bytes)
0.5122565299210201
or with the StaticArrays package:
julia> using StaticArrays
julia> function meandist(n)
d = 0.
for i in 1:n
x = SVector{2,Float64}(rand(),rand())
y = SVector{2,Float64}(rand(),rand())
d += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
end
return d/n
end
meandist (generic function with 1 method)
julia> @btime meandist(1000)
13.027 μs (0 allocations: 0 bytes)
0.5096925569126717
These options are not as fast as the one that uses the Distances
package because there the computation of the distances is loop-vectorized. We can do that “by hand” (edit: without manually using @avx
the performance is the same - I guess loop vectorization occurs anyway, automatically):
julia> using LoopVectorization
julia> function meandist(n)
d = 0.
x = rand(2,n)
y = rand(2,n)
@avx for i in 1:n
d += sqrt((x[1,i]-y[1,i])^2+(x[2,i]-y[2,i])^2)
end
return d/n
end
meandist (generic function with 1 method)
julia> @btime meandist(1000)
5.006 μs (2 allocations: 31.50 KiB)
0.5336431684976665
and, of course, lot of that time is just the generation of the two matrices:
julia> function meandist(n,x,y)
d = 0.
@avx for i in 1:n
d += sqrt((x[1,i]-y[1,i])^2+(x[2,i]-y[2,i])^2)
end
return d/n
end
meandist (generic function with 2 methods)
julia> x = rand(2,1000); y = rand(2,1000);
julia> @btime meandist(1000,$x,$y)
1.530 μs (0 allocations: 0 bytes)
0.5345441925830112
Edit: Without generating the list of random points inside the function, the option with static arrays is as fast as the loop-vectorized (perhaps it gets loop-vectorized by the compiler automatically):
julia> function meandist(n,x,y)
d = 0.
for i in 1:n
d += sqrt((x[i][1]-y[i][1])^2+(x[i][2]-y[i][2])^2)
end
return d/n
end
meandist (generic function with 2 methods)
julia> x = rand(SVector{2,Float64},1000)
julia> y = rand(SVector{2,Float64},1000)
julia> @btime meandist(1000,$x,$y)
1.508 μs (0 allocations: 0 bytes)
0.5239587680180073
The equivalent version without static arrays is slower, but not bad:
julia> x = [ rand(2) for i in 1:1000 ];
julia> y = [ rand(2) for i in 1:1000 ];
julia> @btime meandist(1000,$x,$y)
3.292 μs (0 allocations: 0 bytes)
0.5311847109064113
Edit 2:
Given that allocating the vectors outside the function gives similar results for static arrays or matrices, the difference is in the time required to generate those. There are important differences there, indeed:
julia> @btime [ SVector{2,Float64}(rand(),rand()) for i in 1:1000 ];
12.237 μs (1 allocation: 15.75 KiB)
julia> @btime rand(SVector{2,Float64},1000);
3.195 μs (1 allocation: 15.75 KiB)
julia> @btime rand(2,1000);
1.605 μs (1 allocation: 15.75 KiB)
That’s surprising, and surely fixable.
I think you may not be getting much from @avx
here, if I replace it with @inbounds
that function is equally quick. But if you permute the matrices, so that it iterates along the dense dimension, then it’s quicker:
julia> @btime meandist(1000,$x,$y); # as before, but on my computer
1.704 μs (0 allocations: 0 bytes)
julia> @btime meandist0(1000,$x,$y); # version without @avx
1.674 μs (0 allocations: 0 bytes)
julia> xt = rand(1000, 2)'; yt = rand(1000, 2)';
julia> @btime meandist(1000,$xt,$yt); # iterating along stride=1 direction
838.082 ns (0 allocations: 0 bytes)
julia> strides(x)
(1, 2)
julia> strides(xt)
(1000, 1)
Excellent, thanks. Yes, for the loop vectorization be efficient the larger dimension should be column-based.
According to Cohen and Courgeau (2019), it seems that the theoretical standard deviation is ~0.248 (variance is ~0.0615)
Oh right, of course there are two standard deviations, for one pair of points, or for all 1000. Adapting my line above, this is the one-pair case, which is clearly what MonteCarloMeasurements was tracking:
julia> std([only(colwise(Euclidean(), rand(2, 1), rand(2, 1))) for _ in 1:500])
0.24930807859433118
Here is an alternative which can be multithreaded by using ThreadsX.jl and OnlineStats.jl. Also useful for when N
is quite large. Even doing 1 billion comparisons only took a handful of seconds for me.
julia> distfun(x,y) = sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
distfun (generic function with 1 method)
julia> m=ThreadsX.reduce(Moments(), distfun(rand(2),rand(2)) for _ in 1:10000000)
Moments: n=10000000 | value=[0.52133, 0.333226, 0.240583, 0.188733]
julia> mean(m)
0.5213296332974621
julia> std(m)
0.24787385140189952
Don’t forget to start Julia with multiple threads, eg. julia -t N
It is also may be useful to use another RNGs
using BenchmarkTools
using RandomNumbers.Xorshifts
using StableRNGs
function meandist(n, rng = Random.GLOBAL_RNG)
d = 0.
x = rand(rng, 2, n)
y = rand(rng, 2, n)
for i in 1:n
d += sqrt((x[1,i]-y[1,i])^2+(x[2,i]-y[2,i])^2)
end
return d/n
end
function meandist2(n, rng = Random.GLOBAL_RNG)
d = 0.
for i in 1:n
x = (rand(rng),rand(rng))
y = (rand(rng),rand(rng))
d += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
end
return d/n
end
julia> @btime meandist(1000)
5.346 μs (2 allocations: 31.50 KiB)
julia> rng = Xoroshiro128Plus()
julia> @btime meandist2(1000, $rng)
5.154 μs (0 allocations: 0 bytes)
julia> rng = StableRNG(2020)
julia> @btime meandist2(1000, $rng)
4.744 μs (0 allocations: 0 bytes)
So other RNGs can be faster and can use non allocating scheme.
Again, just for didactical purposes (my aim here is to show that those “one-liners” are not magic, they can be written in Julia, I think that they might be a show-stopper for a new user, because while using package functions is great and must be done, what differeciate Julia from scripted languages is that the performance of such functions can be achieved by the user and, thus, these functions can be customized in every detail depending on the user case):
julia> function meandist(n,x,y)
d = 0.
for i in 1:n
d += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2)
end
return d/n
end
meandist (generic function with 1 method)
julia> function meandist_parallel(n,x,y)
d = zeros(Threads.nthreads())
Threads.@threads for i in 1:n
d[Threads.threadid()] += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2)
end
return sum(d)/n
end
meandist_parallel (generic function with 1 method)
julia> n = 100_000_000
100000000
julia> x = rand(n,2); y = rand(n,2);
julia> using BenchmarkTools
julia> @btime meandist($n,$x,$y)
212.870 ms (0 allocations: 0 bytes)
0.5213806013781946
julia> @btime meandist_parallel($n,$x,$y)
134.889 ms (22 allocations: 3.11 KiB)
0.5213806013784907
julia> Threads.nthreads()
4
Scaling here is not great, but the idea is simple.
Multi-threading version, which uses random number generation is more involved, it was discussed here
In a simplified version it may look like this
function meandist3(n, rng)
d = zeros(Threads.nthreads())
Threads.@threads for i in 1:n
k = Threads.threadid()
x = (rand(rng[k]),rand(rng[k]))
y = (rand(rng[k]),rand(rng[k]))
d[k] += sqrt((x[1]-y[1])^2+(x[2]-y[2])^2)
end
return sum(d)/n
end
And benchmarks
@btime meandist(100_000_000)
# 1.690 s (4 allocations: 2.98 GiB)
rngs = Tuple(StableRNG.(2020 .+ collect(1:Threads.nthreads())))
@btime meandist3(100_000_000, $rngs)
# 1.060 s (43 allocations: 6.13 KiB)
It is important, because if you have less then 30Gb it is the only way to run
julia> @time meandist3(1_000_000_000, rngs)
13.683740 seconds (44 allocations: 6.141 KiB)
0.5214066471370079
EDIT: ha, interestingly enough, MersenneTwister
suddenly is the best here
julia> rngs = Tuple(Random.MersenneTwister.(2020 .+ collect(1:Threads.nthreads())))
julia> @time meandist3(1_000_000_000, rngs)
8.425097 seconds (48 allocations: 6.266 KiB)
0.521398839144987
I have tried to follow the great video from Prof. Edelman here, to sketch a solution using FLoops.jl. This seems to work relatively fast on Win10 laptop with 10 processors, but I do not understand the difference with the Threads approach above:
using FLoops, BenchmarkTools
function meandist_parallel(n,x,y,ncores)
d = 0
@floop ThreadedEx(basesize=n÷ncores) for i in 1:n
@reduce ( d += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2) )
end
return d/n
end
n = 100_000_000
x = rand(n,2); y = rand(n,2);
julia> @btime meandist_parallel($n,$x,$y,10) # ncores=10
106.135 ms (644 allocations: 36.56 KiB)
0.5214049793107156
Thank you.
My short answer: I don’t know.
Since nobody really inside this answered, my long answer:
Here the option with Threads
performs as well as that FLoops
code you sent just by adding a @inbounds
:
julia> function meandist_parallel($n,$x,$y)
d = zeros(Threads.nthreads())
Threads.@threads for i in 1:n
@inbounds d[Threads.threadid()] += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2)
end
return sum(d)/n
end
meandist_parallel (generic function with 2 methods)
julia> @btime meandist_parallel($n,$x,$y)
112.104 ms (22 allocations: 3.11 KiB)
0.5213806013784907
julia> function meandist_parallel(n,x,y,ncores)
d = 0
@floop ThreadedEx(basesize=n÷ncores) for i in 1:n
@reduce ( d += sqrt((x[i,1]-y[i,1])^2+(x[i,2]-y[i,2])^2) )
end
return d/n
end
meandist_parallel (generic function with 2 methods)
julia> @btime meandist_parallel($n,$x,$y,Threads.nthreads())
112.343 ms (136 allocations: 7.27 KiB)
0.5213806013784907
(and allocates less, thus in this case the first option is in general better).
My two-cents is that there are many packages which have been part of the development of the multi-threading and parallelization development of Julia and now are becoming part of Base or getting deprecated, and that is one of those. But I might be very wrong.
No, this is not exactly true, FLoops.jl
is definitely not a deprecated package. Maybe the author of the package would come in and explain it better than I, but originally FLoops.jl
and it’s brother ThreadsX.jl (you can read corresponding announcements here and here ) are built on the basis of Transducers.jl and they are very flexible and powerful tools for writing high-level multithreading and parallel processing code.
In this particular problem, they do not shine, because this is a typical embarrassingly parallel problem and indeed @Threads.threads
macro is enough to solve it. But they are useful in other scenarios, for example, parallel sorting, where simple @threads
macro can’t be applied.
@lmiq thank you.
For the record, for n = 1_500_000_000
, on a 64GB RAM Win10 laptop with 11 processors used, the results for the two parallel solutions above stay close in terms of speed using @btime
:
@floop: 1.568 s (645 allocations: 36.59 KiB)
Threads.@threads: 1.576 s (58 allocations: 9.50 KiB)
When n equals 2 billion, the Julia code still runs on my laptop but x,y
fill the RAM up and the @btime results seem to be non-repeatable.
FYI, with FLoops.jl, FoldsCUDA.jl, and Random123.jl, you can easily move this to GPU with a small modification:
using CUDA
using FLoops
using FoldsCUDA
using Random123
function counters(n)
stride = typemax(UInt64) ÷ n
return UInt64(0):stride:typemax(UInt64) - stride
end
function meandist_floop(n; m=10_000, ex=has_cuda_gpu() ? CUDAEx() : ThreadedEx())
@floop ex for ctr in counters(n)
rng = set_counter!(Philox2x(0), ctr)
s = 0.0
for _ in 1:m
x1 = rand(rng)
y1 = rand(rng)
x2 = rand(rng)
y2 = rand(rng)
s += sqrt((x1 - y1)^2 + (x2 - y2)^2)
end
@reduce(d = 0.0 + s)
end
return d / (n * m)
end
meandist_floop(2^10) # run on GPU if you have one
meandist_floop(2^10; ex = ThreadedEx()) # forcing to run on CPU threads
meandist_floop(2^10; ex = DistributedEx()) # run on multiple processes/nodes
For an explanation on how to use counter-based RNG on GPU, see: Monte-Carlo π · FoldsCUDA
@tkf: would it be possible to shed some more light on your top-class post:
-
On Windows 10 laptop the GPU option is not available (NVIDIA Quadro P3200), while the
ThreadedEx
andDistributedEx
options do run similarly fast (12 Intel Xeon E-2186 CPUs). What is the difference between these two options? -
Could you provide a basic explanation on the counter-based RNG logic? (i.e., why the problem is decoupled in n x m loops). The link that you have provided is rather hard to grasp.
Thank you in advance.
ThreadedEx
uses threads (Threads.@spawn
) and DistributedEx
uses multiple processes (Distributed.jl). Both let us use multiple CPUs but the latter let us use multiple machines. But if you are using only one machine, I’d try threading first. It’s much easier to share things and has less overhead.
But, unfortunately, choosing threading vs multi-processing is a bit complicated in Julia and case-specific. I’m trying to explain a useful guideline here: Frequently asked questions
I’m not a PRNG specialist but here’s my understanding . Recall that all PRNGs are “just” clever state machine with a large but finite period (the minimum number of iterations that takes for the PRNG to come back to the initial state). What is specific about counter-based PRNG is that you can set the state of the PRNG to the arbitrary point (“phase”) of the period with a very small computation. That’s the set_counter!
function I’m calling. So, since we know the period of the PRNG and there is a way to choose where to start, it lets us evenly split the entire period into chunks and use each chunk locally within a thread.
I’m introducing the inner loop because set_counter!
has an overhead comparable to the actual random number generation. So, it makes sense to use the RNG for a while after you set the counter.