I wanted to port a parallel in-place shuffle algorithm, MergeShuffle: paper, repo
It works by splitting the array into k subarrays and shuffling them in parallel, then merging neighboring subarrays (in a parallel, tree-like fashion) in a way that produces a uniformly random global permutation.
The C code is very straightforward, consisting of two routines: shuffle.c and merge.c (plus an optional optimized assembly version, merge.s).
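For reference, the core merge step can be sketched in Julia roughly as follows. This is my own 1-based transliteration of the idea in merge.c, not the repo's actual API; the name `merge_shuffled!` and the index arguments are illustrative:

```julia
using Random

# Sketch of MergeShuffle's merge step (1-based indices; `rng` is any AbstractRNG).
# Merges two already-shuffled blocks v[lo:mid-1] and v[mid:hi] in place,
# producing a uniformly shuffled block v[lo:hi].
function merge_shuffled!(rng::AbstractRNG, v::AbstractVector, lo::Int, mid::Int, hi::Int)
    i, j = lo, mid
    while true
        if rand(rng, Bool)            # coin flip: take from the second block
            j > hi && break           # second block exhausted
            v[i], v[j] = v[j], v[i]
            j += 1
        else                          # take from the first block
            i == j && break           # first block exhausted
        end
        i += 1
    end
    # One block is exhausted: finish with Fisher–Yates insertions of the rest.
    while i <= hi
        k = rand(rng, lo:i)
        v[i], v[k] = v[k], v[i]
        i += 1
    end
    return v
end
```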
I created a Julia port of these functions, replacing the OpenMP pragmas (#pragma omp parallel for) with Julia’s Threads.@threads, hoping that Julia’s macro would be just as simple and effective as OpenMP’s.
Following the docs, I created an array of random number generators, using a different one in each thread.
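A minimal sketch of that per-thread RNG pattern (the names `rngs` and `parallel_fill!` are illustrative, not from my gist):

```julia
using Random
using Base.Threads

# One independent RNG per thread; each iteration looks up its own
# generator via threadid() so threads never share a MersenneTwister.
const rngs = [MersenneTwister() for _ in 1:nthreads()]

function parallel_fill!(v)
    @threads for i in eachindex(v)
        r = rngs[threadid()]
        v[i] = rand(r)
    end
    return v
end
```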
My code: mergeshuffle.jl
However, this version was slower than the sequential shuffle (and even slower than a naive sequential implementation)!
The results of shuffling a 10,000,000-element array using 16 threads on a powerful server:
nthreads = 16
0.453090 seconds (135.32 k allocations: 6.736 MiB)
1.117024 seconds (936.77 k allocations: 45.491 MiB, 0.95% gc time)
0.254580 seconds (61.56 k allocations: 3.357 MiB)
where the first line is a naive Fisher–Yates shuffle, the second corresponds to MergeShuffle (the C implementation is much faster than the serial Fisher–Yates), and the third line is Julia’s sequential but optimized shuffle!.
It should be noted that varying the number of threads did have a (relative) impact:
| nthreads | time (s) |
|---------:|---------:|
|        1 |    2.574 |
|        2 |    1.652 |
|        4 |    1.242 |
|        8 |    1.195 |
|       16 |    1.117 |
|       32 |    1.121 |
|       64 |    1.045 |
What can I do to improve the absolute performance of the parallel random permutation algorithm?
Additionally, is the RNG threading hack still necessary in version 1.5?
Try @btime from BenchmarkTools, which will run the code several times. It might be that you’re mostly measuring the first-call compile time, and the benchmark tools do a good job of producing more reliable stats.
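For example, with a hypothetical test vector `v`:

```julia
using Random, BenchmarkTools

v = rand(UInt32, 10_000_000)

# `$v` interpolates the vector so the global lookup isn't measured;
# @btime runs the expression many times and reports the minimum.
@btime shuffle!($v);
```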
Thanks for your comment.
Unfortunately, the problem in this experiment is not due to the “measurement device”.
You can see that my file has a using BenchmarkTools at the top, and initially I used @btime.
I switched to using @time because it runs much faster and the fluctuations are not significant.
I’d appreciate any ideas of how to get the code to run faster.
I profiled the routine, and it seems that most of the time is spent on indexing, which is reasonable.
My guess is that this is memory-bandwidth limited, so a sequential algorithm can be as fast as anything; the fact that it spends most of its time indexing suggests that’s true. Parallelism may be pointless here, or even harmful to the cache.
Thanks!
I had a typo when converting from C; the corrected line reads @views shuffle!(r, v[i:n]) (I’ve updated the gist).
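A minimal illustration of why the @views matters here: without it, v[i:n] allocates a copy, so shuffle! permutes the copy and leaves v untouched (the indices below are illustrative):

```julia
using Random

v = collect(1:10)
r = MersenneTwister(0)

shuffle!(r, v[4:10])          # shuffles a *copy*; v is unchanged
@assert v == collect(1:10)

@views shuffle!(r, v[4:10])   # shuffles v[4:10] in place through a view
@assert v[1:3] == [1, 2, 3]
@assert sort(v[4:10]) == collect(4:10)
```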
The run times have improved…
but it is still only about as fast as the sequential Fisher–Yates version when using 16 threads, and much slower than the built-in sequential shuffle!
Looking at the profiler output, it seems most of the time is spent in rand(::MersenneTwister, ::Rando... calls.
Any suggestions how to speed up the random bit generation?