I wanted to port a parallel in-place shuffle algorithm, MergeShuffle: paper, repo
It works by splitting the array into k subarrays and shuffling them in parallel, then merging neighboring subarrays (in a parallel, tree-like fashion) in a way that produces a uniformly random global permutation.
The C code is very straightforward, consisting of two routines: shuffle.c and merge.c (plus an optional optimized assembly version, merge.s).
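For reference, the core merge step can be sketched in Julia roughly as follows. This is my own 1-based transliteration of the idea in merge.c, not the repo's actual API; the name `merge_shuffled!` and the index arguments are illustrative:

```julia
using Random

# Sketch of MergeShuffle's merge step (1-based indices; `rng` is any AbstractRNG).
# Merges two already-shuffled blocks v[lo:mid-1] and v[mid:hi] in place,
# producing a uniformly shuffled block v[lo:hi].
function merge_shuffled!(rng::AbstractRNG, v::AbstractVector, lo::Int, mid::Int, hi::Int)
    i, j = lo, mid
    while true
        if rand(rng, Bool)            # coin flip: take from the second block
            j > hi && break           # second block exhausted
            v[i], v[j] = v[j], v[i]
            j += 1
        else                          # take from the first block
            i == j && break           # first block exhausted
        end
        i += 1
    end
    # One block is exhausted: finish with Fisher–Yates insertions of the rest.
    while i <= hi
        k = rand(rng, lo:i)
        v[i], v[k] = v[k], v[i]
        i += 1
    end
    return v
end
```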
I created a Julia port of these functions, replacing the OpenMP pragmas (#pragma omp parallel for) with Julia’s Threads.@threads, hoping that Julia’s macro would be just as simple and effective as OpenMP’s.
Following the docs, I created an array of random number generators, using a different one in each thread.
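A minimal sketch of that per-thread RNG pattern (the names `rngs` and `parallel_fill!` are illustrative, not from my gist):

```julia
using Random
using Base.Threads

# One independent RNG per thread; each iteration looks up its own
# generator via threadid() so threads never share a MersenneTwister.
const rngs = [MersenneTwister() for _ in 1:nthreads()]

function parallel_fill!(v)
    @threads for i in eachindex(v)
        r = rngs[threadid()]
        v[i] = rand(r)
    end
    return v
end
```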
My code: mergeshuffle.jl
However, this version was slower than the sequential shuffle (and even slower than a naive sequential implementation)!
The results of shuffling a 10,000,000-element array using 16 threads on a powerful server:
nthreads = 16
0.453090 seconds (135.32 k allocations: 6.736 MiB)
1.117024 seconds (936.77 k allocations: 45.491 MiB, 0.95% gc time)
0.254580 seconds (61.56 k allocations: 3.357 MiB)
where the first line is a naive Fisher–Yates shuffle, the second corresponds to MergeShuffle (the C implementation is much faster than the serial Fisher–Yates), and the third line is Julia’s sequential but optimized shuffle!.
It should be noted that varying the number of threads did have a (relative) impact:
| nthreads | time (s) |
|---------:|---------:|
|        1 |    2.574 |
|        2 |    1.652 |
|        4 |    1.242 |
|        8 |    1.195 |
|       16 |    1.117 |
|       32 |    1.121 |
|       64 |    1.045 |
What can I do to improve the absolute performance of the parallel random permutation algorithm?
Additionally, is the RNG threading hack still necessary in version 1.5?
Try @btime from BenchmarkTools, which will run the code several times. It might be that you’re mostly measuring the first-call compile time, and the benchmark tools do a good job of producing more reliable stats.
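For example, with a hypothetical test vector `v`:

```julia
using Random, BenchmarkTools

v = rand(UInt32, 10_000_000)

# `$v` interpolates the vector so the global lookup isn't measured;
# @btime runs the expression many times and reports the minimum.
@btime shuffle!($v);
```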
Thanks for your comment.
Unfortunately, the problem in this experiment is not due to the “measurement device”.
You can see that my file has a using BenchmarkTools at the top, and initially I used @btime.
I switched to using @time because it runs much faster and the fluctuations are not significant.
I’d appreciate any ideas of how to get the code to run faster.
I profiled the routine, and it seems that most of the time is spent on indexing, which is reasonable.
My guess is that this is memory-bandwidth limited, so a sequential algorithm can be as fast as anything; the fact that it spends most of its time indexing suggests that’s true. Parallelism may be pointless here, or even harmful to the cache.
Thanks!
I had a typo when converting from C; the corrected line reads @views shuffle!(r, v[i:n]) (I’ve updated the gist).
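A minimal illustration of why the @views matters here: without it, v[i:n] allocates a copy, so shuffle! permutes the copy and leaves v untouched (the indices below are illustrative):

```julia
using Random

v = collect(1:10)
r = MersenneTwister(0)

shuffle!(r, v[4:10])          # shuffles a *copy*; v is unchanged
@assert v == collect(1:10)

@views shuffle!(r, v[4:10])   # shuffles v[4:10] in place through a view
@assert v[1:3] == [1, 2, 3]
@assert sort(v[4:10]) == collect(4:10)
```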
The run times have improved…
but it is still only about as fast as the sequential Fisher–Yates version when using 16 threads, and much slower than the built-in sequential shuffle!
Looking at the profiler output, it seems most of the time is spent in rand(::MersenneTwister, ::Rando... calls.
Any suggestions how to speed up the random bit generation?