Multithreading shuffle

I wanted to port a parallel in-place shuffle algorithm, mergeshuffle:
paper, repo
it basically works by splitting the array into k subarrays, shuffling them in parallel. Followed by merging neighboring subarrays (in a parallel tree like fashion) in such a way that creates a global random permutation.
The C code is very straightforward, consisting of two routines:
shuffle.c, merge.c (optionally an optimized assembly version merge.s)

I’ve created a Julia port of these functions replacing the openmp pragmas (#pragma omp parallel for) with Julia’s Threads.@threads (hoping that Julia’s macro will be just as simple and effective as openmp’s).
Following the docs I created an array of random number generators using a different one in each thread.
My code: mergeshuffle.jl

However, this version was slower than the sequential shuffle! (and even from a naive sequential implementation)
The results of shuffling a 10,000,000 array using 16 threads on a powerful server:

nthreads = 16
  0.453090 seconds (135.32 k allocations: 6.736 MiB)
  1.117024 seconds (936.77 k allocations: 45.491 MiB, 0.95% gc time)
  0.254580 seconds (61.56 k allocations: 3.357 MiB)

where the first line is a naive Fisher-Yates shuffle, the second line corresponds to the mergeshuffle (the C implementation is much faster than the serial Fisher-Yates), and the 3rd line is Julia’s sequential but optimized shuffle!

It should be noted that varying the number of threads did have a (relative) impact:

nthreads time
1 2.574
2 1.652
4 1.242
8 1.195
16 1.117
32 1.121
64 1.045

What can I do to improve the absolute performance of the parallel random permutation algorithm?
Additionally, is the RNG threading hack still necessary in version 1.5?

instead of @time, use BenchmarkTools


Which will run the code several times. It might be that you’re mostly measuring the first-call compile time, and the benchmark tools does a good job of getting more reliable stats.

1 Like

Thanks for your comment
Unfortunately, the problem in this experiment is not due to the “measurement device”.
You can see that my file has a using BenchmarkTools at the top, and initially I used @btime.
I switched to using @time because it runs much faster and the fluctuations are not significant.
I’d appreciate any ideas of how to get the code to run faster.
I profiled the routine, and it seems that most of the time is dedicated to indexing which is reasonable.

My guess is that this is memory bandwidth limited and so a sequential algorithm can be as fast as anything. the fact that it spends most of its time indexing suggests that’s true. parallel may be pointless or even cache harmful here.

1 Like

I tried your code and glanced at the paper.
Looking more closely at the profiler output tells me that ~70% of time is spent on line 40

@views shuffle!(r, v[1:i])

which according to the paper should be negligible.


Have you tried this on v1.5? This may have been fixed by

**edit: for v1.4 vs. v1.5, I see the following results on my machine (with four threads):

julia> @time mergeshuffle!(PAR_RNG, v)
  1.274356 seconds (1.27 k allocations: 75.406 KiB) # v1.4
julia> @time mergeshuffle!(PAR_RNG, v)
  1.419680 seconds (485 allocations: 40.422 KiB) # v1.5

So the number of allocations is indeed reduced, but performance actually decreases in v1.5.

1 Like

I had a typo when converting from C, the corrected line reads:
@views shuffle!(r, v[i:n]) (I’ve updated the gist)
The run times have improved…
but it is almost as fast as the sequential Fisher Yates version when using 16 threads, and much slower than the builtin sequential shuffle!

looking at the profiler output it seems most of the time is spent on rand(::MersenneTwister, ::Rando... calls

Any suggestions how to speed up the random bit generation?

I think rand(Bool) would be a bit faster than rand(0:1), but it still uses atleast 32 random bits.


sounds promising.

1 Like