I wanted to port a parallel in-place shuffle algorithm, MergeShuffle.
It works by splitting the array into
k subarrays and shuffling them in parallel, then merging neighboring subarrays (in a parallel, tree-like fashion) in a way that yields a uniformly random permutation of the whole array.
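For reference, the merge step can be sketched in Julia roughly as follows. This is my own translation of the paper's pseudocode to 1-based indexing, not the C sources, and `merge_shuffled!` is a name I made up:

```julia
using Random

# Sketch of MergeShuffle's merge step (translated from the paper's
# pseudocode to 1-based indexing). Assuming a[lo:mid] and a[mid+1:hi]
# are each uniformly shuffled, the merged range a[lo:hi] ends up a
# uniformly random permutation of its contents.
function merge_shuffled!(a::AbstractVector, lo::Int, mid::Int, hi::Int,
                         rng::AbstractRNG = Random.default_rng())
    i, j = lo, mid + 1
    while true
        if rand(rng, Bool)       # heads: keep the next element of the left run
            i == j && break      # left run exhausted
        else                     # tails: pull in the next element of the right run
            j > hi && break      # right run exhausted
            a[i], a[j] = a[j], a[i]
            j += 1
        end
        i += 1
    end
    # Insert the leftover tail with Fisher-Yates-style swaps.
    while i <= hi
        k = rand(rng, lo:i)
        a[i], a[k] = a[k], a[i]
        i += 1
    end
    return a
end
```

The parallel version applies this pairwise to neighboring blocks, doubling the block size at each level of the tree.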
The C code is very straightforward, consisting of two routines:
shuffle.c and merge.c (plus an optional optimized assembly version, merge.s).
I’ve created a Julia port of these functions, replacing the OpenMP pragmas
(#pragma omp parallel for) with Julia’s
Threads.@threads (hoping that Julia’s macro would be just as simple and effective as OpenMP’s).
Following the docs, I created an array of random number generators, using a different one in each thread.
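Concretely, the pattern I used looks roughly like this (a minimal sketch; `parallel_rand!` is just an illustrative name):

```julia
using Random

# One RNG per thread, indexed by threadid(), so threads never share
# RNG state. (This was the documented pattern around Julia 1.x; note
# that on newer Julia versions @threads tasks may migrate between
# threads, which makes threadid()-based indexing unsafe there.)
const RNGS = [MersenneTwister() for _ in 1:Threads.nthreads()]

function parallel_rand!(a::AbstractVector{Float64})
    Threads.@threads for i in eachindex(a)
        a[i] = rand(RNGS[Threads.threadid()])
    end
    return a
end
```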
My code: mergeshuffle.jl
However, this version was slower than the sequential
shuffle! (and even slower than a naive sequential implementation).
The results of shuffling a 10,000,000-element array using 16 threads on a powerful server:

```
nthreads = 16
  0.453090 seconds (135.32 k allocations: 6.736 MiB)
  1.117024 seconds (936.77 k allocations: 45.491 MiB, 0.95% gc time)
  0.254580 seconds (61.56 k allocations: 3.357 MiB)
```
where the first line is a naive Fisher-Yates shuffle, the second is the mergeshuffle (whereas the C implementation is much faster than the serial Fisher-Yates), and the third is Julia’s sequential but optimized shuffle!.
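For context, the naive sequential Fisher-Yates baseline is essentially the following (a minimal sketch, not the exact code from my port):

```julia
using Random

# In-place Fisher-Yates shuffle: walk from the back, swapping each
# element with a uniformly chosen element at or before it.
function naive_shuffle!(a::AbstractVector, rng::AbstractRNG = Random.default_rng())
    for i in length(a):-1:2
        j = rand(rng, 1:i)
        a[i], a[j] = a[j], a[i]
    end
    return a
end
```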
It should be noted that varying the number of threads did have a (relative) impact.
What can I do to improve the absolute performance of the parallel random permutation algorithm?
Additionally, is the per-thread RNG workaround still necessary in Julia 1.5?