[ANN] ParallelUtilities: fast and easy parallel mapreduce for HPC clusters

Evaluating a parallel mapreduce on an HPC cluster can be as simple as changing mapreduce to pmapreduce. I originally developed this package for my own research, but it might be useful to the community, so I'm announcing it here. It is primarily aimed at mapreduce operations where the map is embarrassingly parallel.

Link to the package: GitHub - jishnub/ParallelUtilities.jl: Fast and easy parallel mapreduce on HPC clusters

This package uses a binary tree to reduce values evaluated on each node of a cluster. There are two levels of trees – one on each node and one across nodes. This way communication across nodes is minimized. This seems to have performance benefits:
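As a rough illustration, here is a hypothetical pure-Julia sketch of the two-level idea (this is not the package's actual implementation): values on each node are reduced pairwise along a binary tree, and the per-node results are then reduced across nodes.

```julia
# Hypothetical sketch of a two-level tree reduction (not the package internals).
# treereduce reduces a vector pairwise along a binary tree.
function treereduce(op, xs)
    length(xs) == 1 && return xs[1]
    mid = length(xs) ÷ 2
    op(treereduce(op, xs[1:mid]), treereduce(op, xs[mid+1:end]))
end

# First reduce the values local to each node, then reduce across nodes.
function twoleveltreereduce(op, values_per_node)
    treereduce(op, [treereduce(op, v) for v in values_per_node])
end
```

In the real package the inner reduction happens on the workers of one node and the outer reduction communicates only one value per node, which is what keeps inter-node communication low.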

Comparing with Julia’s distributed for loop (on 2 nodes with 28 cores each):

julia> @everywhere f(x) = ones(10_000, 1_000);

julia> A = @time @distributed (+) for i = 1:nworkers()
           f(i)
       end;
 22.637764 seconds (3.35 M allocations: 8.383 GiB, 16.50% gc time, 0.09% compilation time)

julia> B = @time pmapreduce(f, +, 1:nworkers());
  2.170926 seconds (20.47 k allocations: 77.117 MiB)

Comparing with other packages that perform a parallel mapreduce (on 28 processors on 1 node of a cluster):

julia> @everywhere f(x) = ones(10_000, 10_000);

julia> A = @time ParallelUtilities.pmapreduce(f, +, 1:nworkers());
 10.105696 seconds (14.03 k allocations: 763.511 MiB)

julia> B = @time ParallelMapReduce.pmapreduce(f, +, 1:nworkers(), algorithm = :reduction_local);
 30.955381 seconds (231.93 k allocations: 41.200 GiB, 7.63% gc time, 0.23% compilation time)

julia> C = @time Transducers.foldxd(+, 1:nworkers() |> Transducers.Map(f));
 30.154166 seconds (655.40 k allocations: 41.015 GiB, 8.65% gc time, 1.03% compilation time)

julia> A == B == C
true

This package is not at the same performance level as MPI though, which is presumably a lot more optimized. However, the advantage is that it is written purely in Julia, so any serial mapreduce operation may be easily shifted to parallel without restructuring the code.
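For instance, switching a serial mapreduce to the parallel version is a one-word change. The parallel call below is left commented out, as it assumes worker processes have been added and ParallelUtilities is loaded on all of them:

```julia
# Serial version, runnable anywhere:
result = mapreduce(x -> x^2, +, 1:1000)

# Parallel version (assumes addprocs has been called and
# @everywhere using ParallelUtilities has been run):
# result = pmapreduce(x -> x^2, +, 1:1000)
```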

This works in several modes. Firstly, in an entirely distributed multiprocessing mode, each process does its own thing followed by an eventual reduction. Secondly, using SharedArrays, workers on each node populate a SharedArray followed by an eventual reduction. Thirdly, coupled with multithreading, workers on each node use threads to obtain a result, which is subsequently reduced across nodes using multiprocessing. There are examples of usage in the documentation, which for some reason is not serving the latest version, but hopefully that will be sorted out soon.
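To illustrate the node-local step of the third mode, here is a hypothetical pure-Julia sketch (not the package's implementation): each thread maps and reduces one chunk of the collection, and the per-thread partial results are then combined. In the package, results obtained this way on each node would subsequently be reduced across nodes using multiprocessing.

```julia
using Base.Threads

# Hypothetical sketch of a node-local threaded mapreduce (not the package internals).
# Each thread maps and reduces one chunk; the partial results are then combined.
# Assumes op is associative.
function threadedmapreduce(f, op, xs)
    chunks = collect(Iterators.partition(xs, cld(length(xs), nthreads())))
    partials = Vector{Any}(undef, length(chunks))
    @threads for i in eachindex(chunks)
        partials[i] = mapreduce(f, op, chunks[i])
    end
    reduce(op, partials)
end
```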