Evaluating a parallel mapreduce on an HPC cluster can be as simple as changing `mapreduce` to `pmapreduce`. This is a package that I originally developed for my own research, but it might be useful to the community, so I'm announcing it here. It is aimed primarily at mapreduce operations where the map is embarrassingly parallel.
This package uses a binary tree to reduce the values evaluated on each node of a cluster. There are two levels of trees: one within each node and one across nodes. This minimizes communication across nodes, and appears to have performance benefits:
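To illustrate the idea (this is a serial sketch of the reduction strategy, not the package's actual implementation), the two-level tree can be simulated in plain Julia: each "node" first reduces its own values pairwise, and the per-node results are then reduced pairwise across nodes.

```julia
# Pairwise (binary-tree) reduction of a collection. In the package, the two
# halves would be reduced on separate workers rather than recursively here.
function treereduce(op, xs)
    n = length(xs)
    n == 1 && return xs[1]
    mid = n ÷ 2
    op(treereduce(op, xs[1:mid]), treereduce(op, xs[mid+1:end]))
end

# Two-level version: an intra-node tree on each group of values,
# followed by an inter-node tree over the per-node results.
function twoleveltreereduce(op, groups)
    pernode = [treereduce(op, g) for g in groups]  # one result per "node"
    treereduce(op, pernode)                        # combine across "nodes"
end

twoleveltreereduce(+, [[1, 2, 3], [4, 5], [6]])  # == sum(1:6) == 21
```

Only the per-node results cross the (simulated) node boundary, which is why the inter-node traffic stays small.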
Comparing with Julia's distributed for loop (on 2 nodes with 28 cores each):

```julia
julia> @everywhere f(x) = ones(10_000, 1_000);

julia> A = @time @distributed (+) for i = 1:nworkers()
           f(i)
       end;
 22.637764 seconds (3.35 M allocations: 8.383 GiB, 16.50% gc time, 0.09% compilation time)

julia> B = @time pmapreduce(f, +, 1:nworkers());
  2.170926 seconds (20.47 k allocations: 77.117 MiB)
```
Comparing with other packages that perform a parallel mapreduce (on 28 processors on 1 node of a cluster):

```julia
julia> @everywhere f(x) = ones(10_000, 10_000);

julia> A = @time ParallelUtilities.pmapreduce(f, +, 1:nworkers());
 10.105696 seconds (14.03 k allocations: 763.511 MiB)

julia> B = @time ParallelMapReduce.pmapreduce(f, +, 1:nworkers(), algorithm = :reduction_local);
 30.955381 seconds (231.93 k allocations: 41.200 GiB, 7.63% gc time, 0.23% compilation time)

julia> C = @time Transducers.foldxd(+, 1:nworkers() |> Transducers.Map(f));
 30.154166 seconds (655.40 k allocations: 41.015 GiB, 8.65% gc time, 1.03% compilation time)

julia> A == B == C
true
```
This package does not reach the performance level of MPI, which is presumably far more optimized. The advantage, however, is that it is pure Julia, so any serial `mapreduce` operation may easily be shifted to parallel without restructuring the code.
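For concreteness, here is a minimal sketch of that swap. It assumes workers have been added and ParallelUtilities is loaded on all of them; the worker count and the mapped function are arbitrary choices for illustration.

```julia
using Distributed
addprocs(2)  # local workers here; on a cluster these would span nodes
@everywhere using ParallelUtilities
@everywhere f(x) = x^2

s = mapreduce(f, +, 1:100)   # serial version
p = pmapreduce(f, +, 1:100)  # parallel version: only the function name changes
s == p                       # the two results agree
```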
This works in several modes:

1. an entirely distributed multiprocessing mode, where each process does its own thing, followed by an eventual reduction;
2. using `SharedArray`s, where workers on each node populate a `SharedArray`, followed by an eventual reduction;
3. coupled with multithreading, where workers on each node use threads to obtain a result, which is subsequently reduced across nodes using multiprocessing.

There are examples of usage in the documentation, which for some reason isn't showing the latest version, but hopefully that'll be sorted out soon.