[WIP] MPISort.jl - Distributed MPI Sorting Algorithms. Suggestions welcome!

Hi all,

It was with great horror that I discovered how non-trivial sorting N elements across P clusters without any one cluster being able to hold all elements at once was. Or maybe I’m just thick. Anyways, I only found two such open-source algorithms written in not-so-friendly C++ and Charm++ and surprisingly few papers on the subject.

So here’s MPISort.jl, a Julia package offering mpisort! as an interface for such distributed sorting methods.

One algorithm is included at the moment: sampling with interpolated histograms, or SIHSort (pronounced sigh sort, like anything MPI-related), optimised for minimum inter-rank communication and memory footprint. Features:

  • Does not require that distributed data fits into the memory of a single node. No IO either.
  • Works for any comparison-based data, with additional optimisations for numeric elements.
  • Optimised for minimum MPI communication; can use Julia threads on each shared-memory node.
  • The node-local arrays may have different sizes; sorting will try to balance number of elements held by each MPI rank.
  • Works with any AbstractVector, including accelerators such as GPUs (this needs further testing). Julia type-inference and optimisations do wonders.
  • Implements the standard Julia sort! API, and naturally works for custom data, comparisons, orderings, etc.

I really want to thank the MPI.jl authors for making MPI such a joy to use in Julia; here’s to hopefully more great work in the Julia distributed ecosystem!

I am looking for any feedback on the library design, code, algorithm before I register the package. All suggestions are highly appreciated.

Best wishes,
Leonard

7 Likes

Is this actually faster than a disk based sort? I would think that a disk based merge sort would pretty much always be faster since disks tend to have higher speed than networks, and you are going to be bottle-necked by IO speed.

1 Like

The problem SIHSort solves is slightly different: I want to have all my data in memory, spread across 10-1000s of nodes in a supercomputing clusters. The initial driver for this was a large particle simulation that can be split on 100 clusters, but it can’t all fit on a single one; in this case, sorting redistributes particles across nodes - this needs to be done relatively often, and so constant saving to disk is not desirable. Also, supercomputers almost always have extremely fast interconnects with MPI implementations optimised for their hardware; I’m not sure if IO will be faster in that case.

That said, out-of-core sorting on a single machine was definitely considered as a future addition to MPISort.jl, where a lot of on-disk data must be sorted on a single computer with not enough RAM to hold all of it at once - hence the general interface mpisort! that may employ different algorithms.

4 Likes