Request for comments: a roadmap to faster sort() for primitive types using RadixSort

At the very least it would be useful (maybe as a first step) to improve RadixSort in SortingAlgorithms to make it faster (if you say that’s possible) and extensible (currently there’s no mechanism to declare whether a type supports it or not). Unfortunately, SortingAlgorithms doesn’t get a lot of attention, so it’s not always easy to get a PR reviewed (like mine about radix sort).

data.table in R is famous for its split-apply-combine performance, which is based on radix sort, so apparently sorting performance does matter in some areas. That said, I’ve not been able to get a speedup in DataFrames by using radix sort for grouping, but I’m not sure whether that’s because our radix sort isn’t fast enough yet or because our other grouping methods are already fast. :slight_smile:

7 Likes

Nice.

What do you have in mind (with “the standard sort order”)? You mean ascending, codepoint/ASCIIbetical order? I have my own idea to fix that: store an (8-byte) prefix of each string, modified to be suitable for better ordering. It should be compatible with radix-based sorting (but yes, you only get one such standard order, from a global locale setting; other orders are mostly not needed, except for the corresponding descending order, which shouldn’t be a problem either).

It doesn’t have to be; you can use an in-place radix sort or some variant of it, e.g. this Ska Sort variant I just discovered:

This is of course a version of radix sort, meaning its complexity is lower than O(n log n). I made two contributions:

  1. I optimized the inner loop of in-place radix sort. I started off with the Wikipedia implementation of American Flag Sort and made some non-obvious improvements. This makes radix sort much faster than std::sort, even for relatively small collections (starting at 128 elements).

  2. I generalized in-place radix sort to work on arbitrarily sized ints, floats, tuples, structs, vectors, arrays, strings, etc. I can sort anything that is reachable with random access operators like operator[] or std::get. If you have custom structs, you just have to provide a function that can extract the key that you want to sort on. This is a trivial function which is less complicated than the comparison operator that you would have to write for std::sort.

[…]
My other best case is on already sorted sequences: In that case I iterate over the data exactly twice, once to look at each item, and once to swap each item with itself.

The worst case for my implementation can only be reached when sorting variable sized data, like strings.
[…]
In summary, Ska Sort:

  • Is an in-place radix sort algorithm
  • Sorts one byte at a time (into 256 partitions)
  • Falls back to std::sort if a collection contains fewer than some threshold of items (currently 128)
  • Uses the inner loop of American Flag Sort if a collection contains fewer than a larger threshold of items (currently 1024)
    […]
  • Automatically converts signed integers, floats and char types to the correct unsigned integer type
  • Automatically deals with pairs, tuples, strings, vectors and arrays by sorting one element at a time
  • Skips over common prefixes of collections (for example, when sorting strings)
  • Provides two customization points to extract the sort key from an object: A function object that can be passed to the algorithm, or a function called to_radix_sort_key() that can be placed in the namespace of your type

So Ska Sort is a complicated algorithm. Certainly more complicated than a simple quick sort. One of the reasons for this is that in Ska Sort, I have a lot more information about the types that I’m sorting. In comparison based sorting algorithms all I have is a comparison function that returns a bool. In Ska Sort I can know that “for this collection, I first have to sort on a boolean, then on a float” and I can write custom code for both of those cases.
[…]
PROBLEMS
Ska_sort isn’t perfect and it has problems. I do believe that it will be faster than std::sort for nearly all data, and it should almost always be preferred over std::sort.
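
As a concrete illustration of the signed/float-to-unsigned key conversion the quoted list mentions, here is the classic order-preserving mapping sketched in Julia (the standard bit trick, not Ska Sort’s actual C++ code; NaNs would need separate handling):

unsigned_key(x::Signed) = unsigned(xor(x, typemin(x)))  # flip the sign bit

function unsigned_key(x::Float64)
    u = reinterpret(UInt64, x)
    # negative floats: flip all bits; positive floats: flip only the sign bit,
    # so unsigned comparison of the keys matches the usual float order
    u ⊻ ifelse(signbit(x), typemax(UInt64), UInt64(1) << 63)
end

@assert unsigned_key(Int8(-3)) < unsigned_key(Int8(5))
@assert unsigned_key(-2.0) < unsigned_key(-1.0) < unsigned_key(0.0) < unsigned_key(1.5)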

The code has a good license and is a bit over 1400 lines: GitHub - skarupke/ska_sort

See also the performance graphs for the above.

Others to consider:

There are two codebases I consider important to be aware of, both MSD radix sorts (I’m in the LSD camp), but both critically relevant. […]
For smaller inputs (and particularly non-uniform inputs) I start coming out ahead (and will be getting even better). When I say I run faster, I mean A LOT faster.

Another candidate (“Ultra fast MSD radix sorter”): at over 14,000 lines, I’m not sure we want such a large one in Base (and it’s GPLed): https://github.com/refresh-bio/RADULS/blob/master/Raduls/sorting_network.cpp

1 Like

! signals mutation, not in-place-ness. sort is guaranteed not to modify its input. sort! is guaranteed to produce a sorted array in the original input vector. Neither is guaranteed to run in place (and indeed @btime sort!(v) setup = (v = rand(["hello","world"], 1000)) uses merge sort, which is not in place). Radix sort would behave similarly to merge sort.
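
To illustrate the convention (buffer reuse is an implementation detail, so this sketch only shows the documented behavior):

v = [3, 1, 2]
w = sort(v)   # returns a new sorted array; v is unchanged
sort!(v)      # v itself ends up sorted, though scratch buffers may still be allocated internally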

Perhaps that is a good upgrade. The algorithm I’m currently using is only 33 sloc

and gets pretty good performance improvements. If this works out, we could probably replace it with a longer, better-optimized algorithm, but for now I’m trying to scope this with “doable” and “good” as the primary objectives, rather than “best” or “optimal”.

I’ll stay out of this discussion for now, but while I’m drafting I’ve made a PR to SortingAlgorithms.jl because their CI tests run faster, and for me developing on Julia Base is a bit of an inconvenience (Revise works 95% of the time, but if and when I need to rebuild Julia, it takes upwards of an hour on my computer). While this PR is nominally targeted at SortingAlgorithms.jl, I plan to stay within the SemVer guarantees of Julia and the existing API, so it should be easy to put it in either place.

3 Likes

While this is a total rewrite, it is heavily inspired by SortingAlgorithms.jl’s.

For folks familiar with internals: if views were as fast to work with as manually passing around lo and hi (there’s an issue about this), the whole sorting system could feel a bit more idiomatic and nicer to work with.

1 Like

Google Scholar for a scholarly bibliography and reading list on fast radix sorting.
The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition, by Donald E. Knuth, for a very detailed discussion of radix sorting.

1 Like

Thank you @iajzenszmi! The particular niche I’m trying to fill with this PR now is in-memory sorting (i.e. sorting datasets that fit into RAM). I found the external radix sorting section (5.4.7, page 343), but couldn’t find a section on in-memory radix sorting. Could you point it out to me, please?

EDIT: found it: 5.2.5 on page 168, “sorting by distribution”.

According to my reading, Knuth presents (1) the classic LSD radix algorithm I’ve implemented in the linked PR, (2) an option to combine distribution with counting the next pass’s bucket sizes, and (3) a queue implementation to replace counting altogether.

When deciding what to implement, I chose (1) because I suspected that (2) and (3) yield no more than small constant-factor improvements, both make the code longer and less intuitive, and I suspect, but have not tested, that (2) and perhaps (3) increase the minimum N for which radix sort is faster than quicksort or mergesort.

In my reading, Knuth presented a simple algorithm and a more complex algorithm with the same theoretical runtime, and asserted without empirical evidence that the complex algorithm was faster. Do you happen to know of any real-world comparisons of these two or three methods? If significant performance improvements are available, it may be worth adopting a more complex method.

Alternatively, feel free to implement one of those methods and benchmark it against the one I’m proposing; if it’s faster, it would probably make sense to use it instead!

In theory, it should be easy to swap out the underlying radix sort algorithm while keeping all the same plumbing to make it compatible with Julia’s comparison-based paradigm.

In general, I tend not to trust Knuth on performance claims. They were usually right at the time he made them, but most of them come from experience with computers from the ’80s, which have a very different cost model from modern ones.

1 Like

Long ago, I implemented a sophisticated radix sort for (ASCII) strings and primitive types in Java. I will sketch the ideas I used to improve the basic radix sort algorithm; maybe you can adopt some of them.

(1) Serialization:

Instead of explicitly computing a serialization to some AbstractVector{UIntN}, I prefer a function sortbucket(element, pass)::UIntN which computes the value of the serialization vector for element at index pass. For primitive types and ASCII strings, this is easily done, mostly using AND and SHIFT operations. For UInt64 elements and UIntN == UInt8, you could use, for example,

sortbucket(element, pass) = (element >>> ((8 - pass) * 8)) & 255

For each (element, pass) pair, the basic radix sort will call sortbucket(element, pass) zero or one times. Zero calls happen when a bucket is down to a single element: for it, all remaining positions > pass are skipped. This can add up to a significant percentage.

If you explicitly construct the serialization, you call sortbucket(element, pass) once for every defined pass value of element. And for each sortbucket(element, pass) you have the additional effort of storing the result and reading it back in a radix sort pass.

I did not benchmark it, but in Julia and for primitive types, I expect on-the-fly computation with sortbucket to come close to an array access or maybe even run faster, because sortbucket compiles down to only a few primitive operations on CPU registers. When sorting primitive types, I expect significant overall runtime improvements (>10%) when a sortbucket function is used without storing the serialization.

There are exceptions where an explicit serialization is better: cases where sortbucket(element, pass) needs state information from sortbucket(element, pass-1). This is not relevant for primitive types, but it is for string sorting, if you need to implement the ordering rules of a natural language. For German, as an example, the character “ö” has to be considered equal to the sequence “oe”, according to DIN 5007 variant 2. A serialization will replace non-ASCII characters like “ö” by their ASCII replacement string like “oe”.

(2) a priori min and max limits

Your suggestion is to use the serialization pass also to compute min and max values for sortbucket(., pass). My recommendation: instead of explicitly computing min and max for sortbucket(., pass), use a priori limits amin and amax derived from the sortbucket implementation. In my example, we have amin == 0 and amax == 255.

In particular for small sortbucket ranges like byte ranges, it does not pay off to compute an exact min and max: the only gain from an exact min and max is that we can make the counting array in the radix pass a little bit smaller, saving some memory and its initialization to zero.

(3) selection of UIntN dependent on the number of elements to sort

Assume we have to sort N elements of type UInt64. The serialization into bytes as shown above will result in (up to) 8 radix sort passes, using small count arrays of size 256. If we serialize into UInt16, we arrive at only 4 radix sort passes, but the count array will have 2^16 elements. For huge N, the reduction in radix sort passes will overcompensate for the initialization of the count arrays, and the memory consumption of the count arrays will be negligible compared to the memory needed to store the elements to sort.

For finer tuning, consider having sortbucket extract bit sequences instead of byte sequences. We could serialize UInt64, for example, into 5 buckets of 13 bits. For N near 2^13, I expect 5 passes with count arrays of size 2^13 to be faster than 4 passes with count arrays of size 2^16, or 8 passes with count arrays of size 2^8. For primitive integer types, it is quite easy to implement a function sortbucket(element, pass, bits) having amin == 0, amax == (1 << bits) - 1.
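
One possible implementation of that signature for UInt64, assuming pass counts from 1 at the most significant end (the exact conventions are up to the implementer):

function sortbucket(element::UInt64, pass::Int, bits::Int)
    totalbits = 8 * sizeof(element)
    # clamp the shift so a final partial chunk re-reads a few low bits,
    # which is redundant but harmless for the ordering
    shift = max(totalbits - pass * bits, 0)
    (element >>> shift) & ((UInt64(1) << bits) - UInt64(1))
end

sortbucket(0xffff000000000000, 1, 13)  # 0x1fff: the 13 most significant bits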

With some benchmarks, a formula for a near-optimal value of bits can be derived, at least for primitive types. Using your concept of serialization into AbstractVector{UIntN}, my suggestion is to choose the type UIntN depending on the number of elements to sort.

(4) estimate min, max and sortbucket distribution with a random sample

If amax - amin is expected to be much larger than max - min, consider using estimates emin and emax computed on a small random sample of the elements. We have amin <= min <= emin <= emax <= max <= amax. In the radix sort pass, use the projection of sortbucket(element, pass) onto the interval emin:emax (values smaller than emin become emin, values larger than emax become emax), and record the min and max values in this (sub)pass. If min < emin, we have to add an additional pass for the sortbucket values projected onto emin, but for this pass we have the exact min and max values. The same applies for the case max > emax.

Instead of using the sample min and max values, it is often better to choose emin and emax such that “all values but some outliers” are covered. When sorting primitive types, I used such sampling to determine the relevant value range. It is not uncommon that an Int64 array of values to sort turns out to have 99% of its values in an interval of a size which can be efficiently radix sorted in one or two passes.
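
A minimal sketch of that sampling step in Julia (the function name, sample size, and trim fraction are illustrative, not taken from the original Java code):

using Random

# Estimate emin/emax covering "all values but some outliers" from a random sample.
function sample_range(v::AbstractVector{<:Integer}; samples = 1024, trim = 0.01)
    s = sort!(rand(v, min(samples, length(v))))
    lo = s[max(1, floor(Int, trim * length(s)))]
    hi = s[min(length(s), ceil(Int, (1 - trim) * length(s)))]
    lo, hi  # values outside lo:hi get clamped and handled in an extra pass
end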

1 Like

@rryi, Thank you! I appreciate your suggestions and feedback! It will take me a bit to fully take in what you are sharing here, but here’s my initial reaction:

This is what SortingAlgorithms.jl does now. They call their function uint_mapping.

How do we detect this shortcut without substantial overhead in the case where it is not taken?

The way I’m doing it serializes the full input array in a single pass. It leaves UInts alone, but futzes with the bits in Ints via serialize(::ForwardOrdering, x::Signed) = unsigned(xor(x, typemin(x))) so that if x < y, we’ll have serialize(x) < serialize(y). We then discard the original array! (or perform serialization in place) and sort the serialized UInts. This sorting does use a (manually inlined) version of sortbucket, implemented as you suggest as (x >> shift)&mask. The net result is the same number of bit-shifts and bitwise ANDs as the sortbucket or uint_mapping method, but less futzing with bits to put signed integers into UInt ordering.

This method then requires a final deserialization pass to convert the sorted UInts back to their original type and value.
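
For instance, the inverse of the serialize method quoted above could look like this (a sketch; the actual deserialization code in the PR may differ):

deserialize(::Type{T}, u::Unsigned) where {T<:Signed} = xor(signed(u), typemin(T))

@assert deserialize(Int8, unsigned(xor(Int8(-3), typemin(Int8)))) === Int8(-3)  # round trip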

Because we only sort the UInts, not both the UInts and the originals, there is no additional array access to compete with.

A potentially avoidable weakness of serialization as I’m doing it is that I serialize to a fixed-width UInt, which precludes the possibility of sorting unbounded-width data. Perhaps there is a clever way around this.

Not quite; again, there is only a single serialization pass (potentially a compression pass), then multiple sorting passes, and a single deserialization. My suggestion is to use the serialization pass to determine the minimum and maximum of the entire dataset, with the goal of figuring out how many bits are actually used (e.g. folks only using 25 bits of their Int64 most of the time).
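
A sketch of how min/max could fall out of the serialization pass (illustrative names, not the PR’s code; assumes a nonempty input):

function serialize_and_scan!(out::Vector{UInt64}, v::Vector{Int64})
    lo = hi = unsigned(xor(first(v), typemin(Int64)))
    for i in eachindex(v)
        u = unsigned(xor(v[i], typemin(Int64)))  # order-preserving map to UInt64
        out[i] = u
        lo = min(lo, u)
        hi = max(hi, u)
    end
    # bits actually needed once lo is subtracted out in a compression pass
    lo, hi, 64 - leading_zeros(hi - lo)
end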

I agree! My current approach performs a single serialization pass and then reports results to the heuristic which may or may not defer to counting sort. A more specific specialization would be to skip the serialization and heuristic entirely for large byte arrays and all boolean or bit arrays.

SortingAlgorithms.jl currently uses 11-bit chunks, and I use a heuristic which takes the total used bit width of the input and N as inputs, and estimates the optimal chunk size.

In part because they are tucked away in the serialization pass, I suspect the marginal cost of computing exact minimum and maximum values is very low. Further, it avoids out-of-bounds checking in subsequent sorting passes.

I’m worried this will take longer than guaranteeing we get the right answer the first time.

Hmm, I like this. For example, if we’re sorting [1,2,3,3,3,7,22,17,typemax(Int)], we should use a single 5-bit pass with an outlier. I’d probably need to do benchmarking with real-world data to be convinced this is a substantial savings, though, because meaningfully reducing the chunk size or number of passes requires shrinking the bit width by a decent amount (e.g. 4-8 bits), which is a factor of 16-256 in value range. I suspect it is rare to have a dataset as lopsided as [1,2,3,3,3,7,22,17,2_000], and even for that level of lopsidedness, I’m not sure the special case would be worth it. It is certainly promising, though!

I wish someone created a syntax for non-allocating functions, which is what’s really important for performance.
Something like FunName!!(), maybe?
Then I’d say it would be great to make all Base functions have a version which accepts pre-allocated buffers from outside.

Remark: this takes the discussion on a detour, but it makes sense as a solution if we use an allocating sort algorithm and want to be able to supply the allocated buffers from outside.

2 Likes

I think this is a good idea, and created a new thread for it: How to support passing temporary buffers around

Maybe I misunderstood your question; to me, it looks mostly trivial, with no additional effort.
A radix pass starts with counting the buckets. We initialize an array with index range min:max to zero and increment the counts like this (pseudocode):

for i in min:max
  count[i] = 0
end
for i in firstElementIndex:lastElementIndex
  count[sortbucket(element[perm[i]],pass)] += 1
end

perm is the current index permutation, which is changed to finally become the sequence of sorted elements.

We then have to work down the buckets:

for b in min:max
  c = count[b] # c is the size for the next pass
  if c < limitForRadixSort
    if c < 2
      # this is the case in question: we are done, no further pass necessary, bucket elements sorted
    else
      # we use another sorting algorithm for small sets, because radix sort is inefficient
      # if max-min for the next pass is much larger than the number of elements
    end
  else
    # we have to radix sort the elements in the bucket with the next pass
  end
end

From my point of view, we definitely have to check each bucket count and decide how to sub-sort the elements belonging to that bucket.

The way I’m doing it serializes the full input array in a single pass. It leaves UInts alone, but futzes with the bits in Ints via serialize(::ForwardOrdering, x::Signed) = unsigned(xor(x, typemin(x))) so that if x < y, we’ll have serialize(x) < serialize(y).

That xor is cheap (typically 1 CPU cycle). For primitive types, the expensive part of a sortbucket(element, pass) call is the array access to read element from its source array. All operations within sortbucket are register operations taking very few CPU cycles, and of course I assume sortbucket(,) gets inlined. My code recommendation is something like

sortbucket(e::T, pass) where T <: Signed = sortbucket(unsigned(xor(e, typemin(T))), pass)

We then discard the original array! (or perform serialization in place) and sort the serialized UInts

I prefer sorting an index permutation of the data. In this case, you cannot discard the original array: the sort algorithm gets only a reference to the original data and does not own it; usually, the original data is needed later on in the calling application.

This method then requires a final deserialization pass to convert the sorted UInts back to their original type and value.

Hmm. So you add two full passes, one for serialization and one for deserialization? If we need only 1 or 2 radix sort passes, I think “on-the-fly serialization” is much faster.

Another point: we do not really need a serialization for radix sort which can be deserialized. We need a bit sequence which has the same ordering property as the original data. Consider my example of sorting German strings, where you replace “ö” by “oe” in the serialization. You will not arrive at the original data if you “deserialize” by replacing all “oe” by “ö”. To allow correct deserialization, you must include some hint about which occurrences of “oe” have to be replaced by “ö” on deserialization, and that adds a lot of complexity.

Sorry, my misunderstanding of your AbstractArray{UIntN}. I thought you meant using one of Vector{UInt8}, Vector{UInt16}, … as the concrete type. That would require an additional array access.

Now I understand that you are thinking of a type Serialization{UIntN,T} <: AbstractArray{UIntN}, one serialization type per data type of the original data to sort, with a different memory layout for the serialization of each data type: for primitive Signed you use the corresponding Unsigned as backing storage, and access “array elements” with shift/and operations like in my sortbucket(…) example. Then you indeed have no additional array access, just some shift/and operations extracting the bits from the serialization instance. But you still have two more array accesses per element: you must read each element and write its serialization UInt into a Vector{Serialization{UIntN,T}}, due to your extra serialize pass.

And why do you implement all those AbstractArray types? Is your serialization data type also used outside the radix sort? For radix sort, a method like sortbucket(…) is enough; you do not need the full AbstractArray API, and a full implementation of the AbstractArray interface is quite a lot of work.

And my suggestion is to avoid any additional pass, using a priori limits instead of computing min and max.

If N is the number of elements to sort, and M is the number of possible values, i.e. max-min+1, a radix pass has computing complexity O(N) + O(M). O(N) covers the counting pass and the following bucket sort pass, both basically a loop over N elements. O(M) covers the initialization of the counting array and the loop checking the buckets. If N is huge compared to M, accept the M built from a priori limits, without any additional effort. An additional pass of complexity O(N) which reduces M even to a small fraction does not pay off. If you have to sort 2^30 numbers of type Int16, just go with min == typemin(Int16) and max == typemax(Int16) and one radix pass.
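
For example, the Int16 case could look like this one-pass counting sort over the full a priori range (a sketch: M = 2^16 counters, no min/max scan):

function counting_sort(v::Vector{Int16})
    counts = zeros(Int, 2^16)  # one counter per possible value
    for x in v
        counts[Int(reinterpret(UInt16, xor(x, typemin(Int16)))) + 1] += 1
    end
    out = similar(v)
    j = 1
    for b in 0:2^16-1  # walk the buckets in sorted order
        val = xor(reinterpret(Int16, UInt16(b)), typemin(Int16))
        for _ in 1:counts[b+1]
            out[j] = val
            j += 1
        end
    end
    out
end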

If M is huge compared to N, we have a radix sort problem. The original radix sort algorithm is less efficient than classical sort algorithms if M is huge compared to N*log(N). This happens in two different cases:

(a) small N.
Solution: switch to a simple, classical sort procedure, maybe even bubblesort.

(b) huge M, but still large N
This is the case with string sort or BigInt sort (a priori, M is infinite), but also when sorting Int128 or Int64 on today’s IT architectures with a few TBytes of RAM.

Solution 1: your min/max determination. You always do it (compute min/max), but it adds one pass of complexity O(N). You will have a substantial gain if the application programmer chose a memory-wasting format. And I am pretty sure there are many developers simply using Vector{Int} without thinking about whether a Vector{Int16} would do.

For such cases, you could add two parameters to your external sort method, so that the user can supply application-specific tight apriori min and max values.

Solution 2: multiple radix passes.

We agree on using the n most significant bits and on choosing n depending on N. Using the bit width (M in my terminology) as an additional input is plausible if you use the same bit count for all passes and want to avoid a final pass with very few bits: using 7-bit chunks for UInt64 would lead to 10 passes, 8-bit chunks to only 8 passes. However, I use different chunk sizes in different subpasses, so this only applies if the first pass uses most but not all bits, leaving too few bits for pass 2.

My basic heuristic is: choose n as WEIGHT*log2(N), with WEIGHT a parameter optimized by benchmarks. Rationale: the resulting radix pass has complexity O(N) + O(WEIGHT*2^n) = O(N) + O(2^log2(N)) = O(N) + O(N) = O(N).
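
As code, that heuristic is just a one-liner (WEIGHT being the benchmark-tuned parameter):

chunk_bits(N; WEIGHT = 1.0) = max(1, round(Int, WEIGHT * log2(N)))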

As a result of this pass, we have partitioned our data into 2^n buckets; each bucket collects all elements with n identical most significant bits. Each bucket is treated as a new sorting task.

One extreme case is that almost all elements land in one bucket, and this continues in the next passes. At first glance, it looks like a wasted pass, and it may happen if you do not calculate min and max. Having quicksort in mind, one might think it is the worst case, like always choosing min or max as the pivot element in quicksort, where performance degenerates to O(N*N). But this is not true; on the contrary, if most buckets have 0 or 1 elements, this is a lucky case, because most subpasses are trivial, and the remaining big bucket has complexity O(N), according to my heuristic. We arrive at O(log(M)*N) total complexity if all elements are equal and thus in one bucket.

The other extreme: the elements are uniformly spread over the buckets, and each bucket has about N >> n (i.e. N/2^n) elements. We have 2^n subproblems to solve.

If we use the same number of bits in the 2nd pass, the effort explodes: each bucket needs O(N) effort, due to the high bit count. Total effort is O(N)*2^n = O(N)*WEIGHT*2^log2(N) = O(N)*WEIGHT*N = O(N*N), and sorting is not done, only pass 2.

Consequence: in subsequent passes, we should use a smaller number of bits, applying the heuristic again.

This weakness applies here as well. My suggestion: use a method signature like sortbucket(element, firstbit, bitcount) instead of sortbucket(element, pass) or serialization[pass]. This variant is also efficiently implementable with SHIFT and AND, and it gives you full control over how many bits you use per pass; you can adjust the bit count easily with your heuristic.

Maybe. The larger the sample, the higher the probability that the number of elements outside emin:emax is very small and can be efficiently sorted by a classical sort algorithm. However, if it is > 1, we have an extra sort task to solve. The number of comparisons to check for min/max is nearly identical: once for every element in the data to sort, plus a very small fraction of that for the sample processing.

I think using a sample to identify an interval containing almost all elements is more promising. I used it for string sorting on UTF-16-coded strings (the Java string format): the sample tells me whether I have mostly ASCII with/without end-of-line codes, or a substantial share of codes in the upper Unicode segments. In some tests with up to 2^20 strings in European languages and a radix pass per character position, it turned out to be faster.

1 Like

I believe it is I who made the original misunderstanding. When you said “if a pass has only one element to sort”, I thought you meant every element was identical, but I now think you meant that every element falls into the same bucket. Because of the compression passes, I think we avoid these wasteful passes much sooner and don’t even perform a count on them. The only case I can think of where this would come up is sorting something like [101000000, 11000000, 011000000] or [10100000000001, 11100000000010, 100000000101] (bit patterns where a chunk of bits is identical across all elements). I doubt this is a common enough case to add that check for, but if someone can provide a real-world dataset with a measurable performance impact, I’d have to think harder about it.

I’ve implemented LSD radix sort (i.e. first sort by the least significant chunk, then by the second least significant, then the third, and so on) so that we do not need O(n) function calls, only O(log(n)). The first pass sorts the entire array as one; the second does the same, but looks at slightly more significant bits.

counts = Vector{BinType}(undef, 2^chunk_size+1)  # bucket counts, with one slot of headroom for the offset trick below
mask = 2^chunk_size-1

for shift in 0:chunk_size:bits-1

    counts .= zero(BinType)

    # count how many elements fall into each chunk-value bucket;
    # the +2 offset makes the prefix sum below land on bucket start positions
    for x in vs
        idx = (x >> shift)&mask + 2
        counts[idx] += one(BinType)
    end

    # prefix sum: counts[i] now holds the number of elements with chunk value < i-1
    sum = counts[1]
    for i in 2:2^chunk_size
        sum += counts[i]
        counts[i] = sum
    end

    # stable placement into the scratch buffer ts
    for x in vs
        i = (x >> shift)&mask + 1
        j = counts[i] += 1
        ts[j] = x
        #sortperm here
    end

    vs, ts = ts, vs  # swap roles: the freshly placed data becomes the next pass's input

end
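
For context, the free variables in this snippet would be set up along these lines (my guesses at reasonable choices, not part of the quoted code):

BinType = UInt32                        # wide enough to count length(vs) elements
chunk_size = 8                          # bits per pass
vs = rand(UInt64, 10^6)                 # data to sort, already serialized to unsigned ints
ts = similar(vs)                        # scratch buffer, swapped with vs after each pass
bits = 64 - minimum(leading_zeros, vs)  # highest used bit position in the data

After the loop, vs refers to the sorted buffer because of the final swap.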

Outside of the loop exit condition, it happens to be entirely branch-free, but I doubt that makes a substantial difference, because the branches associated with special-case detection would also be easily handled by a branch predictor, and where there were mispredictions, the special case would make them worth it.

This is precisely what I specify in the docstrings of my serialization methods (see OP)

No, I do not define any types for serialization (see the OP for the implementation if you’d like, though later I will have a heavily commented implementation for a PR that will be more readable).

Yes. I serialize Vector{Int32} to Vector{UInt32} and Vector{Float64} to Vector{UInt64}.

This is a valid objection, and indeed these two passes are both unnecessary: the first serialization can be a combined serialize & radix pass, and the deserialization can also be a combined deserialize & radix pass. Given the marked performance improvements of the system I have now over what is currently in Julia/Base, I have decided to focus on simplicity & readability, but I imagine removing these extra array accesses at some point.

A slightly more accurate runtime analysis is N/n + WEIGHT * 2^n, which leads to Lambert’s W function instead of log2. I currently minimize N/n + 2^n, but I plan to perform extensive benchmarking and potentially reconsider.

LSD radix avoids this entirely

Which places the approximation at strictly more comparisons than getting the right answer the first time. Do you think that the middle 90% of many datasets can be represented in substantially lower bit-width than the entire dataset?

As I’ve said before, this is not for variable width sorting (including string sorting).

You have alluded multiple times to a “way that you do it”. Do you happen to have a legible implementation I could read, ideally one that can be benchmarked or that is implemented in Julia? If you don’t have an implementation but do have a full algorithm in mind, it might help clarify what you mean if you describe your whole idea.

A lot of the questions you raise would be easily answered with benchmarking, and I am currently working on a sorting benchmark package (this portion of the project is on hold until that benchmark+test suite is ready to go). Hopefully I can make the implementation legible enough for you to enjoy reading; then you can make your proposals concretely, and you or I can test whether they result in a runtime improvement!

LSD radix sort is new to me, but the idea is compelling, at least for sorting fixed-size data. It looks “more radix” than my MSD variant with its switches to other sorting procedures for small buckets. Of course, it requires a stable sort in all subpasses (I have an unstable variant which reorders the sorting permutation in place, avoiding the allocation of a second permutation array).

Your code snippet shows that you do in-place sorting with a data (or serialization) copy and do not use an index permutation to sort. For small temporary data which needs to be sorted, this is fine. Do you have real use cases for sorting huge data which do not require preservation of the original data? Think of tabular data held in column arrays (I think the dominant design in Julia): in-place sorting of only one column would destroy the table, and in-place reordering of all columns is quite expensive.

What function do you mean here? Is n the chunk bit width?

Very good idea. It eliminates the passes and the additional array read accesses in them, and due to your in-place sorting, there are no additional array write accesses anymore. Should be substantially faster for cases with few radix passes.

I think having 10% of data outside the lower bitwidth is way too much. If we sort 2^30 elements, even 0.1% is 2^20 elements.

One case I have in mind is developer colleagues who use typemax(DatatypeToSort) as a special value, e.g. for “undefined”, as I did in the very distant past :wink:

For a relevant outlier case, consider data values 1/x^2 of a uniformly distributed random variable x in interval (0,1] (in physics: gravitation force between two masses with distance x).

A third case is my UTF-16 sorting example, reducing 16 bits to 6-7 bits for about 99% of the cases: according to Alphabet und Buchstabenhäufigkeit: Deutsch, the German “äöüß” have a cumulative frequency of 1.86% among all letters, and there are frequent non-letters like the blank.

All those cases could be treated by a suitable data transformation before sorting, but that would require use case knowledge. We are talking about a library implementation which “simply works”, no brain investment needed by the application programmer.

Back to your question: I think it is worth spending up to 0.1% of the total sort effort on sample processing to detect cases where we can efficiently reduce the bit width a lot, even if that happens in only 1% of the cases.

Do you have some more info or a link for that analysis? It’s not clear to me why/how you use Lambert’s W function (the inverse of x -> x*e^x).

I admit my heuristic is not an analysis but mostly a gut feeling that my O(M) term should become an O(N), “balancing” the bitwidth-related effort against the element-count-related effort. Because LSD radix sort has the same N in all passes, having a constant bit width is plausible, and your heuristic can directly aim at the total runtime complexity including all passes.

This is what LSD radix sort makes so attractive to me.

As I already mentioned, there is an old Java implementation. The core implementation has 1185 LOC and is (I think) well commented. I will send it to you. Feel free to have a look at it, in particular the comments and the sampling implementation. But don’t expect too much; it is really old (Java 1.2).

I totally agree.

It makes no sense to benchmark my old code: it requires my framework for execution, and … I switched to Julia for performance reasons. I expect that Java implementation to be much, much slower than a reimplementation in Julia. I tried hard to avoid virtual method overhead, but that sacrifices generic code in Java. Reading an element from the source data IS a virtual method call: not inlineable, an indirect call via the method table …

Fantastic!
I am really interested in benchmarking my MSD concept against your LSD radix sort. I will reimplement it in an open-source Julia package so that it can be included in your benchmarks, and we can review concrete code and ideas to improve both implementations.

The downside: it could take a while. Julia currently has hobby status for me. I will start on a first version this weekend, but … no promises.

2 Likes

Sorry I didn’t clarify. Unless otherwise specified I tend to assume any ns appearing in O(n) refer to input array size. In this case, I mean n = length(input).

We already special-case NaN in a pre-processing pass. Perhaps we should also special case Inf?

using Statistics
used_bits(x) = sizeof(x)*8 - leading_zeros(x)

used_bits(reinterpret(UInt64, Inf) - reinterpret(UInt64, 1.0))  # 63
used_bits(reinterpret(UInt64, 17.0) - reinterpret(UInt64, 1.0)) # 55
used_bits(reinterpret(UInt64, 17.0) - reinterpret(UInt64, 0.0)) # 63

Are 8 bits out of 63 significant? Perhaps, but saving 13% when sorting strictly positive floats that contain Inf is not my top priority, especially if it comes with any detection cost.

x = rand(10^7);
data = 1 ./ (x .^ 2);
a, b, c, d = reinterpret(UInt64, quantile!(data, [0, .05, .95, 1]));

used_bits(d)     # 63: naive required bit width
used_bits(d - a) # 58: after a compression pass (which can coincide with the
                 #     second sorting pass) subtracting out the minimum value
used_bits(c - b) # 56: only sorting the middle 90% of the data

2 bits isn’t substantial in my book.

I’m working on fixed bit-width.

Yep! If the input array has length n and we need to sort a bit width of m, we get to choose a chunk length (in bits) x. We then need to conduct m/x passes, each of which takes about k1*n time to go through the array and k2*2^x time to clear out and compute a cumulative sum over the count vector. Looking at my implementation in post 27, we see this runtime evident in the loops for x in vs and for i in 2:2^chunk_size. Now we have, approximately, runtime = (m/x)*(k1*n+k2*2^x). Compute the derivative with respect to x:

d/dx[(m/x)*(k1*n+k2*2^x)]
= (m/x)*d/dx[(k1*n+k2*2^x)] + d/dx[(m/x)]*(k1*n+k2*2^x)
= (m/x)*d/dx[(k2*2^x)] + (-m/x^2)*(k1*n+k2*2^x)
= (m/x)*k2*d/dx[(2^x)] + (-m/x^2)*(k1*n+k2*2^x)
= (m/x)*k2*log(2)*(2^x) + (-m/x^2)*(k1*n+k2*2^x)

and set it to zero to find a minimum

0 = (m/x)*k2*log(2)*2^x + (-m/x^2)*(k1*n+k2*2^x)
(m/x)*k2*log(2)*2^x = (m/x^2)*(k1*n+k2*2^x)
x*k2*log(2)*2^x = (k1*n+k2*2^x)
(x*k2*log(2)-k2)*2^x = k1*n
(x*k2*log(2)-k2)*e^(log(2)*x) = k1*n
let y = log(2)*x
(k2*y-k2)*e^y = k1*n
(k2*(y-1))*e^y = k1*n
let z = y-1
(k2*z)*e^(z+1) = k1*n
(e*k2*z)*e^z = k1*n
ze^z = k1*n/k2/e
Solve for z using Lambert's W function and conclude
x = y/log(2) = (z+1)/log(2) = (lambertw(n*k1/k2/e)+1)/log(2)
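
As a sanity check of that closed form, here is a sketch of evaluating it with the lambertw function from the LambertW.jl package (k1 and k2 are machine-dependent constants to be fit by benchmarks):

using LambertW

optimal_chunk_bits(n; k1 = 1.0, k2 = 1.0) =
    (lambertw(n * k1 / (k2 * ℯ)) + 1) / log(2)

optimal_chunk_bits(10^7)  # ≈ 19.6 with k1 == k2, i.e. roughly 20-bit chunks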

Lovely! This is great! I look forward to seeing a benchmarkable Julia implementation! I’ve reached basic functionality for sorting benchmarks with GitHub - LilithHafner/SortMark.jl, and plan to robustly test and benchmark my proposed radix sort soon.

Sorting in Julia also has hobby status for me, so no promises here either.

My example was meant to represent the data values as integers, not reals. Converting them to Float32 is already a transformation to speed up sorting, if exact integer values are not needed.

Taking your example code, I get:

using Statistics
x = rand(10^7);
data = 1 ./ (x .^ 2);
a,b,c,d = Statistics.quantile!(data, [0, .001, .999, 1]);
#used_bits(d) # Naive required bit-width
println("min=",a,", 0.1%=", b, ", 99.9%=",c,", max=",d)

min=1.0000002680664417, 0.1%=1.0020040320720027, 99.9%=1.0143054653596535e6, max=1.6971240550316282e15

If you truncate (not reinterpret!) those reals to UInt64 (maybe you need to replace too large real values by typemax(UInt64)), we have (rounded) min=1, emin = 1, emax = 1e6, max = 1e15.

In other words: 99.9% of the values can be expressed in about 20 bits, all values need about 50 bits, we have a reduction by 30 bits or more than 50% of the bits.

Calculating min and max (or estimates for them) for floating-point numbers which are to be sorted by radix sort does not help to reduce the number of significant bits in the serialization, because you could only spare very few bits of the exponent representation. For floats to sort, I would always go with a priori values and no min/max calculation.

That would incorrectly compare 1.1 and 1.9 as equal, right?

A misunderstanding?

My point was not about sorting a Float64 dataset by truncating the Float values to Int values and sorting those Ints. We totally agree that this would not lead to a correct sort order for the original Float dataset, due to precision losses.

You asked for examples where sampling could reduce the number of bits for radix sort significantly, leaving out a small fraction of cases which have to be sorted by other means. I gave some examples of data sets which have this property, in my opinion.

1/x^2 is an example of a transformation leading to a distribution with extreme outliers when applied to uniformly distributed values in the interval (0,1]. If we have an integer dataset which conforms to such a distribution with extreme outliers, there are cases where you can reduce the bit width by 50% when the 0.1% of outliers are removed.

Nice! Thanks for the proof sketch.

In domains where floor(1/x^2) is a transformation that occurs with reasonable frequency, I suppose a majority-outlier approach may be warranted (still only if the detection cost is minimal).

I’m trying to hit every primitive type with a single approach right now, but I think your optimizations may be helpful (pending benchmarks) for specializing between types.