Maybe I misunderstood your question - to me, it looks mostly trivial and requires no additional effort.
A radix pass starts with counting the buckets. We initialize an array with index range min:max to zero and increment the counts like this (pseudocode):
for i in min:max
count[i] = 0
end
for i in firstElementIndex:lastElementIndex
count[sortbucket(element[perm[i]],pass)] += 1
end
perm is the current index permutation, which is changed step by step until it finally holds the sequence of sorted elements.
We then have to work down the buckets:
for b in min:max
c = count[b] # c is the size for the next pass
if c < limitForRadixSort
if c < 2
# this is the case in question: we are done, no further pass necessary, bucket elements sorted
else
# we use another sorting algorithm for small sets, because radix sort is inefficient
# if max-min for the next pass is much larger than the number of elements
end
else
# we have to radix sort the elements in the bucket with the next pass
end
end
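To make the two steps concrete, here is a small runnable sketch in Python (the thread's code is Julia-like, but this is just an illustration; the 32-bit width, the byte-sized buckets, and the sortbucket definition are assumptions of this sketch, not the poster's actual code):

```python
# One radix pass over an index permutation: count buckets, then scatter
# indices stably into their buckets via prefix sums. Assumes 32-bit
# values and 8-bit buckets (min = 0, max = 255) for illustration.

def sortbucket(element, pass_no):
    # Extract the pass_no-th byte, counting from the most significant
    # end of a 32-bit value (an assumption for this sketch).
    return (element >> (8 * (3 - pass_no))) & 0xFF

def radix_pass(data, perm, pass_no):
    count = [0] * 256                  # count[b]: number of elements in bucket b
    for i in perm:
        count[sortbucket(data[i], pass_no)] += 1
    # Prefix sums give each bucket its start offset in the new permutation.
    offsets = [0] * 256
    for b in range(1, 256):
        offsets[b] = offsets[b - 1] + count[b - 1]
    out = [0] * len(perm)
    for i in perm:                     # stable scatter into bucket slots
        b = sortbucket(data[i], pass_no)
        out[offsets[b]] = i
        offsets[b] += 1
    return out, count                  # count drives the per-bucket decision

data = [0x01020304, 0x01010101, 0x02000000, 0x01020303]
perm = list(range(len(data)))
perm, count = radix_pass(data, perm, 0)   # pass on the most significant byte
# perm now groups indices by top byte; count tells us, per bucket,
# whether to recurse, switch to a classical sort, or stop (count < 2).
```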
From my point of view, we definitely have to check each bucket count and decide how to sub-sort the elements belonging to that bucket.
The way I’m doing it serializes the full input array in a single pass. It leaves UInts alone, but futzes with the bits in Ints via serialize(::ForwardOrdering, x::Signed) = unsigned(xor(x, typemin(x))) so that if x < y, we’ll have serialize(x) < serialize(y).
That xor is cheap (typically 1 CPU cycle). For primitive types, the expensive part of a sortbucket(element,pass) call is the array access that reads the element from its source array. All operations within sortbucket are register operations taking very few CPU cycles, and of course I assume sortbucket(,) gets inlined. My code recommendation is like
sortbucket(e::T,pass) where T <: Signed = sortbucket(unsigned(xor(e,typemin(T))),pass)
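The order-preserving property of that xor can be checked with a small sketch. This Python simulation (16-bit width and the masking are artifacts of Python's unbounded integers, not part of the original recipe) flips the sign bit exactly as unsigned(xor(x, typemin(T))) does:

```python
# Signed-to-unsigned order-preserving bit flip, simulated for 16-bit
# integers. Flipping the sign bit maps typemin..typemax monotonically
# onto 0..2^16-1, so unsigned comparison reproduces signed order.
BITS = 16
SIGN = 1 << (BITS - 1)        # bit pattern of typemin(Int16)
MASK = (1 << BITS) - 1        # reinterpret the result as unsigned

def serialize16(x):
    # xor with the sign bit, then mask to the unsigned 16-bit range
    return (x ^ SIGN) & MASK

xs = [-5, 3, -1, 0, 7, -32768, 32767]
# Sorting by the serialized key gives the same order as signed sorting.
assert sorted(xs) == sorted(xs, key=serialize16)
```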
We then discard the original array! (or perform serialization in place) and sort the serialized UInts
I prefer sorting an index permutation of the data. In this case, you cannot discard the original array: the sort algorithm gets only a reference to the original data and does not own it, and usually the original data is needed later on in the calling application.
This method then requires a final deserialization pass to convert the sorted UInts back to their original type and value.
Hmm. So you add two full passes - one for serialization, one for deserialization? If we need only 1 or 2 radix sort passes, I think “on-the-fly serialization” is much faster.
Another point: we do not really need a serialization for radix sort which can be deserialized. We need a bit sequence which has the same ordering property as the original data. Consider my string sort example of German strings, where you replace “ö” by “oe” in the serialization. You will not arrive at the original data if you “deserialize” by replacing all “oe” by “ö”. To allow correct deserialization, you must include some “hint” telling which occurrences of “oe” have to be replaced by “ö” on deserialization, and that adds a lot of complexity.
Sorry, my misunderstanding of your AbstractArray{UIntN}. I thought you meant using one of Vector{UInt8}, Vector{UInt16}, … as the concrete type. That would require an additional array access.
Now I understand that you think of a type Serialization{UIntN,T} <: AbstractArray{UIntN}: a serialization type per data type of the original data to sort, with a different memory layout for the serialization of each data type. For primitive Signed you use the corresponding Unsigned as backing storage and access “array elements” with shift/and operations like in my sortbucket(…) example. Then you indeed have no additional array access, just some shift/and operations extracting the bits from the serialization instance. But you still have two more array accesses per element: due to your extra serialize pass, you must read every element and write its serialization UInt into a Vector{Serialization{UIntN,T}}.
And why do you implement all those AbstractArray types? Is your serialization data type also used outside the radix sort? For radix sort, a method like sortbucket(…) is enough; you do not need the full AbstractArray API, and a full implementation of the AbstractArray interface is quite a lot of work.
And my suggestion is to avoid any additional pass, using a priori limits instead of computing min and max.
If N is the number of elements to sort, and M is the number of possible values, aka max-min+1, a radix pass has a computing complexity of O(N)+O(M). O(N) covers the counting pass and the following bucket sort pass, both basically a loop over N elements. O(M) covers the initialization of the counting array and the loop checking the buckets. If N is huge compared to M, accept an M built from a priori limits, without any additional effort. An additional pass of complexity O(N), even if it reduces M to a small fraction, does not pay off. If you have to sort 2^30 numbers of type Int16, just go with min==typemin(Int16) and max==typemax(Int16) and one radix pass.
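The Int16 case can be sketched directly: one counting pass over the full a priori range, no min/max scan at all. A Python illustration (counting_sort_int16 is a name made up here; this materializes the sorted values rather than a permutation, to keep the sketch short):

```python
# One counting-sort pass over the full a priori range of a 16-bit signed
# type. M = 2^16 buckets is negligible when N is much larger than M.
import random

M = 1 << 16
OFFSET = 1 << 15                  # shift typemin(Int16) to bucket 0

def counting_sort_int16(xs):
    count = [0] * M               # O(M): initialize counting array
    for x in xs:                  # O(N): counting pass
        count[x + OFFSET] += 1
    out = []
    for b, c in enumerate(count): # O(M) + O(N): walk buckets, emit values
        out.extend([b - OFFSET] * c)
    return out

xs = [random.randint(-32768, 32767) for _ in range(100_000)]
assert counting_sort_int16(xs) == sorted(xs)
```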
If M is huge compared to N, we have a radix sort problem. The original radix sort algorithm is less efficient than classical sort algorithms if M is huge compared to N*log(N). This happens in two different cases:
(a) small N.
Solution: switch to a simple, classical sort procedure, maybe even bubblesort.
(b) huge M, but still large N
This is the case with string sort or BigInt sort (a priori, M is infinite), but also when sorting Int128 or Int64 on today's IT architectures with a few TBytes of RAM.
Solution 1: your min/max determination. You always do it (compute min/max), but it adds one pass of complexity O(N). You will see a substantial gain if the application programmer chose a memory-wasting format. And I am pretty sure there are many developers simply using Vector{Int} without thinking about the consequences, instead of restricting themselves to Vector{Int16}.
For such cases, you could add two parameters to your external sort method, so that the user can supply application-specific tight apriori min and max values.
Solution 2: multiple radix passes.
We agree on using the n most significant bits, and on choosing n dependent on N. Using the bit width (M in my terminology) as additional input is plausible if you use the same bit count for all passes and want to avoid a final pass with very few bits: using 7-bit chunks for UInt64 would lead to 10 passes, 8-bit chunks to only 8 passes. However, I use different chunk sizes in different subpasses, so this only applies if the 1st pass uses most but not all bits, leaving too few bits for pass 2.
My basic heuristic is: choose n as WEIGHT*log2(N), with WEIGHT a parameter optimized by benchmarks. Rationale: with WEIGHT close to 1, the resulting radix pass has complexity O(N) + O(2^n) = O(N) + O(2^(WEIGHT*log2(N))) = O(N) + O(N) = O(N).
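The heuristic is a one-liner plus a clamp. A Python sketch (chunk_bits, the WEIGHT default of 1.0, and the max_bits clamp of 24 are assumptions of this sketch, not benchmarked values):

```python
# Chunk-size heuristic n = WEIGHT * log2(N): the counting array of a
# pass then has about as many slots as there are elements in the pass.
import math

def chunk_bits(n_elements, weight=1.0, max_bits=24):
    if n_elements < 2:
        return 1
    n = int(weight * math.log2(n_elements))
    return max(1, min(n, max_bits))   # clamp to a sensible bit range

# With 2^20 elements and WEIGHT = 1, a pass uses 20 bits, i.e. 2^20
# buckets, keeping the O(2^n) term at the same scale as the O(N) term.
```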
As result from this pass, we have partitioned our data into 2^n buckets, each bucket collects all elements with n identical most significant bits. Each bucket is treated as a new sorting task.
One extreme case is that almost all elements land in one bucket, and this continues in the next passes. At first glance it looks like a wasted pass, and it may happen if you do not calculate min and max. Having quicksort in mind, one might think this is the worst case, like always choosing min or max as the pivot element in quicksort, where performance degenerates to O(N*N). But the opposite is true: if most buckets have 0 or 1 element, this is a lucky case, because most subpasses are trivial, and the remaining big bucket has complexity O(N) according to my heuristic. We arrive at O(log(M)*N) total complexity if all elements are equal and thus end up in one bucket.
The other extreme is: elements are uniformly spread over the buckets, each bucket has about N>>n elements. We have 2^n subproblems to solve.
If we use the same number of bits in the 2nd pass, the effort explodes: each bucket needs O(N) effort, due to the high bit count. The total effort is O(N)*2^n = O(N)*2^(WEIGHT*log2(N)) = O(N)*N = O(N*N) - and sorting is not done, only pass 2.
Consequence: in subsequent passes, we should use a smaller number of bits, applying the heuristic again.
This weakness applies here as well. My suggestion: use a method signature like sortbucket(element,firstbit,bitcount) instead of sortbucket(element,pass) or serialization[pass]. This variant is also efficiently implementable with SHIFT and AND, and it gives you full control over how many bits you use per pass; you can adjust the bit count easily with your heuristic.
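For illustration, the suggested signature in Python (the 64-bit width is an assumption; in Julia this would compile down to a shift and a mask on registers):

```python
# sortbucket(element, firstbit, bitcount): extract bitcount bits starting
# firstbit bits below the top of a 64-bit value, using only shift and AND.
WIDTH = 64

def sortbucket(element, firstbit, bitcount):
    shift = WIDTH - firstbit - bitcount
    return (element >> shift) & ((1 << bitcount) - 1)

x = 0xABCD000000000000
assert sortbucket(x, 0, 8) == 0xAB   # pass 1: top 8 bits
assert sortbucket(x, 8, 4) == 0xC    # pass 2: next 4 bits, a smaller chunk
```

Because firstbit and bitcount are free per call, each subpass can pick its own chunk size from the heuristic instead of being locked to a fixed pass width.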
Maybe. The larger the sample, the higher the probability that the number of elements outside emin:emax is very small and can be efficiently sorted by a classical sort algorithm. However, if it is >1, we have an extra sort task to solve. The number of comparisons needed to check for min/max is nearly identical: once for every element in the data to sort, plus a very small fraction of that for the sample processing.
I think using a sample to identify an interval containing almost all elements is more promising. I used it for string sort on UTF-16 coded strings, which is the Java string format: the sample tells me whether I have mostly ASCII with/without end-of-line codes, or a substantial share of codes in the upper Unicode segments. In some tests with up to 2^20 strings in European languages and a radix pass per character position, it turned out to be faster.
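The sampling idea can be sketched for the numeric case too. A Python illustration (estimate_range, the sample size of 256, and the Gaussian test data are all inventions of this sketch, not taken from the string-sort experiments above):

```python
# Sample-based range estimation: take a small random sample, use its
# min/max as the presumed interval, and collect the outliers, which
# would be handed to a classical sort as a separate small task.
import random

def estimate_range(data, sample_size=256):
    sample = random.sample(data, min(sample_size, len(data)))
    return min(sample), max(sample)

random.seed(0)                      # deterministic sketch
data = [random.gauss(0, 100) for _ in range(100_000)]
emin, emax = estimate_range(data)
outliers = [x for x in data if x < emin or x > emax]
# Typically only a tiny fraction of the elements falls outside emin:emax,
# so the radix passes can use the tight interval for almost all data.
```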