By the way, what’s the most efficient way to swap bits 1-8 with bits 9-16, so that the order becomes 9-16, 1-8 instead of 1-8, 9-16? E.g.
function pre_getidx(vsi, j)
    # pull out the (2j-2)-th and (2j-1)-th bytes (0-indexed) of vsi and swap them
    ((vsi >> ((j*2-1)*8)) & 0xff) | (((vsi >> ((j*2-2)*8)) & 0xff) << 8)
end
function getidx(vsi, j)
    # +1 to turn it into a 1-based bucket index
    Int(pre_getidx(vsi, j)) + 1
end
pre_getidx(0x00006162, 1)
# yields 0x00006261
which is what I want, but I’m wondering whether the above bit-shifting approach is the fastest?
The background is that unsafe_load(Ptr{UInt}) can load the underlying bytes reasonably quickly, but because I am on a little-endian platform the bytes come out back to front. So if I want to sort by two bytes at a time using radix sort, I need to “reverse” the bytes as above.
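As a minimal sketch of what I mean, assuming a little-endian host and ignoring GC-safety for brevity:

s = "ab"                                    # bytes in memory: 0x61 ('a'), 0x62 ('b')
v = unsafe_load(convert(Ptr{UInt16}, pointer(s)))
# v == 0x6261 on a little-endian host, i.e. the second character ends up in the high byte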
I did a quick implementation of MSD string radix sort in C (by the way, most of the C radix sorts I found on Google were “joke” implementations) and it was slow compared to the implementation in Julia, so it’s not very useful as a benchmark since it’s so unoptimised. I also found this lecture, which contains some useful algorithms to try.
Here is how I am calling the C radix sort from Julia:
cradixsort(x) = ccall(
    (:radix_sort, "libuntitled3"),   # assumed C signature: void radix_sort(char **strings, size_t n)
    Void,
    (Ptr{Ptr{UInt8}}, UInt),
    pointer(x), length(x)
)
const M = 100_000_000; const K = 100;
srand(1);
svec = rand(["id"*dec(k,10) for k in 1:M÷K], M);
# yy = svec[1:2_000_000];
@time cradixsort(svec) # 2.5 minutes
issorted(svec)
using FastGroupBy
@time radixsort!(svec) # 18 seconds
issorted(svec)
R’s Internals documentation discusses how R stores strings: it uses a kind of cache/lookup table for its strings, probably to save on RAM usage. So it’s not really fair to compare with the R implementation yet.
Two potential advantages are lower RAM usage and very fast comparison, because strcmp becomes a pointer comparison rather than a character-by-character one. The trade-off is that every string construction then requires a hash-table lookup.
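As a rough sketch of the interning idea (a hypothetical intern helper backed by a Dict, not R’s actual machinery):

const POOL = Dict{String,String}()      # hypothetical global pool of unique strings
intern(s::String) = get!(POOL, s, s)    # hash-table lookup on every construction

a = intern("id0000000001")
b = intern(string("id", "0000000001"))  # equal contents, built separately
pointer(a) == pointer(b)                # true: both refer to the single pooled object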
Those are just synonyms for bswap, such that if you use hton on a little-endian host it will convert the value to network order (i.e. big-endian), while on a big-endian host it’s a no-op.
You should use hton rather than ntoh, though, just to keep the semantics correct, even though the two functions end up being the same on all current machines.
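Applied to the byte-pair extraction above, that would look something like this sketch (hypothetical helper name; assumes vsi is a whole word loaded on a little-endian host):

# hton reverses the bytes here, so the first character lands in the most significant byte
# and the j-th two-byte group can be shifted down into the low 16 bits
pre_getidx_hton(vsi, j) = (hton(vsi) >> ((sizeof(vsi) - j*2) * 8)) % UInt16

pre_getidx_hton(0x00006162, 1)   # 0x6261, matching pre_getidx above (but as a UInt16)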
My two cents: I’d mention the fact that R uses string interning directly when presenting the results. People may not read the full post, or may just skip it, in which case they will not understand the full story. Also, interning strings requires an additional computation step that is not accounted for, so this is not a completely fair comparison. For example, if you computed the time required to read a CSV file and sort it, R’s advantage would probably be less striking.
When you mention InternedStrings.jl, you could also point to PooledArrays and CategoricalArrays, which can be useful alternatives when you have a small proportion of unique values. So while it may seem at first that Julia is lacking compared with R because it does not intern strings, you actually have three different choices in Julia depending on your needs. OTOH, in R you cannot opt out of string interning when it’s not useful for your use case.
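A rough sketch of what those alternatives look like (based on the packages’ basic constructors, so treat this as indicative only):

using PooledArrays, CategoricalArrays

pa = PooledArray(svec)      # one integer code per element plus a pool of the unique strings
ca = categorical(svec)      # similar idea, with support for levels and ordering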
I may be missing something about the exercise, but if I have N unique strings among K ≫ N values, what’s the point of sorting the K values? Wouldn’t it be better to just sort the N unique values?
Also, if you are thinking about interning and want repeated access to the sorted values, consider sorted containers.
My impression is that you are focusing on low-level optimizations, which is occasionally useful, but rarely gives more than 2-5x improvements in an already optimized language like Julia. Thinking about the broad algorithm usually has higher payoffs; @ChrisRackauckas recently wrote a nice blog post about this.
I’d be happy with a 2-5x performance improvement for sorting! The cover photo shows that radix sort is faster than what Julia has in Base. My test cases were 100m records; R can sort them in 20 seconds, but it used to take 60 seconds in Julia. Now it takes 20~30 seconds in some cases. Julia can appear slow at data manipulation tasks (vs R’s data.table), so this is just one of many steps towards better performance.
Also, O(N log N) is the bound for comparison-based sorting, and O(wN) is the bound for radix sort, where w is the number of passes (6 for 64-bit data, given SortingAlgorithms sorts 11 bits at a time); it is one of the best-known non-comparison-based sorts. I’d be happy to find algorithms that can beat this and give a 5x+ performance increase.
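As a back-of-the-envelope comparison for the N = 100m case:

cld(64, 11)          # 6 radix passes at 11 bits per pass over 64-bit keys
log2(100_000_000)    # ≈ 26.6, the log N factor a comparison sort pays per element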
For radix sort you have to focus on the low level. I did mention a high-level optimisation that might be possible after nalimilan’s comments, though; I have also touched on themes similar to Chris’s posts, e.g. this.