I am working on making sorting of strings faster. The progress so far is that I can sort strings of 8 bytes in size as fast as radixsort in R (which is the fastest algorithm to sort strings found in the R/Python/Julia-verse).
The first set of demo code can be found in the README of FastGroupBy.jl which I shall reproduce here for preservation.
I will now work on a more general algorithm that can sort strings of any length. But I think the current approach might yield solutions that are a bit slower than the C implementation in R, but we shall get close.
Code to produce timings
# Pkg.clone("https://github.com/xiaodaigh/FastGroupBy.jl.git") # run once to install using FastGroupBy const M=100_000_000; const K=100 srand(1) svec1 = rand(["i"*dec(k,7) for k in 1:M÷K], M) @time radixsort!(svec1) #18 seconds issorted(svec1)
and the equivalent R code is
N=1e8; K=100 set.seed(1) library(data.table) id3 = sample(sprintf("i%07d",1:(N/K)), N, TRUE) pt = proc.time() system.time(sort(id3, method="radix")) data.table::timetaken(pt) # 18.9 seconds