WIP: faster string sort

xiaodai · December 10, 2017, 11:54am

I am working on making sorting of strings faster. The progress so far is that I can sort strings of 8 bytes in size as fast as radixsort in R (which is the fastest algorithm to sort strings found in the R/Python/Julia-verse).

The first set of demo code can be found in the README of FastGroupBy.jl which I shall reproduce here for preservation.

I will now work on a more general algorithm that can sort strings of any length. But I think the current approach might yield solutions that are a bit slower than the C implementation in R, but we shall get close.

Code to produce timings

# Pkg.clone("https://github.com/xiaodaigh/FastGroupBy.jl.git") # run once to install
using FastGroupBy

const M=100_000_000; const K=100
srand(1)
svec1 = rand(["i"*dec(k,7) for k in 1:M÷K], M)
@time radixsort!(svec1) #18 seconds
issorted(svec1)

and the equivalent R code is

N=1e8; K=100
set.seed(1)
library(data.table)
id3 = sample(sprintf("i%07d",1:(N/K)), N, TRUE)
pt = proc.time()
system.time(sort(id3, method="radix"))
data.table::timetaken(pt) # 18.9 seconds

JeffreySarnoff · December 12, 2017, 2:45am

You will get better timing information using BenchmarkTools and @btime. The argument to the function being timed with @btime has a $ prepended:

Pkg.add("BenchmarkTools") # if needed
Pkg.clone("https://github.com/xiaodaigh/FastGroupBy.jl.git") # if needed

using BenchmarkTools
using FastGroupBy

const M=100_000_000; const K=100; srand(1);
svec1 = rand(["i"*dec(k,7) for k in 1:M÷K], M);

@btime radixsort!($svec1);

sdanisch · December 12, 2017, 10:36am

You should try 0.7! For me it goes from 4.9s to 3.2s (note, I had to copy & paste the algorithm out of the package, since the dependencies don’t seem to work on 0.7)

xiaodai · December 12, 2017, 10:37am

How powerful is your computer?

sdanisch · December 12, 2017, 10:38am

Just kidding, I have a Intel® Core™ i7-6700 CPU @ 3.40GHz × 8

xiaodai · December 12, 2017, 10:45am

I think BechmarkTools.jl is not accurate in this case? It seems to report timings for when svec1 is sorted which is the case after running it once. But if svec1 is not sorted it does take 18 seconds on mine.

kristoffer.carlsson · December 12, 2017, 10:55am

This won’t work well because the vector gets modified after the first sort command and then you are only looking at the time it takes to sort an already sorted vector.

Create a second vector, copy from the first to the second and sort:

julia> svec2 = similar(svec1);

julia> @btime (copy!(svec2, svec1); radixsort!(svec2));

And then benchmark only copy! to see if it is significant (it shouldn’t be).

JeffreySarnoff · December 12, 2017, 10:58am

thank you for the information

stevengj · December 12, 2017, 12:58pm

@time should be accurate for things that take > 1s, as it does here. @btime performs the timing many times and takes the minimum to reduce noise, but this isn’t necessary for long-running code.

JeffreySarnoff · December 12, 2017, 1:10pm

I have been getting very different results from @btime and @time on many kinds of short duration functions. It is good to know that should no longer be a problem.

kristoffer.carlsson · December 12, 2017, 1:12pm

Yes, but what Steven is talking about is long duration functions.

rdeits · December 12, 2017, 4:52pm

Rather than trying to estimate the time spent in copy!(), you can just use BenchmarkTools’ built-in setup argument which is always run before the code to be timed:

@btime sort!(x) setup=(x = rand(100)) evals=1

Note that evals=1 is important to ensure that the sort!(x) call is run exactly once after each setup call.

xiaodai · December 18, 2017, 1:25am

I am having trouble making a string sort algorithm as fast as a C implementation. Here are my findings:

I can access the nth byte using codeunit(a_str, n) so I can perform radixsort on the nth digit; however some strings in a vector is shorter than length n in that case I need to check its length first e.g.

function codeunit_check_length(a_str, n)
  if n <= length(a_str)
    return codeunit(a_str,n)
  else
    return 0x00
  end
end

const M = Int(1e8); const K = 100;
srand(1)
svec1 = rand(["i"*dec(k,rand(1:7)) for k in 1:Int(M/K)], M)

however its run time is 9 seconds

@time codeunit_check_length.(svec1, 8)

so it isn’t performant enough to sort string vectors with up to 24 strings. R’s C implementation allows me to sort the 100 million strings in 20 seconds.

I can use the unsafe_load(Ptr{UInt64}(pointer(x))) method to read in 8 bytes at a time and that’s how I managed to match R’s performance for 8-byte strings. But the algorithm is slower than R’s when the strings are not uniform in length.

I think for Julia to have an easy to write pure Julia implementation I need faster access to the bytes or a method to load the bytes faster for strings with less than 8 bytes.

Is there a way to do this?

stevengj · December 18, 2017, 1:35am

You want sizeof(a_str) here, not length. length is both wrong and slow (it is O(n) because it needs to count Unicode characters).

xiaodai · December 18, 2017, 1:51am

Yes. In the actual implementation, I did use sizeof. Still the timings are about 4 seconds for 100 million rows. So still need something faster.

stevengj · December 18, 2017, 2:20am

Also, don’t time in global scope. Do

foo(v) = codeunit_check_length.(v, 8)
@time foo(svec1)

if not @btime foo($svec1).

Also, note that you are timing memory allocation of the output vector etcetera. I’m not clear on how exactly you are deciding how fast you need this to be.

stevengj · December 18, 2017, 2:24am

Would be better to hoist the check outside of codeunit_check_length anyway. i.e. better to optimize the functions that call codeunit in order to handle truncated-length strings.

ScottPJones · December 18, 2017, 3:20am

You might want to try using @inbounds?
codeunit_check_length(str, i) = @inbounds return i <= sizeof(str) ? codeunit(str, i) : 0x0

nalimilan · December 18, 2017, 9:10am

Again, are you completely sure the R code actually works with the string contents rather than with pointers to the global string pool? Obviously any code will have the same problem with strings of different lengths, not just Julia.

xiaodai · December 18, 2017, 9:59am

I have been thinking about this. Even if they have global string pools they still need to establish the relative order of each string. So I don’t see how the global string pool will provide speed up. Also vanilla sort(strings) in R is also slow, only sort(strings, method="radix") is fast. So I think it’s got to do with fast access to underlying bytes.

I am trying a couple more ideas in Julia based on the discussions here. More to come…

Topic		Replies	Views
Progress towards faster `sortperm` for Strings Data performance , sortperm	16	4695	January 29, 2018
Ironic observation about `sort` and `sortperm` speed for "small integers" vs R Performance sort , sortperm , r	32	4787	February 4, 2018
[ANN]: WIP Strs.jl package ready for alpha review and testing Community	15	3307	April 3, 2018
Request for comments: a roadmap to faster sort() for primitive types using RadixSort Internals & Design sort	32	3387	November 14, 2021
Is `Vector{UInt8}(string)` the fastest way to convert String to bytes? Performance	19	4679	December 10, 2017

WIP: faster string sort

Related topics