WIP: faster string sort

Yes, but what Steven is talking about is long-duration functions.

Rather than trying to estimate the time spent in copy!(), you can just use BenchmarkTools’ built-in setup argument which is always run before the code to be timed:

@btime sort!(x) setup=(x = rand(100)) evals=1

Note that evals=1 is important to ensure that the sort!(x) call is run exactly once after each setup call.

9 Likes

I am having trouble making a string sort algorithm as fast as a C implementation. Here are my findings:

I can access the nth byte using codeunit(a_str, n), so I can perform a radix sort on the nth digit; however, some strings in the vector are shorter than n, in which case I need to check the length first, e.g.

function codeunit_check_length(a_str, n)
  if n <= length(a_str)
    return codeunit(a_str,n)
  else
    return 0x00
  end
end

const M = Int(1e8); const K = 100;
srand(1)
svec1 = rand(["i"*dec(k,rand(1:7)) for k in 1:Int(M/K)], M)

However, its run time is 9 seconds:

@time codeunit_check_length.(svec1, 8)

so it isn’t performant enough to sort string vectors containing strings of up to 24 bytes. R’s C implementation allows me to sort the 100 million strings in 20 seconds.

I can use the unsafe_load(Ptr{UInt64}(pointer(x))) method to read in 8 bytes at a time and that’s how I managed to match R’s performance for 8-byte strings. But the algorithm is slower than R’s when the strings are not uniform in length.

I think, for Julia to have an easy-to-write pure-Julia implementation, I need faster access to the bytes, or a faster way to load the bytes of strings with fewer than 8 bytes.

Is there a way to do this?

You want sizeof(a_str) here, not length. length is both wrong and slow (it is O(n) because it needs to count Unicode characters).
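
To illustrate the difference (my example, not from the original reply):

s = "café"        # the 'é' is two bytes in UTF-8
length(s)         # 4  -- number of characters, requires scanning the string
sizeof(s)         # 5  -- number of bytes, a constant-time field lookup
codeunit(s, 5)    # 0xa9 -- the fifth byte (the second byte of 'é')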

4 Likes

Yes. In the actual implementation, I did use sizeof. Still, the timings are about 4 seconds for 100 million rows, so I still need something faster.

Also, don’t time in global scope. Do

foo(v) = codeunit_check_length.(v, 8)
@time foo(svec1)

or, better yet, @btime foo($svec1).

Also, note that you are timing the memory allocation of the output vector, etc. I’m not clear on how exactly you are deciding how fast you need this to be.
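
If you want to exclude that allocation from the measurement, one option (just a sketch, not from the original post; foo! is a hypothetical helper) is to preallocate the output and benchmark an in-place version:

out = zeros(UInt8, length(svec1))                        # preallocate the result once
foo!(out, v) = (out .= codeunit_check_length.(v, 8))     # fused in-place broadcast, no allocation
@btime foo!($out, $svec1)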

1 Like

It would be better to hoist the check outside of codeunit_check_length anyway, i.e. to optimize the functions that call codeunit so that they handle truncated-length strings.
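
Something along these lines, for example (my own sketch of the suggestion, with a hypothetical count_byte! helper): do the length check once per string inside the counting pass of the radix sort, rather than behind every byte access:

# One counting pass of an LSD radix sort over byte position `pos`.
# The length check happens once per string in the loop instead of
# inside a separate per-byte accessor function.
function count_byte!(counts, svec, pos)
    fill!(counts, 0)
    for s in svec
        b = pos <= sizeof(s) ? codeunit(s, pos) : 0x00
        counts[Int(b) + 1] += 1
    end
    return counts
end

counts = zeros(Int, 256)
count_byte!(counts, svec1, 8)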

2 Likes

You might want to try using @inbounds?
codeunit_check_length(str, i) = @inbounds return i <= sizeof(str) ? codeunit(str, i) : 0x0

Again, are you completely sure the R code actually works with the string contents rather than with pointers to the global string pool? Obviously any code will have the same problem with strings of different lengths, not just Julia.

I have been thinking about this. Even if they have a global string pool, they still need to establish the relative order of each string, so I don’t see how the global string pool would provide a speed-up. Also, vanilla sort(strings) in R is slow; only sort(strings, method="radix") is fast. So I think it’s got to do with fast access to the underlying bytes.

I am trying a couple more ideas in Julia based on the discussions here. More to come…

Makes sense. Though they could do something clever by sorting the unique strings only once, and then sorting the full array based on that ordering. If that’s the case, the implementation will take more time when you increase the proportion of unique values.
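
A rough sketch of that idea (my own illustration of the conjecture, not R’s actual implementation):

function sort_via_unique(svec)
    u = sort!(unique(svec))                         # sort only the unique strings
    rank = Dict(s => i for (i, s) in enumerate(u))  # string -> position in sorted pool
    ord = [rank[s] for s in svec]                   # map each element to its rank
    return u[sort(ord)]                             # sorting integers is cheap
end

The fewer unique values there are, the more of the work happens on the small sorted pool rather than on the full array.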

The fact that the default sort is slow is a weak proof IMHO. R manages to get many things slow, and radix sort can be faster than other methods even if applied to pointers.

Yeah, it’s not solid proof, just conjecture, and given that Julia doesn’t have a global string pool, I have to try and push the performance as far as I can. For categorical arrays I already have an algorithm that is demonstrably faster than data.table, so that’s not an issue.

2 Likes

One problem could be the inlined (meaning unavoidable) checkbounds call in codeunit:

julia> @inbounds codeunit(svec1[1],8)
ERROR: BoundsError: attempt to access "i080648"
  at index [8]
Stacktrace:
 [1] checkbounds at ./strings/basic.jl:180 [inlined]
 [2] codeunit(::String, ::Int64) at ./strings/string.jl:74
 [3] top-level scope

So you probably don’t want to use codeunit for a super-fast solution!

A second problem (or solution) could be to use the (undocumented? unguaranteed?) internal structure of String:

julia> all(x->unsafe_load(pointer(x)+sizeof(x))==0x00, svec1)
true

So you don’t need codeunit_check_length if the byte after the last byte of a String is always 0x00.

It could be true with UTF-8 too.
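
For example, something like this could exploit that (my sketch, with a hypothetical byte_or_zero helper; it leans on the unguaranteed NUL terminator, hence the unsafe_ prefix):

# Read the i-th byte, falling back to the terminating 0x00 byte when i is
# past the end of the string. Relies on the undocumented String layout.
byte_or_zero(s::String, i::Int) = unsafe_load(pointer(s), min(i, sizeof(s) + 1))

byte_or_zero("abc", 2)   # 0x62 ('b')
byte_or_zero("abc", 8)   # 0x00 (the terminator)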

Although I am not sure it is a simple task to write an algorithm which doesn’t violate Unicode ordering principles :stuck_out_tongue_winking_eye: (see: http://www.unicode.org/reports/tr10/ ).

If you need just binary comparison/ordering then maybe you could try to ccall strcmp too…
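
For instance (a sketch; strcmp here is the libc function, so this gives plain byte-wise ordering and stops at the first NUL byte):

# Byte-wise "less than" via libc's strcmp.
c_lt(a::String, b::String) = ccall(:strcmp, Cint, (Cstring, Cstring), a, b) < 0

sort(["b", "a", "ab"], lt=c_lt)   # ["a", "ab", "b"]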

I think that’s just because @inbounds doesn’t work at the global scope (it needs inlining):

f() = @inbounds codeunit("i080648",18)
f() # No error
1 Like

Thanks! :slight_smile: Couldn’t it fire a warning?

Using everyone’s ideas, here is what I got. In another thread I had already learned about unsafe_load.(Ptr{UInt}.(pointer.(x))) to load 8 bytes at a time. But it doesn’t guarantee that the bytes beyond the end of the string are set to 0x00, so I can use the code below to bit-shift away the redundant bytes; however, it is 3 times slower than the version without the bit shift. Can it be made faster? Feels like C programming now, by the way…

hhh(x) = (unsafe_load(Ptr{UInt}(pointer(x))) << (8 - sizeof(x)))*8 >> (8- sizeof(x))*8
hh(x) = hhh.(x)
@time hh(svec1) # 9 seconds

I only learned about memory management recently; could this unsafe method cause a segfault, in that it may try to access memory that was not assigned to it? Or is segfault not even the right term?

Based on my testing below, unsafe_load of 8 bytes at a time is almost as fast as unsafe_load of a single byte. But if I add the bit shift then it’s 3x slower. Still, it’s already an improvement over my original check-length method.

const M=100_000_000; const K=100
srand(1)
@time svec1 = rand([string(rand(Char.(1:255), rand(1:8))...) for k in 1:M÷K], M)

# loading the 8th byte
f(x) = unsafe_load.(pointer.(x) .+ 8)
@time f(svec1) # 3 seconds

# load all 8 bytes
hh(x) = unsafe_load.(Ptr{UInt}.(pointer.(x)))
@time hh(svec1) # 3 seconds

function codeunit_check_length(a_str, n)
    if n <= length(a_str)
        return codeunit(a_str,n)
    else
        return 0x00
    end
end

g(x) = codeunit_check_length.(x, 8)
@time g(svec1) # 19 seconds

I thought about another way, which is to mask it with 2^sizeof(string)-1; it is of similar speed to the bit shift:

bitshift(x) = (unsafe_load(Ptr{UInt}(pointer(x))) << (8 - sizeof(x)))*8 >> (8- sizeof(x))*8
rawload(x) = unsafe_load(Ptr{UInt}(pointer(x)))
masking(x) = unsafe_load(Ptr{UInt}(pointer(x))) & UInt(2^sizeof(x)-1)
using BenchmarkTools
@benchmark bitshift("abc") # 5-6 ns
@benchmark rawload("abc")  # 1.5ns
@benchmark masking("abc") # 7ns

My result for bitshift is radically better! (Julia 0.7; Linux; Intel i5)

julia> @btime bitshift("abc") # 5-6 ns
  2.082 ns (0 allocations: 0 bytes)
0x0000000000636261

julia> @btime rawload("abc") #  1.5ns
  1.741 ns (0 allocations: 0 bytes)
0x00007f1800636261

julia> @btime masking("abc") # 7 ns
  7.394 ns (0 allocations: 0 bytes)
0x0000000000636261

PS: You need to change these functions a little. You get bitshift("abc") -> 0x0000000000000000 (the 8 is outside the appropriate parentheses) and masking("abc") -> 0x0000000000000001 (the factor of 8 is missing)…

Maybe change them to this:

julia> bitshift(x) = (unsafe_load(Ptr{UInt}(pointer(x))) << 8(8 - sizeof(x))) >> 8(8- sizeof(x));
julia> bitshift("abc")
0x0000000000636261

julia> masking(x) = unsafe_load(Ptr{UInt}(pointer(x))) & UInt(2^8sizeof(x)-1);
julia> masking("abc")
0x0000000000636261
1 Like

The faster string sort is up. Thanks to all for the ideas and tips! The new string sort, radixsort!, is about 2~3 times faster than Base.sort!. It’s still slower than R, but I guess it’s progress.

# Pkg.clone("https://github.com/xiaodaigh/FastGroupBy.jl.git")
using FastGroupBy

const M=10_000_000; const K=100
srand(1)
svec1 = rand([string(rand(Char.(32:126), rand(1:8))...) for k in 1:M÷K], M)
# using FastGroupBy.radixsort! to sort strings of length 8
@time radixsort!(svec1) # 3 seconds on 10m
issorted(svec1)

srand(1)
svec1 = rand([string(rand(Char.(32:126), rand(1:8))...) for k in 1:M÷K], M)
# using Base.sort! to sort strings of length 8
@time sort!(svec1) # 7 seconds on 10m

srand(1)
svec1 = rand([string(rand(Char.(32:126), rand(1:16))...) for k in 1:M÷K], M)
# using FastGroupBy.radixsort! to sort strings of length 16
@time radixsort!(svec1) # 4 seconds on 10m
issorted(svec1)

srand(1)
svec1 = rand([string(rand(Char.(32:126), rand(1:16))...) for k in 1:M÷K], M)
# using Base.sort! to sort strings of length 16
@time sort!(svec1) # 8 seconds
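
For readers who just want the core trick from this thread in one place, here is a much-simplified sketch of my own (prefix_key and sort_by_prefix are hypothetical names, not the actual FastGroupBy.jl code): sort by an 8-byte prefix loaded as a UInt. The real radixsort! does considerably more, e.g. handling strings longer than 8 bytes with multiple passes.

# Pack the first (up to) 8 bytes of a string into a UInt64 whose integer
# order matches byte-wise lexicographic order of the prefix.
# Assumes a little-endian machine; the unsafe_load may read past the end
# of short strings, which the String allocation makes safe in practice
# but is not guaranteed.
function prefix_key(s::String)
    k = unsafe_load(Ptr{UInt64}(pointer(s)))
    shift = 8 * (8 - min(sizeof(s), 8))
    k = (k << shift) >> shift      # zero the garbage bytes past the string
    return ntoh(k)                 # make the first byte the most significant
end

# Sort by the prefix key only; strings sharing an 8-byte prefix would still
# need a tie-breaking pass in a real implementation.
sort_by_prefix(svec) = svec[sortperm(prefix_key.(svec))]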

I should probably add that the better performance is independent of my changes to the functions! It seems to be just a different environment (maybe Julia version, maybe OS, maybe HW).

The difference seems to be huge…