WIP: faster string sort

Makes sense. Though they could do something clever by sorting the unique strings only once, and then sorting the full array based on that ordering. If that’s the case, the implementation will take more time when you increase the proportion of unique values.
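
For concreteness, here is a minimal sketch of that idea (my own illustration, not data.table’s implementation): sort the unique values once, replace each element by its rank in the sorted pool, and then sort the cheap integer codes.

function sort_via_pool(v::Vector{String})
    u = sort!(unique(v))                            # sort each distinct string only once
    rank = Dict(s => i for (i, s) in enumerate(u))  # string -> position in the sorted pool
    codes = [rank[s] for s in v]                    # small integer code per element
    sort!(codes)                                    # cheap integer sort (could be a counting sort)
    return u[codes]                                 # map the sorted codes back to strings
end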

The fact that the default sort is slow is a weak proof IMHO. R manages to get many things slow, and radix sort can be faster than other methods even if applied to pointers.

Yeah it’s not a solid proof, just conjecture, and given that Julia doesn’t have global pools I have to try and push the performance as far as I can. For categorical arrays I already have an algorithm that was demonstrably faster than data.table, so that’s not an issue.


One problem could be the inlined (and hence unavoidable) bounds check in codeunit:

julia> @inbounds codeunit(svec1[1],8)
ERROR: BoundsError: attempt to access "i080648"
  at index [8]
Stacktrace:
 [1] checkbounds at ./strings/basic.jl:180 [inlined]
 [2] codeunit(::String, ::Int64) at ./strings/string.jl:74
 [3] top-level scope

So you probably don’t want to use codeunit for a super-fast solution!

A second problem (or solution) could be relying on the (undocumented? unguaranteed?) internal structure of String:

julia> all(x->unsafe_load(pointer(x)+sizeof(x))==0x00, svec1)
true

So you don’t need codeunit_check_length if the byte after the last byte of a String is equal to 0x00.

It could be true with UTF-8 too.
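
A minimal sketch of how that could be exploited, assuming (as observed above, but not guaranteed by the language) that the byte right after the string data is 0x00:

# Hypothetical: read code unit n without any length check; for n just past
# the end this returns the (apparently) zero padding byte.
unsafe_codeunit(s::String, n::Integer) = unsafe_load(pointer(s), n)

unsafe_codeunit("abc", 3)  # 0x63, i.e. 'c'
unsafe_codeunit("abc", 4)  # 0x00 -- the byte right after the string data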

Although I am not sure it is a simple task to write an algorithm which doesn’t violate the Unicode principles :stuck_out_tongue_winking_eye: of ordering (see: UTS #10: Unicode Collation Algorithm).

If you need just binary comparison/ordering then maybe you could try to ccall strcmp too…

I think that’s just because @inbounds doesn’t work at the global scope (it needs inlining):

f() = @inbounds codeunit("i080648",18)
f() # No error

Thanks! :slight_smile: Couldn’t it fire a warning?

Using everyone’s ideas, here is what I’ve got. In another thread I already learned about unsafe_load.(Ptr{UInt}.(pointer.(x))) to load 8 bytes at a time. But it doesn’t guarantee that the bytes are set to 0x00 where there isn’t a character, so I can do the below to bit-shift away the redundant bytes; however, it is 3 times slower than the version without the bit shift. Can the below be made faster? Feels like C programming now, btw…

hhh(x) = (unsafe_load(Ptr{UInt}(pointer(x))) << (8 - sizeof(x)))*8 >> (8- sizeof(x))*8
hh(x) = hhh.(x)
@time hh(svec1) # 9 seconds

I only just learned about memory recently; would this unsafe method cause a segfault, in that it will try to access memory that was not assigned to it? Or is segfault not even the right term?

Based on my testing below, unsafe_loading 8 bytes at a time is almost as fast as a single-byte unsafe_load. But if I add the bit shift then it’s 3x slower. Still, it’s already an improvement over my original check-length method.

const M=100_000_000; const K=100
srand(1)
@time svec1 = rand([string(rand(Char.(1:255), rand(1:8))...) for k in 1:M÷K], M)

# loading the 8th byte
f(x) = unsafe_load.(pointer.(x) .+ 8)
@time f(svec1) # 3 seconds

# load all 8 bytes
hh(x) = unsafe_load.(Ptr{UInt}.(pointer.(x)))
@time hh(svec1) # 3 seconds

function codeunit_check_length(a_str, n)
    if n <= length(a_str)
        return codeunit(a_str,n)
    else
        return 0x00
    end
end

g(x) = codeunit_check_length.(x, 8)
@time g(svec1) # 19 seconds

I thought about another way, which is to mask it with 2^sizeof(string)-1; it is of similar speed to the bit shift.

bitshift(x) = (unsafe_load(Ptr{UInt}(pointer(x))) << (8 - sizeof(x)))*8 >> (8- sizeof(x))*8
rawload(x) = unsafe_load(Ptr{UInt}(pointer(x)))
masking(x) = unsafe_load(Ptr{UInt}(pointer(x))) & UInt(2^sizeof(x)-1)
using BenchmarkTools
@benchmark bitshift("abc") # 5-6 ns
@benchmark rawload("abc")  # 1.5ns
@benchmark masking("abc") # 7ns

My result for bitshift is radically better! (Julia 0.7; Linux; Intel i5)

julia> @btime bitshift("abc") # 5-6 ns
  2.082 ns (0 allocations: 0 bytes)
0x0000000000636261

julia> @btime rawload("abc") #  1.5ns
  1.741 ns (0 allocations: 0 bytes)
0x00007f1800636261

julia> @btime masking("abc") # 7 ns
  7.394 ns (0 allocations: 0 bytes)
0x0000000000636261

PS. You need to change these functions a little. You get bitshift("abc") -> 0x0000000000000000 (the 8 is outside the appropriate parentheses) and masking("abc") -> 0x0000000000000001 (the factor of 8 is missing)…

Maybe to this:

julia> bitshift(x) = (unsafe_load(Ptr{UInt}(pointer(x))) << 8(8 - sizeof(x))) >> 8(8- sizeof(x));
julia> bitshift("abc")
0x0000000000636261

julia> masking(x) = unsafe_load(Ptr{UInt}(pointer(x))) & UInt(2^8sizeof(x)-1);
julia> masking("abc")
0x0000000000636261

The faster string sort is up. Thanks to all for the ideas and tips! The new string sort, radixsort!, is about 2~3 times faster than Base.sort!. It’s still slower than R, but I guess it’s progress.

# Pkg.clone("https://github.com/xiaodaigh/FastGroupBy.jl.git")
using FastGroupBy

const M=10_000_000; const K=100
srand(1)
svec1 = rand([string(rand(Char.(32:126), rand(1:8))...) for k in 1:M÷K], M)
# using FastGroupBy.radixsort! to sort strings of length 8
@time radixsort!(svec1) # 3 seconds on 10m
issorted(svec1)

srand(1)
svec1 = rand([string(rand(Char.(32:126), rand(1:8))...) for k in 1:M÷K], M)
# using Base.sort! to sort strings of length 8
@time sort!(svec1) # 7 seconds on 10m

srand(1)
svec1 = rand([string(rand(Char.(32:126), rand(1:16))...) for k in 1:M÷K], M)
# using FastGroupBy.radixsort! to sort strings of length 16
@time radixsort!(svec1) # 4 seconds on 10m
issorted(svec1)

srand(1)
svec1 = rand([string(rand(Char.(32:126), rand(1:16))...) for k in 1:M÷K], M)
# using Base.sort! to sort strings of length 16
@time sort!(svec1) # 8 seconds

I probably should add that the better performance is independent of my changes to the functions! It seems to be just a different environment (maybe Julia version, maybe OS, maybe HW).

The difference seems to be huge…

BTW sorting length 1-8 or length 1-16 string vectors of size 10 million in R takes 1.1 seconds and 1.29 seconds respectively :astonished:

So my Julia implementation is not even close to R’s performance yet, and maybe the comparison is unfair because R has this global string pool concept, which I haven’t been able to find out much about.

I was also doing some research on Base.sort! and a strcmp ccall…

Maybe you could find some improvements there too:

julia> strcmp(a::String, b::String) = ccall(:strcmp, Int32, (Cstring, Cstring), pointer(a), pointer(b))
strcmp (generic function with 1 method)

julia> @btime sort!(svec1, lt=(x,y)->strcmp(x,y)<0, alg=QuickSort)
  2.015 s (0 allocations: 0 bytes)

I have switched to VS Code and I already get faster timings than in Juno.

strcmp is quite good in that its performance only degrades a little when comparing long strings, whereas the radixsort solution is bounded by O(l·n), where l is the (average?) length of the strings and n is the number of strings. For longer strings I can see strcmp outperforming radixsort. I tried sorting length 1-8 strings of size 100 million: the radix sort took 26 seconds and strcmp took 56 seconds, so in general radixsort should still be faster. Perhaps the fastest radixsort is to call into a C implementation. But it would be great to not have to resort to C, because Julia is meant to “solve” the two-language problem. Perhaps this is one area where we have to admit defeat? Or Julia will come back with a much more performant way to rearrange the elements of string vectors.

using FastGroupBy

const M=10_000_000; const K=100
srand(1)
svec1 = rand([string(rand(Char.(32:126), rand(1:8))...) for k in 1:M÷K], M)
# using FastGroupBy.radixsort! to sort strings of length 8
@time radixsort!(svec1); # 1.4 seconds
issorted(svec1)

srand(1)
svec1 = rand([string(rand(Char.(32:126), rand(1:8))...) for k in 1:M÷K], M)
strcmp(a::String, b::String) = ccall(:strcmp, Int32, (Cstring, Cstring), pointer(a), pointer(b))
@time sort!(svec1, lt=(x,y)->strcmp(x,y)<0, alg=QuickSort); # 2.7 seconds
issorted(svec1)

srand(1)
svec1 = rand([string(rand(Char.(32:126), rand(1:8))...) for k in 1:M÷K], M)
# using Base.sort! to sort strings of length 8
@time sort!(svec1); # 4 seconds on 10m
issorted(svec1)


srand(1)
svec1 = rand([string(rand(Char.(32:126), rand(1:16))...) for k in 1:M÷K], M)
# using FastGroupBy.radixsort! to sort strings of length 16
@time radixsort!(svec1); # 2.3 seconds
issorted(svec1)

srand(1)
svec1 = rand([string(rand(Char.(32:126), rand(1:16))...) for k in 1:M÷K], M)
strcmp(a::String, b::String) = ccall(:strcmp, Int32, (Cstring, Cstring), pointer(a), pointer(b))
@time sort!(svec1, lt=(x,y)->strcmp(x,y)<0, alg=QuickSort); # 2.7 seconds
issorted(svec1)

srand(1)
svec1 = rand([string(rand(Char.(32:126), rand(1:16))...) for k in 1:M÷K], M)
# using Base.sort! to sort strings of length 16
@time sort!(svec1); # 4.6 seconds on 10m
issorted(svec1)

I think you should define the target to reach if you want to assess how fast your implementation is. R is not a good reference as long as you don’t know how it works exactly, and unless you can find a C implementation to compare to, it’s hard to know whether Julia can reach the same performance or not.

You don’t really want to use strcmp: it’s pretty much a legacy function that should be avoided, since it doesn’t handle \0 bytes inside strings.
What you really want is memcmp; and don’t use Cstring either, directly use Ptr{UInt8}.
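
To illustrate the embedded-\0 point (a made-up example, not from the benchmark below): two strings that differ only after an embedded NUL compare equal under strcmp but not under memcmp.

a = "ab\0x"; b = "ab\0y"

# strcmp stops at the first 0x00, sees only "ab" for both, and reports equality
ccall(:strcmp, Cint, (Ptr{UInt8}, Ptr{UInt8}), pointer(a), pointer(b))               # == 0

# memcmp compares all 4 stored bytes and detects the difference
ccall(:memcmp, Cint, (Ptr{UInt8}, Ptr{UInt8}, Csize_t), pointer(a), pointer(b), 4)   # < 0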

srand(1)

# memcmp("abc","def",3)
memcmp(a::String, b::String, sz) = ccall(:memcmp, Int32, (Ptr{UInt8}, Ptr{UInt8}, Int32), pointer(a), pointer(b), sz)

M = 10_000_000; K = 100
svec1 = rand([string(rand(Char.(32:126), rand(1:8))...) for k in 1:M÷K], M)
@time sort!(svec1, lt=(x,y)->memcmp(x,y,8)<0, alg=QuickSort); # 2.7 seconds
issorted(svec1) # 2.5 seconds

still slower than the radixsort but looks reasonable.

IMHO, on the contrary, it stops looking at memory once 0x00 is reached. Which is fine, because we don’t want a memory access violation, and it seems (although I agree it is probably not guaranteed) that Julia adds a 0x00 after the end of a string. The memcmp solution has to find the minimum of the sizeofs, which is not free.
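
For illustration, here is a sketch of that extra work (my own hypothetical comparator, not the code above): a NUL-safe memcmp comparison has to compute the common byte count and break ties by length, neither of which strcmp needs to do.

# Hypothetical NUL-safe comparator: only reads min(sizeof(a), sizeof(b)) bytes.
function lt_memcmp(a::String, b::String)
    n = min(sizeof(a), sizeof(b))                  # the extra work strcmp gets to skip
    c = ccall(:memcmp, Cint, (Ptr{UInt8}, Ptr{UInt8}, Csize_t),
              pointer(a), pointer(b), n)
    return c < 0 || (c == 0 && sizeof(a) < sizeof(b))  # break ties by length
end

# e.g. sort!(svec1, lt=lt_memcmp, alg=QuickSort)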

I agree that 0x00 cannot appear inside a string (which will be a problem, for example, with UTF-16 encoding), but for UTF-8 and many other encodings it could be acceptable, I think.
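
A small illustration of the encoding point (assumed example): in UTF-8 a 0x00 byte can only come from the character U+0000, whereas UTF-16 produces 0x00 bytes even for ordinary ASCII text.

transcode(UInt8,  "abc")   # UTF-8 code units:  0x61 0x62 0x63      -- no zero bytes
transcode(UInt16, "abc")   # UTF-16 code units: 0x0061 0x0062 0x0063 -- half the bytes are zero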

Does R support strings with 0x00 inside?

radixsort seems to be a good algorithm, and we probably just need to find a really quick function to calculate the radix.

Could you test this conjecture and use some very primitive function to benchmark your algorithm? (maybe radix(s) = s[1], or maybe also radix(s) = 1)
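
One way to run that experiment (my own sketch, with a hypothetical trivial_radix key as suggested): time the key-based ordering separately from the actual rearrangement of the strings.

trivial_radix(s::String) = codeunit(s, 1)        # ~ radix(s) = s[1], reduced to one byte

@time p = sortperm(svec1, by = trivial_radix)    # cost of computing keys and ordering them
@time svec1 = svec1[p]                           # cost of actually rearranging the string vector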

Maybe we could try to write something in C. (I don’t agree that it is losing the war against the two-language problem; maybe just a battle :slight_smile: )

PS. Did you think about RadixSort as a method that could be accepted as the alg parameter of sort!?

Most modern languages do accept strings with 0x00 in them; even C does. It’s really just a convention, used by C literals and the str* family of functions, but having the length stored is also supported with the mem* functions, which can be faster than the str* ones using SIMD instructions, and which don’t have the dangerous buffer overflow issues of \0 termination.

Also, about why it was really a bad choice: http://queue.acm.org/detail.cfm?id=2010365

Yeah, radixsort could be an alg. It is already the case if you load SortingAlgorithms.jl, but the radixsort there doesn’t sort strings, hence the work here.
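
For reference, this is roughly what that usage looks like today with SortingAlgorithms.jl (numeric element types work; Vector{String} does not):

using SortingAlgorithms   # provides RadixSort, TimSort and HeapSort

v = rand(Int, 10_000)
sort!(v, alg=RadixSort)   # radix sort selected via the alg keyword; not defined for Vector{String}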