WIP: faster string sort

I also noticed that it’s quite expensive to assign a string to a new location in another array, e.g.

a = ["abc","bdfd","cdgdd"]
b = similar(a)
b[2] = a[1]

is more expensive than

a = [1:3...]
b = similar(a)
b[2] = a[1]

Perhaps WeakRefStrings can help, and it might also be faster to sort an index rather than the string vector itself. Will investigate later.
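A minimal sketch of the index-sorting idea, using plain `sortperm` as a stand-in for a future radix-based index sort:

```julia
# Sort a permutation index instead of the string vector: during the
# sort only machine-word integers move, and the strings are gathered
# once at the end.
a = ["cdgdd", "abc", "bdfd"]
idx = sortperm(a)   # permutation that puts a in sorted order
b = a[idx]          # single final gather of the strings
```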

There is already a minor boost in speed now that the algorithm sorts two bytes at a time instead of one byte.

By the way, what’s the most efficient way to swap bits 1–8 with bits 9–16, so that the bit order becomes 9–16, 1–8 instead of 1–8, 9–16? E.g.

function pre_getidx(vsi, j)
    ((vsi >> ((j*2-1)*8)) & 0xff) | (((vsi >> ((j*2-2)*8)) & 0xff) << 8)
end

function getidx(vsi, j)
    Int(pre_getidx(vsi, j)) + 1
end

pre_getidx(0x00006162, 1)
# yields 0x00006261

which is what I want, but I wonder whether the above bit-shifting approach is the fastest?

The background is that unsafe_load(Ptr{UInt}) can load the underlying bytes reasonably quickly, but because I am on a little-endian platform the bytes come out back to front. So if I want to sort by two bytes at once using radix sort, I need to “reverse” the bytes as above.

Out of curiosity, have you looked at R’s source code? (note: it is under the GPL)

Also, on Julia 0.7 with a Ryzen processor (Julia launched with -O3):

julia> bitshift(x) = (unsafe_load(Ptr{UInt}(pointer(x))) << (8 - sizeof(x)))*8 >> (8- sizeof(x))*8

WARNING: deprecated syntax "call to `*` inside call to bitshift operator" at REPL[1]:1.
Use "parenthesized call to `*`" instead.

WARNING: deprecated syntax "call to `*` inside call to bitshift operator" at REPL[1]:1.
Use "parenthesized call to `*`" instead.
bitshift (generic function with 1 method)

julia> rawload(x) = unsafe_load(Ptr{UInt}(pointer(x)))
rawload (generic function with 1 method)

julia> masking(x) = unsafe_load(Ptr{UInt}(pointer(x))) & UInt(2^sizeof(x)-1)
masking (generic function with 1 method)

julia> using BenchmarkTools

julia> @benchmark bitshift("abc") # 5-6 ns
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.222 ns (0.00% GC)
  median time:      1.233 ns (0.00% GC)
  mean time:        1.248 ns (0.00% GC)
  maximum time:     16.611 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark rawload("abc")  # 1.5ns
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.222 ns (0.00% GC)
  median time:      1.232 ns (0.00% GC)
  mean time:        1.249 ns (0.00% GC)
  maximum time:     58.880 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark masking("abc")
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.222 ns (0.00% GC)
  median time:      1.232 ns (0.00% GC)
  mean time:        1.231 ns (0.00% GC)
  maximum time:     4.890 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark bitshift("abcdefghijklmnopqrstuvwxyz") # 5-6 ns
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.222 ns (0.00% GC)
  median time:      1.232 ns (0.00% GC)
  mean time:        1.248 ns (0.00% GC)
  maximum time:     21.651 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark rawload("abcdefghijklmnopqrstuvwxyz")  # 1.5ns
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.222 ns (0.00% GC)
  median time:      1.223 ns (0.00% GC)
  mean time:        1.234 ns (0.00% GC)
  maximum time:     12.614 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark masking("abcdefghijklmnopqrstuvwxyz")
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.169 ns (0.00% GC)
  median time:      5.180 ns (0.00% GC)
  mean time:        5.235 ns (0.00% GC)
  maximum time:     24.166 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> versioninfo()
Julia Version 0.7.0-DEV.2778
Commit 622ab91* (2017-12-06 21:59 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  BLAS: libopenblas (ZEN)
  LAPACK: liblapack
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, znver1)
Environment:

This doesn’t mean they’re equally fast, just that they’ve all hit the floor (about 1.2 ns). So…

julia> @generated function do_f_x_times(f, ::Val{x}, args...) where x
           :(Base.Cartesian.@nexprs $x i -> f(args...) )
       end
               
do_f_x_times (generic function with 1 method)

julia> do_f_x_times(println, Val(3), "hi")
hi
hi
hi

julia> @benchmark do_f_x_times(bitshift, Val(10), "abc")
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.550 ns (0.00% GC)
  median time:      5.591 ns (0.00% GC)
  mean time:        5.755 ns (0.00% GC)
  maximum time:     55.595 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark do_f_x_times(rawload, Val(10), "abc")
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     7.200 ns (0.00% GC)
  median time:      7.873 ns (0.00% GC)
  mean time:        12.832 ns (31.45% GC)
  maximum time:     28.613 μs (99.96% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark do_f_x_times(masking, Val(10), "abc")
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     29.875 ns (0.00% GC)
  median time:      29.885 ns (0.00% GC)
  mean time:        30.501 ns (0.00% GC)
  maximum time:     91.197 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

Maybe this could be an inspiration: http://www.nik.no/2011/1-4-maus11ParallelRadixSorting

I did a quick implementation of MSD string radix sort in C (by the way, most C radix sorts found on Google were “joke” implementations) and it was slow compared to the implementation in Julia. So it’s not very useful as a benchmark, since it’s so unoptimised. I also found this lecture, which contains some useful algorithms to try.

Here is how I am calling the C radix sort from Julia:

cradixsort(x) = ccall(
    (:radix_sort,"libuntitled3"),
     Void, 
     (Ptr{Ptr{UInt8}},UInt),
    pointer(x), length(x)
)

const M = 100_000_000; const K = 100;
srand(1);
svec = rand(["id"*dec(k,10) for k in 1:M÷K], M);

# yy = svec[1:2_000_000];
@time cradixsort(svec) # 2.5 minutes
issorted(svec)

using FastGroupBy

@time radixsort!(svec) # 18 seconds
issorted(svec)

R’s Internals documentation discusses how R stores its strings: it uses a kind of “cache/look-up” table, probably to save on RAM usage. So it’s not really fair to compare against the R implementation yet.

Two potential advantages are lower RAM usage and very fast comparison, because strcmp becomes a pointer comparison rather than a character-by-character scan. The trade-off is that every string construction then requires a hash-table look-up.

The computer science name for this technique is string interning. Julia’s Symbol type is interned like that, and I believe I’ve seen posts about (ab)using it for general string interning – but I’m not sure if such a use-case is intended :smile:.
See also: [ANN] InternedStrings.jl: Allocate strings once and reuse them
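For illustration, a small sketch of why interning makes comparison cheap, using `Symbol`, which Julia interns:

```julia
# Two independently built symbols with the same contents are the very
# same interned object, so equality is an identity (pointer) check.
s1 = Symbol("some fairly long string")
s2 = Symbol("some fairly long" * " string")  # constructed separately
s1 === s2   # true: O(1) identity comparison, no character scan
```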

Use bswap(value), it usually ends up being a single instruction (rolw, bswapl, bswapq)
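A quick sketch of what that looks like for the two-byte case from the earlier question (bswap behaves this way on all integer widths):

```julia
# bswap reverses the byte order of an integer, typically compiling to
# a single instruction; for a UInt16 it does what pre_getidx computed.
x = 0x6162                  # UInt16 with bytes 0x61, 0x62
bswap(x)                    # 0x6261
bswap(0x0102030405060708)   # 0x0807060504030201 for a full 64-bit word
```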

There is also ntoh for endianness conversion. I will test them out.

Those are just synonyms for bswap, such that if you use hton on a little-endian host it will convert the value to network order (i.e. big-endian), while on a big-endian host it’s a no-op.
You should use hton rather than ntoh, though, just to keep the semantics correct, even though the functions end up being the same on all current machines.

It would be really good if I could get some help with the following posts. It would help me create a faster pure-Julia string sorting algorithm:

Assign values in user-defined primitive type

bswap_int for user-defined primitive type

I have made some improvements to string sorting and opened a PR to SortingAlgorithms.jl.

See

https://github.com/JuliaCollections/SortingAlgorithms.jl/pull/27

I will take a rest on string sorting now and focus on string-grouping instead.

First draft of the blog post about string sorting

https://www.codementor.io/zhuojiadai/draft/f57tf9f8s

Great!

My two cents: I’d mention the fact that R uses string interning directly when presenting the results. People may not read the full post, or may just skip it, in which case they will not understand the full story. Also, interning strings requires an additional computation step that is not accounted for, so this is not a completely fair comparison. For example, if you computed the time required to read a CSV file and sort it, R’s advantage would probably be less striking.

When you mention InternedStrings.jl, you could also point to PooledArrays and CategoricalArrays, which can be useful alternatives when you have a small proportion of unique values. So while it may seem at first that Julia is lacking compared to R because it does not intern strings, you actually have three different choices depending on your needs in Julia. OTOH, in R you cannot opt out of string interning if it’s not useful for your use case.

You can also use Julia’s Symbol type if you need to have something interned (although from what I understand, those will never get garbage collected).

Great point. Added.

Added a side note. R and pandas also have categorical arrays.

I am sure some useRs will start picking a fight about Julia package load speeds and do a comparison. Fun to add though.

I may be missing something about the exercise, but if I have N unique strings among K \gg N values, what’s the point of sorting the K values? Wouldn’t it be better to just sort the N unique values?
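A counting-based sketch of that idea (the function name `lowcard_sort` is made up for illustration): sort only the unique values and expand each by its count.

```julia
# Sort the N unique values, then expand each by its count, instead of
# sorting all K records: O(K + N log N) rather than O(K log K).
function lowcard_sort(v)
    counts = Dict{eltype(v),Int}()
    for x in v
        counts[x] = get(counts, x, 0) + 1
    end
    out = similar(v, 0)
    for k in sort!(collect(keys(counts)))
        append!(out, fill(k, counts[k]))
    end
    out
end
```

This only preserves the multiset of values, so it suits sorting the values themselves; a stable sort of whole records would need the indices as well.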

Also, if you are thinking about interning and want repeated access to the sorted values, consider sorted containers.

My impression is that you are focusing on low-level optimizations, which is occasionally useful, but rarely gives more than 2-5x improvements in an already optimized language like Julia. Thinking about the broad algorithm usually has higher payoffs; @ChrisRackauckas recently wrote a nice blog post about this.

I’d be happy with a 2x–5x performance improvement for sorting! The cover photo shows radix sort is faster than what Julia has in Base. My test cases were 100m records, which R can sort in 20 seconds but which used to take 60 seconds in Julia; now it takes 20–30 seconds in some cases. Julia can appear slow at data manipulation tasks (vs R’s data.table), so this is just one of many steps towards better performance.

Also, O(N log N) is the bound for comparison-based sorting, while O(wN) is the bound for radix sort, where w is the number of passes (6 for 64-bit data, given SortingAlgorithms sorts 11 bits at a time); radix sort is one of the best known non-comparison-based sorts. I’d be happy to find algorithms that can beat this and get a 5x+ performance increase.
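As a quick check on that pass count (`cld` is Julia’s ceiling division):

```julia
# w = number of radix passes = ceil(key_bits / bits_per_pass).
passes(key_bits, bits_per_pass) = cld(key_bits, bits_per_pass)
passes(64, 11)   # 6 passes for 64-bit keys at 11 bits per pass
```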

For radix sort you have to focus on the low level. I did mention high-level optimizations that might be possible after nalimilan’s comments, though; I have also touched on similar themes to Chris’s posts, e.g. this.
