WIP: faster string sort

Out of curiosity, have you looked at R’s source code? (note: it is under the GPL)

Also, on Julia 0.7 with a Ryzen processor (Julia launched with -O3):

julia> bitshift(x) = (unsafe_load(Ptr{UInt}(pointer(x))) << (8 - sizeof(x)))*8 >> (8- sizeof(x))*8

WARNING: deprecated syntax "call to `*` inside call to bitshift operator" at REPL[1]:1.
Use "parenthesized call to `*`" instead.

WARNING: deprecated syntax "call to `*` inside call to bitshift operator" at REPL[1]:1.
Use "parenthesized call to `*`" instead.
bitshift (generic function with 1 method)

julia> rawload(x) = unsafe_load(Ptr{UInt}(pointer(x)))
rawload (generic function with 1 method)

julia> masking(x) = unsafe_load(Ptr{UInt}(pointer(x))) & UInt(2^sizeof(x)-1)
masking (generic function with 1 method)

julia> using BenchmarkTools

julia> @benchmark bitshift("abc") # 5-6 ns
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.222 ns (0.00% GC)
  median time:      1.233 ns (0.00% GC)
  mean time:        1.248 ns (0.00% GC)
  maximum time:     16.611 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark rawload("abc")  # 1.5ns
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.222 ns (0.00% GC)
  median time:      1.232 ns (0.00% GC)
  mean time:        1.249 ns (0.00% GC)
  maximum time:     58.880 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark masking("abc")
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.222 ns (0.00% GC)
  median time:      1.232 ns (0.00% GC)
  mean time:        1.231 ns (0.00% GC)
  maximum time:     4.890 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark bitshift("abcdefghijklmnopqrstuvwxyz") # 5-6 ns
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.222 ns (0.00% GC)
  median time:      1.232 ns (0.00% GC)
  mean time:        1.248 ns (0.00% GC)
  maximum time:     21.651 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark rawload("abcdefghijklmnopqrstuvwxyz")  # 1.5ns
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.222 ns (0.00% GC)
  median time:      1.223 ns (0.00% GC)
  mean time:        1.234 ns (0.00% GC)
  maximum time:     12.614 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark masking("abcdefghijklmnopqrstuvwxyz")
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.169 ns (0.00% GC)
  median time:      5.180 ns (0.00% GC)
  mean time:        5.235 ns (0.00% GC)
  maximum time:     24.166 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> versioninfo()
Julia Version 0.7.0-DEV.2778
Commit 622ab91* (2017-12-06 21:59 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  BLAS: libopenblas (ZEN)
  LAPACK: liblapack
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, znver1)
Environment:

This doesn’t mean they’re equally fast, just that they’ve hit the floor (about 1.2ns). So…

julia> @generated function do_f_x_times(f, ::Val{x}, args...) where x
           :(Base.Cartesian.@nexprs $x i -> f(args...) )
       end
               
do_f_x_times (generic function with 1 method)

julia> do_f_x_times(println, Val(3), "hi")
hi
hi
hi

julia> @benchmark do_f_x_times(bitshift, Val(10), "abc")
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.550 ns (0.00% GC)
  median time:      5.591 ns (0.00% GC)
  mean time:        5.755 ns (0.00% GC)
  maximum time:     55.595 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark do_f_x_times(rawload, Val(10), "abc")
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     7.200 ns (0.00% GC)
  median time:      7.873 ns (0.00% GC)
  mean time:        12.832 ns (31.45% GC)
  maximum time:     28.613 μs (99.96% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark do_f_x_times(masking, Val(10), "abc")
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     29.875 ns (0.00% GC)
  median time:      29.885 ns (0.00% GC)
  mean time:        30.501 ns (0.00% GC)
  maximum time:     91.197 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995