Is `Vector{UInt8}(string)` the fastest way to convert String to bytes?

Is Vector{UInt8}(string) the fastest way to convert a String to bytes? From my testing, it seems to perform OK:

using Printf   # @sprintf lives in the Printf stdlib on Julia ≥ 0.7

const M = 100_000_000; const K = 100
const s = rand([@sprintf "id%03d" k for k in 1:K], M)

function ghi(s::Vector{S}) where S<:AbstractString
    prod.(Vector{UInt8}.(s))
end

@time ghi(s)

Background

I was looking into radix sort for strings, for which there is an implementation in R. I can sort 100 million strings of varying length in 6 seconds in R, whereas sorting the same strings in Julia takes 38 seconds; see below:

const M = 100_000_000; const K = 100
idstr = rand([@sprintf "id%03d" k for k in 1:K], M)

@time sort(idstr)
@time sort(idstr)

I think being able to sort vectors of String in a performant manner is key to unlocking data manipulation performance.

That’s why I wanted to understand what’s the most performant way to make the underlying bytes available.


You can call the codeunit(s,i) function to access the i-th byte of a String directly, without creating an Array object.
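For example, using Base's codeunit and sizeof:

s = "abc"
codeunit(s, 1)   # 0x61, the byte for 'a'
codeunit(s, 3)   # 0x63, the byte for 'c'
sizeof(s)        # 3, the number of bytes (code units) in s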


It’s actually quite slow for long strings.

I dug through the source code and found that this is probably the fastest way to get at the underlying bytes:

s = "abc"
pp = pointer(s)
j = 1
Base.pointerref(pp, j, 1)

Is there an efficient way to concatenate all these bytes into something like a UInt128? I tried this approach:

function roughhash(s::AbstractString)
    pp = pointer(s)
    sz = sizeof(s)
    hh = zero(UInt128) | Base.pointerref(pp, 1, 1)
    for j = 2:min(sz, 16)
        hh = (hh << 8) | Base.pointerref(pp, j, 1)
    end
    hh
end
using BenchmarkTools   # for @benchmark

@benchmark roughhash("id0000248523")

I think I need these numbers to go down to 1–3 ns in order to build an efficient algorithm that can eventually get close to R’s string sort implementation.

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.034 ns (0.00% GC)
  median time:      12.764 ns (0.00% GC)
  mean time:        14.569 ns (0.00% GC)
  maximum time:     503.977 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Have you tried using codeunit instead of raw pointers? The codeunit function essentially does the same raw-pointer load, but in a safer way (it does a bounds check on the index, which can be turned off with @inbounds).

If you really needed fast concatenation into, say, UInt64, you could just convert to a pointer of that type and call unsafe_load:

julia> s = "foobarasdfkjas;f"
"foobarasdfkjas;f"

julia> p = Ptr{UInt64}(pointer(s))
Ptr{UInt64} @0x0000000111ddf468

julia> unsafe_load(p, 1)
0x73617261626f6f66

Obviously, you have to be extremely careful with raw pointer manipulation, but this is probably the fastest way to “concatenate” the bytes. There are lots of ways for this to break, though.

Note that this only works on String, not on AbstractString. (On AbstractString, you don’t know the encoding/representation.)

(Don’t use Base.pointerref, use the documented function unsafe_load if you must work with pointers.)

julia> function roughhash(s::String)
           n = sizeof(s)
           if n >= 8
               return unsafe_load(Ptr{UInt64}(pointer(s)))
           else
               h = zero(UInt64)
               for i = 1:n
                   @inbounds h = (h << 8) | codeunit(s, i)
               end
               return h
           end
       end

julia> @btime roughhash($("asldfj;asdjfsa"))
  3.473 ns (0 allocations: 0 bytes)
0x613b6a66646c7361

So, it is close to the 3ns you were hoping for.


IIRC when I traced through the source, codeunit makes a call to pointer. So if I go one level lower, can I save sizeof(s) calls to pointer? Or does the compiler already optimise that away? I am not that comfortable with reading lower-level code yet to figure this out for myself.

Looking at @code_native, it seems that the compiler hoists this out of the loop for you, at least with @inbounds.
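A minimal way to check this yourself (bytesum is just a made-up example loop to inspect):

function bytesum(s::String)
    t = zero(UInt)
    @inbounds for i = 1:sizeof(s)
        t += codeunit(s, i)   # same raw byte load as the pointer version
    end
    t
end

@code_native bytesum("abcdefgh")   # look for the pointer computation before, not inside, the loop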

I wonder if there’s a way to load all the bits into a BitArray even quicker? The source mentions that a BitArray is a contiguous block in memory. The >> operator is already defined, so if there’s an efficient way to mask with & or to convert the last few bits to UInt16, then all the ingredients for an efficient implementation of radix sort for strings are there.

I won’t have access to my computer for today, so I can’t try it now.

No, creating any kind of wrapper object will be slower. Besides, a bitarray doesn’t give you anything that accessing the raw bytes doesn’t, so I’m not sure what you’re hoping to gain here.

A string can be of any length, so if I load into a UInt64 I have to manually manage it with more code for strings longer than 8 bytes, etc. But if I can load into a BitArray then I don’t have to manage so much low-level detail, that’s all.

@xiaodai, you mentioned sorting in R in your first post. If I remember right, R stores strings in a global lookup table, so if you are sorting repeat values, it’ll be much faster. In Julia, a Symbol works similarly to R’s strings, so you may want to try that.
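A rough sketch of what trying Symbols might look like (whether it actually helps will depend on how Symbol comparison is implemented):

using Printf

idstr = rand([@sprintf "id%03d" k for k in 1:100], 1_000_000)
idsym = Symbol.(idstr)   # each distinct string is interned once
@time sort(idsym)        # compare against @time sort(idstr)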

See also PooledArrays for a data structure that should sort very quickly when there are many repeats:

https://github.com/JuliaComputing/PooledArrays.jl
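A hedged sketch of what that might look like (whether sort exploits the pool depends on the PooledArrays version):

using PooledArrays

pooled = PooledArray(idstr)   # stores each distinct string once plus integer refs
@time sort(pooled)            # compare against sorting the plain Vector{String}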

You might be referring to stringsAsFactors, which was the default for storing strings loaded in with, for example, read.csv. In my test example they were just strings, which is the default for the now de facto standard ways of loading data, including readr and data.table’s fread. So I think the comparison is fair. Anyway, thanks for the help. I think there should be a good, straightforward implementation of radix sort for strings in Julia anyway.

BTW, this is the MWE in R where I can sort 100 million length-12 strings in 18.5 seconds.

N=1e8; K=100
id3 = sample(sprintf("id%10d",1:(N/K)), N, TRUE)
pt = proc.time()
system.time(sort(id3, method="radix"))
data.table::timetaken(pt) # 18.5 seconds 

The comparable MWE in Julia currently takes 90–100 seconds using the default sorting algorithm:

const N = Int(1e8); const K = 100
id3 = rand(["id"*dec(k,10) for k in 1:(N÷K)], N)   # on Julia ≥ 1.0, dec is gone: use "id"*string(k, pad=10)

@time sort(id3)
@time sort(id3)

@tshort is right, strings are stored using a global pool in R even when they are not factors. This is not visible to the user, but data.table probably uses that property to make sorting much more efficient. So the fair comparison is indeed with PooledArray and CategoricalArray (EDIT: or Symbol).

OK. Then data.table seems slow, then… My gut feeling tells me that there are faster algorithms for arrays where we know the categories in advance.

I need to find out more about R then and in particular that aspect of strings.

This seems to be loading the bytes back to front. E.g.:

x = "abcdefgh"
y = "bbcdefgh"
unsafe_load(Ptr{UInt64}(pointer(x))) # 0x6867666564636261
unsafe_load(Ptr{UInt64}(pointer(y))) # 0x6867666564636262

Notice how the strings x and y differ in the first letter, but the outputs differ in the last two hex digits? Is this a bug or expected?

Well, this kind of array simply stores integer codes corresponding to each string, and so you can sort them just like you would sort integer columns in order to group them. Then another step can be needed to sort the result in a meaningful order if the codes are not already sorted (for example, CategoricalArray stores the order separately from the codes).
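A toy illustration of that two-step idea (made-up data; not CategoricalArray’s actual internals):

pool  = ["id001", "id002", "id003"]   # distinct values, stored in order
codes = UInt32[2, 1, 3, 1, 2]         # one integer code per element
sorted_codes = sort(codes)            # cheap integer sort
sorted_vals  = pool[sorted_codes]     # translate back using the stored order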

This is the kind of specialization DataFrames currently do not implement but definitely should.

That’s due to endianness. See also this:

julia> reinterpret(UInt32, Vector{UInt8}(x))
2-element Array{UInt32,1}:
 0x64636261
 0x68676665

julia> reinterpret(UInt64, Vector{UInt8}(x))
1-element Array{UInt64,1}:
 0x6867666564636261

Indeed, that can be problematic for sorting; I’m not sure what the best solution is. But it’s not clear to me how UInt64 will help you here, given that a string’s length is not always a multiple of 8 bytes.

If you know the possible categories in advance you can use bucket sort which is very efficient.
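A minimal sketch of that idea (bucketsort is a hypothetical helper, counting-sort style):

function bucketsort(v::Vector{String}, categories::Vector{String})
    code   = Dict(c => i for (i, c) in enumerate(categories))
    counts = zeros(Int, length(categories))
    for s in v
        counts[code[s]] += 1              # one pass to count each category
    end
    out = similar(v)
    k = 1
    for (i, c) in enumerate(categories)
        for _ in 1:counts[i]
            out[k] = c                    # emit each category in order
            k += 1
        end
    end
    out
end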

Can’t agree more! I am willing to lend a hand and make it happen.

An algorithm can load chunks shorter than 8 bytes into a UInt64 (zero-padded), then sort the strings 8 bytes at a time using a modified radix sort algorithm.
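For illustration, a hedged sketch of loading the j-th 8-byte chunk of a String into a UInt64, zero-padding short chunks and byte-swapping with ntoh so that integer order matches string order (load_chunk is a made-up name; on newer Julia you would probably also want GC.@preserve around the pointer load):

function load_chunk(s::String, j::Int)
    n  = sizeof(s)
    lo = (j - 1) * 8 + 1
    lo > n && return zero(UInt64)                     # past the end: all padding
    if lo + 7 <= n
        return ntoh(unsafe_load(Ptr{UInt64}(pointer(s)), j))
    end
    h = zero(UInt64)
    for i in lo:n
        h = (h << 8) | codeunit(s, i)                 # gather the trailing bytes
    end
    h << (8 * (8 - (n - lo + 1)))                     # left-align the short chunk
end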

Thanks for all your help and suggestions! I’ve already got a faster string sorting algorithm implemented (for 8-byte strings only atm). It’s as fast as R’s.

See WIP: faster string sort

It’s because you are running on little-endian hardware. You can do ntoh(unsafe_load(Ptr{UInt64}(pointer(x)))) if you want the bytes to represent the digits in big-endian order (from most to least significant).
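Applied to the earlier example, the byte-swapped values then compare in the expected order:

julia> ntoh(unsafe_load(Ptr{UInt64}(pointer(x))))   # x = "abcdefgh"
0x6162636465666768

julia> ntoh(unsafe_load(Ptr{UInt64}(pointer(y))))   # y = "bbcdefgh"
0x6262636465666768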