Is `Vector{UInt8}(string)` the fastest way to convert String to bytes?

Is Vector{UInt8}(string) the fastest way to convert a String to bytes? From my testing, it seems to perform OK:

using Printf   # @sprintf lives in the Printf stdlib on Julia ≥ 0.7

const M = 100_000_000; const K = 100
const s = rand([@sprintf "id%03d" k for k in 1:K], M)

function ghi(s::Vector{S}) where S<:AbstractString
    prod.(Vector{UInt8}.(s))
end

@time ghi(s)

Background

I was looking into radix sort for strings, for which there is an implementation in R. I can sort 100 million strings of varying length in 6 seconds in R, whereas sorting the same strings in Julia takes 38 seconds; see below:

const M = 100_000_000; const K = 100
idstr = rand([@sprintf "id%03d" k for k in 1:K], M)

@time sort(idstr)
@time sort(idstr)

I think being able to sort vectors of String in a performant manner is key to unlocking data manipulation performance.

That’s why I wanted to understand what’s the most performant way to make the underlying bytes available.


You can call the codeunit(s,i) function to access the i-th byte of a String directly, without creating an Array object.
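For example, using Base's codeunit and sizeof:

s = "abc"
codeunit(s, 1)   # 0x61, the byte for 'a'
codeunit(s, 3)   # 0x63, the byte for 'c'
sizeof(s)        # 3, the number of bytes (code units) in s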


It’s actually quite slow for long strings.

I dug through the source code and found that this is probably the fastest way to get at the underlying bytes:

s = "abc"
pp = pointer(s)
j = 1
Base.pointerref(pp, j, 1)

Is there an efficient way to concatenate all these bytes into something like a UInt128? I tried this approach:

function roughhash(s::AbstractString)
    pp = pointer(s)
    sz = sizeof(s)
    hh = zero(UInt128) | Base.pointerref(pp, 1, 1)
    for j = 2:min(sz, 16)
        hh = (hh << 8) | Base.pointerref(pp, j, 1)
    end
    hh
end
using BenchmarkTools   # for @benchmark

@benchmark roughhash("id0000248523")

I think I need these numbers to go down to 1–3 ns in order to build an efficient algorithm that can eventually get close to R’s string sort implementation.

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.034 ns (0.00% GC)
  median time:      12.764 ns (0.00% GC)
  mean time:        14.569 ns (0.00% GC)
  maximum time:     503.977 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Have you tried using codeunit instead of raw pointers? The codeunit function essentially does the same raw-pointer load, but in a safer way (it does a bounds check on the index, which can be turned off with @inbounds).

If you really needed fast concatenation into, say, UInt64, you could just convert to a pointer of that type and call unsafe_load:

julia> s = "foobarasdfkjas;f"
"foobarasdfkjas;f"

julia> p = Ptr{UInt64}(pointer(s))
Ptr{UInt64} @0x0000000111ddf468

julia> unsafe_load(p, 1)
0x73617261626f6f66

Obviously, you have to be extremely careful with raw pointer manipulation, but this is probably the fastest way to “concatenate” the bytes. There are lots of ways for this to break, though.

Note that this only works on String, not on AbstractString. (On AbstractString, you don’t know the encoding/representation.)

(Don’t use Base.pointerref, use the documented function unsafe_load if you must work with pointers.)

julia> function roughhash(s::String)
           n = sizeof(s)
           if n >= 8
               return unsafe_load(Ptr{UInt64}(pointer(s)))
           else
               h = zero(UInt64)
               for i = 1:n
                   @inbounds h = (h << 8) | codeunit(s, i)
               end
               return h
           end
       end

julia> @btime roughhash($("asldfj;asdjfsa"))
  3.473 ns (0 allocations: 0 bytes)
0x613b6a66646c7361

So, it is close to the 3ns you were hoping for.


IIRC when I traced through the source, codeunit makes a call to pointer. So if I go one level lower, can I save sizeof(s) calls to pointer? Or does the compiler already optimise that away? I am not that comfortable with reading lower-level code yet to figure this out for myself.

Looking at @code_native, it seems that the compiler hoists this out of the loop for you, at least with @inbounds.
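A minimal way to check this yourself (bytesum is just a made-up example loop to inspect):

function bytesum(s::String)
    t = zero(UInt)
    @inbounds for i = 1:sizeof(s)
        t += codeunit(s, i)   # same raw byte load as the pointer version
    end
    t
end

@code_native bytesum("abcdefgh")   # look for the pointer computation before, not inside, the loop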

I wonder if there’s a way to load all the bits into a BitArray even quicker? The source mentions that a BitArray is a contiguous block in memory. The >> operator is already defined, so if there’s an efficient way to mask with & or to convert the last few bits to UInt16, then all the ingredients for an efficient implementation of radix sort for strings are there.

I won’t have access to my computer for today, so I can’t try it now.

No, creating any kind of wrapper object will be slower. Besides, a bitarray doesn’t give you anything that accessing the raw bytes doesn’t, so I’m not sure what you’re hoping to gain here.

A string can be of any length, so if I load into a UInt64 I have to manually manage it with more code for strings longer than 8 bytes, etc. But if I can load into a BitArray then I don’t have to manage so much low-level detail, that’s all.

@xiaodai, you mentioned sorting in R in your first post. If I remember right, R stores strings in a global lookup table, so if you are sorting repeat values, it’ll be much faster. In Julia, a Symbol works similarly to R’s strings, so you may want to try that.
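A rough sketch of what trying Symbols might look like (whether it actually helps will depend on how Symbol comparison is implemented):

using Printf

idstr = rand([@sprintf "id%03d" k for k in 1:100], 1_000_000)
idsym = Symbol.(idstr)   # each distinct string is interned once
@time sort(idsym)        # compare against @time sort(idstr)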

See also PooledArrays for a data structure that should sort very quickly when there are many repeats:

https://github.com/JuliaComputing/PooledArrays.jl
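A hedged sketch of what that might look like (whether sort exploits the pool depends on the PooledArrays version):

using PooledArrays

pooled = PooledArray(idstr)   # stores each distinct string once plus integer refs
@time sort(pooled)            # compare against sorting the plain Vector{String}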

You might be referring to stringsAsFactors, which was the default for storing strings loaded in with, for example, read.csv. In my test example they were just strings, which is the default for the now de facto standard ways of loading data, including readr and data.table’s fread. So I think the comparison is fair. Anyway, thanks for the help. I think there should be a good, straightforward implementation of radix sort for strings in Julia anyway.

BTW, this is the MWE in R where I can sort 100 million length-12 strings in 18.5 seconds.

N=1e8; K=100
id3 = sample(sprintf("id%10d",1:(N/K)), N, TRUE)
pt = proc.time()
system.time(sort(id3, method="radix"))
data.table::timetaken(pt) # 18.5 seconds 

The comparable MWE in Julia currently takes 90–100 seconds using the default sorting algorithm:

const N = Int(1e8); const K = 100
id3 = rand(["id"*dec(k,10) for k in 1:(N÷K)], N)   # on Julia ≥ 1.0, dec is gone: use "id"*string(k, pad=10)

@time sort(id3)
@time sort(id3)

@tshort is right, strings are stored using a global pool in R even when they are not factors. This is not visible to the user, but data.table probably uses that property to make sorting much more efficient. So the fair comparison is indeed with PooledArray and CategoricalArray (EDIT: or Symbol).

OK. Then data.table seems slow, then… My gut feeling tells me that there are faster algorithms for arrays where we know the categories in advance.

I need to find out more about R then and in particular that aspect of strings.

This seems to be loading the bytes back to front. E.g.:

x = "abcdefgh"
y = "bbcdefgh"
unsafe_load(Ptr{UInt64}(pointer(x))) # 0x6867666564636261
unsafe_load(Ptr{UInt64}(pointer(y))) # 0x6867666564636262

Notice how the strings x and y differ in the first letter, but the outputs differ in the last two hex digits? Is this a bug or expected?

Well, this kind of array simply stores integer codes corresponding to each string, and so you can sort them just like you would sort integer columns in order to group them. Then another step can be needed to sort the result in a meaningful order if the codes are not already sorted (for example, CategoricalArray stores the order separately from the codes).
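A toy illustration of that two-step idea (made-up data; not CategoricalArray’s actual internals):

pool  = ["id001", "id002", "id003"]   # distinct values, stored in order
codes = UInt32[2, 1, 3, 1, 2]         # one integer code per element
sorted_codes = sort(codes)            # cheap integer sort
sorted_vals  = pool[sorted_codes]     # translate back using the stored order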

This is the kind of specialization DataFrames currently do not implement but definitely should.

That’s due to endianness. See also this:

julia> reinterpret(UInt32, Vector{UInt8}(x))
2-element Array{UInt32,1}:
 0x64636261
 0x68676665

julia> reinterpret(UInt64, Vector{UInt8}(x))
1-element Array{UInt64,1}:
 0x6867666564636261

Indeed, that can be problematic for sorting; I’m not sure what the best solution is. But it’s not clear to me how UInt64 will help you here, given that a string’s length is not always a multiple of 8 bytes.

If you know the possible categories in advance you can use bucket sort which is very efficient.
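A minimal sketch of that idea (bucketsort is a hypothetical helper, counting-sort style):

function bucketsort(v::Vector{String}, categories::Vector{String})
    code   = Dict(c => i for (i, c) in enumerate(categories))
    counts = zeros(Int, length(categories))
    for s in v
        counts[code[s]] += 1              # one pass to count each category
    end
    out = similar(v)
    k = 1
    for (i, c) in enumerate(categories)
        for _ in 1:counts[i]
            out[k] = c                    # emit each category in order
            k += 1
        end
    end
    out
end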

Can’t agree more! I am willing to lend a hand and make it happen.

An algorithm can load chunks shorter than 8 bytes into a UInt64 (zero-padded), then sort the strings 8 bytes at a time using a modified radix sort algorithm.
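For illustration, a hedged sketch of loading the j-th 8-byte chunk of a String into a UInt64, zero-padding short chunks and byte-swapping with ntoh so that integer order matches string order (load_chunk is a made-up name; on newer Julia you would probably also want GC.@preserve around the pointer load):

function load_chunk(s::String, j::Int)
    n  = sizeof(s)
    lo = (j - 1) * 8 + 1
    lo > n && return zero(UInt64)                     # past the end: all padding
    if lo + 7 <= n
        return ntoh(unsafe_load(Ptr{UInt64}(pointer(s)), j))
    end
    h = zero(UInt64)
    for i in lo:n
        h = (h << 8) | codeunit(s, i)                 # gather the trailing bytes
    end
    h << (8 * (8 - (n - lo + 1)))                     # left-align the short chunk
end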

Thanks for all your help and suggestions! I’ve already got a faster string sorting algorithm implemented (for 8-byte strings only atm). It’s as fast as R’s.

See WIP: faster string sort

It’s because you are running on little-endian hardware. You can do ntoh(unsafe_load(Ptr{UInt64}(pointer(x)))) if you want the bytes to represent the digits in big-endian order (from most to least significant).
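Applied to the earlier example, the byte-swapped values then compare in the expected order:

julia> ntoh(unsafe_load(Ptr{UInt64}(pointer(x))))   # x = "abcdefgh"
0x6162636465666768

julia> ntoh(unsafe_load(Ptr{UInt64}(pointer(y))))   # y = "bbcdefgh"
0x6262636465666768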