# Is `Vector{UInt8}(string)` the fastest way to convert String to bytes?

#1

Is `Vecotr{UInt8}(string)` the fastest way to convert String to bytes? From my testing, it seems to perform ok

``````const M = 100_000_000; const K = 100
const s = rand([@sprintf "id%03d" k for k in 1:K], M)

function ghi(s::Vector{S}) where S<:AbstractString
prod.(Vector{UInt8}.(s))
end

@time ghi(s)
``````

Background

I was looking into radixsort for strings for which there is an implementation in R. I can sort 100 million strings of varying length in 6 seconds in R. The sorting performance for strings in Julia is 38 seconds, see below

``````const M = 100_000_000; const K = 100
idstr = rand([@sprintf "id%03d" k for k in 1:K], M)

@time sort(idstr)
@time sort(idstr)
``````

I think being able to sort vectors of String in a performant manner is key to unlocking data manipulation performance.

Thatâ€™s why I wanted to understand whatâ€™s the most performant way to make the underlying bytes available.

#2

You can call the codeunit(s,i) function to access the i-th byte of a String directly, without creating an Array object.

#3

Itâ€™s actually quite slow for long strings.

#4

I digged through the source code and found that this is probably the fastest way to get to the underlying byte.

``````s = "abc"
pp = pointer(s)
j = 1
Base.pointerref(pp, j, 1)
``````

Is there an efficient way to concatenate the types up all these bytes into like a `UInt128` type? I tried this approach

``````function roughhash(s::AbstractString)
pp = pointer(s)
sz = sizeof(s)
hh = zero(UInt128) | Base.pointerref(pp, 1, 1)
for j = 2:min(sz, 16)
hh = (hh << 8) | Base.pointerref(pp, j, 1)
end
hh
end
@benchmark roughhash("id0000248523")
``````

I think I need these numbers go to 1~3ns in order to build an efficient algorithm that can eventually get close to Râ€™s string sort implementation.

``````BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     12.034 ns (0.00% GC)
median time:      12.764 ns (0.00% GC)
mean time:        14.569 ns (0.00% GC)
maximum time:     503.977 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000
``````

#5

Have you tried using `codeunit` instead of raw pointers? The `codeunit` function essentially does the same raw-pointer load, but in a safer way (it does a bounds check on the index, which can be turned off with `@inbounds`).

If you really needed fast concatenation into, say, `UInt64`, you could just convert to a pointer of that type and call `unsafe_load`:

``````julia> s = "foobarasdfkjas;f"
"foobarasdfkjas;f"

julia> p = Ptr{UInt64}(pointer(s))
Ptr{UInt64} @0x0000000111ddf468

0x73617261626f6f66
``````

Obviously, you have to be extremely careful with raw pointer manipulation, but this is probably the fastest way to â€śconcatenateâ€ť the bytes. There are lots of ways for this to break, though.

Note that this only works on `String`, not on `AbstractString`. (On `AbstractString`, you donâ€™t know the encoding/representation.)

(Donâ€™t use `Base.pointerref`, use the documented function `unsafe_load` if you must work with pointers.)

#6
``````julia>
function roughhash(s::String)
n = sizeof(s)
if n >= 8
else
h = zero(UInt64)
for i = 1:n
@inbounds h = (h << 8) | codeunit(s, i)
end
return h
end
end

julia> @btime roughhash(\$("asldfj;asdjfsa"))
3.473 ns (0 allocations: 0 bytes)
0x613b6a66646c7361
``````

So, it is close to the 3ns you were hoping for.

#7

IIRC when I traced through the source, `codeunit` makes a call to `pointer`. So if I go one level lower I can save `sizeof(s)` calls to `pointer`? Or the compiler already optimises for that? I am not that confortable with reading lower level code yet to figure this out for myself.

#8

Looking at `@code_native`, it seems that the compiler hoists this out of the loop for you, at least with `@inbounds`.

#9

I wonder if thereâ€™s a way to load all the bits into a bitarray even quicker? From the soure it mentions that BitArray is a contiguous block in memory. The `>>` operator is already defined so if thereâ€™s an efficient way to mask with `&` or convert the last few bits to `UInt16` then all the ingredients for an efficient implementation of radixsort for strings are there.

I wonâ€™t have access to my computer for today, so I canâ€™t try it now.

#10

No, creating any kind of wrapper object will be slower. Besides, a bitarray doesnâ€™t give you anything that accessing the raw bytes doesnâ€™t, so Iâ€™m not sure what youâ€™re hoping to gain here.

#11

A string can be of any length. So if i load into UInt64 then I have to
manually manage it using more code for string longer than size 8 etc. But
if i can load into bitarray then I donâ€™t have to manage so much lower level
detail thatâ€™s all.

#12

@xiaodai, you mentioned sorting in R in your first post. If I remember right, R stores strings in a global lookup table, so if you are sorting repeat values, itâ€™ll be much faster. In Julia, a `Symbol` works similarly to Râ€™s strings, so you may want to try that.

See also PooledArrays for a data structure that should sort very quickly when there are many repeats:

#13

You might be referring to `string.as.factor` which was the default for storing `string` loaded in using for example `read.csv`. In my testing example, they were just strings which is the default for the now de facto standards of loading in data including readr and data.tableâ€™s `fread`. So I think the comparison is fair. Anyway thanks for the help. I think there should be a good straight implementation of radixsort in Julia for strings anyway.

BTW this is the MWE in R where I can sort 100 million length 12 strings in 18.5 seconds.

``````N=1e8; K=100
id3 = sample(sprintf("id%10d",1:(N/K)), N, TRUE)
pt = proc.time()
data.table::timetaken(pt) # 18.5 seconds
``````

The comparable MWE in Julia currently takes 90~100 seconds using the default sorting algorithm

``````const N = Int(1e8); const K  = 100
id3 = rand(["id"*dec(k,10) for k in 1:(NĂ·K)], N)

@time sort(id3)
@time sort(id3)
``````

#14

@tshort is right, strings are stored using a global pool in R even when they are not factors. This is not visible to the user, but data.table probably uses that property to make sorting much more efficient. So the fair comparison is indeed with `PooledArray` and `CategoricalArray` (EDIT: or `Symbol`).

#15

Ok. Then data.table seem slow thenâ€¦ My gut-feel tells me that there are faster algorithms for arrays that we know of thec categories in advance.

I need to find out more about R then and in particular that aspect of strings.

#16

``````x= "abcdefgh"
y ="bbcdefgh"
skipbytes=0
T=UInt
``````

notice how strings `x` and `y` differ in the first letter but the outputs differ in the last 2 digits? Is this a bug or expected?

#17

Well, this kind of array simply stores integer codes corresponding to each string, and so you can sort them just like you would sort integer columns in order to group them. Then another step can be needed to sort the result in a meaningful order if the codes are not already sorted (for example, `CategoricalArray` stores the order separately from the codes).

This is the kind of specialization DataFrames currently do not implement but definitely should.

``````julia> reinterpret(UInt32, Vector{UInt8}(x))
2-element Array{UInt32,1}:
0x64636261
0x68676665

julia> reinterpret(UInt64, Vector{UInt8}(x))
1-element Array{UInt64,1}:
0x6867666564636261

``````

Indeed that can be problematic for sorting, not sure whatâ€™s the best solution. But itâ€™s not clear to me how `UInt64` will help you here, given that a string length is not always a multiple of 8 bytes.

#18

If you know the possible categories in advance you can use bucket sort which is very efficient.

Canâ€™t agree more! I am willing to lend a hand and make it happen.

An algorithm can load bytes shorter than 8 into UInt64, then just sort the string 8 bytes at a time using a modified radix sort algorithm.

#19

Thanks for all your help and suggestion! I got a faster string sorting algorithm (for 8 byte string only atm) implemented already! Itâ€™s as fast as Râ€™s.

#20

Itâ€™s because you are running on little-endian hardware. You can do `ntoh(unsafe_load(Ptr{UInt64}(pointer(x))))` if you want the bytes to represent the digits in bigendian order (from most to least significant).