Is Vector{UInt8}(string) the fastest way to convert a String to bytes? From my testing, it seems to perform OK:
using Printf

const M = 100_000_000; const K = 100
const s = rand([@sprintf("id%03d", k) for k in 1:K], M)

# convert each string to its byte vector and reduce it, just to force the conversion
function ghi(s::Vector{S}) where S<:AbstractString
    prod.(Vector{UInt8}.(s))
end

@time ghi(s)
Background
I was looking into radix sort for strings, for which there is an implementation in R. In R I can sort 100 million strings of varying length in 6 seconds. Sorting strings in Julia takes 38 seconds; see below.
using Printf

const M = 100_000_000; const K = 100
idstr = rand([@sprintf("id%03d", k) for k in 1:K], M)

@time sort(idstr)   # first call includes compilation
@time sort(idstr)
I think being able to sort vectors of String quickly is key to unlocking data-manipulation performance. That's why I wanted to understand the most performant way to make the underlying bytes available.
Have you tried using codeunit instead of raw pointers? The codeunit function essentially does the same raw-pointer load, but in a safer way (it does a bounds check on the index, which can be turned off with @inbounds).
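For instance, something like this quick sketch (bytesum is just a made-up example to show the pattern; no Vector{UInt8} copy is allocated):

using BenchmarkTools

# walk the bytes of a String via codeunit, with the bounds check turned off
function bytesum(s::String)
    h = zero(UInt64)
    @inbounds for i in 1:sizeof(s)
        h += codeunit(s, i)
    end
    return h
end

@btime bytesum($("id001"))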
If you really needed fast concatenation into, say, UInt64, you could just convert to a pointer of that type and call unsafe_load:
julia> s = "foobarasdfkjas;f"
"foobarasdfkjas;f"
julia> p = Ptr{UInt64}(pointer(s))
Ptr{UInt64} @0x0000000111ddf468
julia> unsafe_load(p, 1)
0x73617261626f6f66
Obviously, you have to be extremely careful with raw pointer manipulation, but this is probably the fastest way to “concatenate” the bytes. There are lots of ways for this to break, though.
Note that this only works on String, not on AbstractString. (On AbstractString, you don’t know the encoding/representation.)
(Don’t use Base.pointerref, use the documented function unsafe_load if you must work with pointers.)
julia> function roughhash(s::String)
           n = sizeof(s)
           if n >= 8
               return unsafe_load(Ptr{UInt64}(pointer(s)))
           else
               h = zero(UInt64)
               for i = 1:n
                   @inbounds h = (h << 8) | codeunit(s, i)
               end
               return h
           end
       end
julia> @btime roughhash($("asldfj;asdjfsa"))
3.473 ns (0 allocations: 0 bytes)
0x613b6a66646c7361
IIRC when I traced through the source, codeunit makes a call to pointer. So if I go one level lower, can I save sizeof(s) calls to pointer? Or does the compiler already optimise for that? I am not that comfortable with reading lower-level code yet, so I can't figure this out for myself.
I wonder if there's a way to load all the bits into a BitArray even quicker? The source mentions that a BitArray is a contiguous block in memory. The >> operator is already defined, so if there's an efficient way to mask with & or convert the last few bits to UInt16, then all the ingredients for an efficient implementation of radix sort for strings are there.
I won’t have access to my computer for today, so I can’t try it now.
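But roughly, this is the kind of shift-and-mask digit extraction I have in mind (an untested sketch; radixdigit is just a name I'm making up):

# pull out the b-th byte of a UInt64 word with >> and a truncating conversion,
# to use as a radix-sort digit (b = 1 is the least significant byte)
radixdigit(x::UInt64, b::Int) = (x >> (8 * (b - 1))) % UInt8

x = unsafe_load(Ptr{UInt64}(pointer("foobarasdfkjas;f")))
radixdigit(x, 1)   # 0x66, i.e. 'f', the first byte of the string on little-endian hardware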
No, creating any kind of wrapper object will be slower. Besides, a bitarray doesn’t give you anything that accessing the raw bytes doesn’t, so I’m not sure what you’re hoping to gain here.
A string can be of any length, so if I load into a UInt64 I have to manually manage the extra code for strings longer than 8 bytes, etc. But if I can load into a BitArray, I don't have to manage so much low-level detail, that's all.
@xiaodai, you mentioned sorting in R in your first post. If I remember right, R stores strings in a global lookup table, so if you are sorting repeat values, it’ll be much faster. In Julia, a Symbol works similarly to R’s strings, so you may want to try that.
See also PooledArrays for a data structure that should sort very quickly when there are many repeats.
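For example, a quick sketch (not benchmarked; it assumes PooledArrays is installed, and whether sort has a specialized fast method for PooledArray may depend on the package version):

using PooledArrays

# idstr as defined above: 100 million strings drawn from only 100 unique values
@time sort(Symbol.(idstr))         # Symbols are interned, similar to R's strings
@time sort(PooledArray(idstr))     # stores small integer codes plus a pool of unique values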
You might be referring to stringsAsFactors, which used to make strings loaded with, for example, read.csv default to factors. In my test example they were just strings, which is the default for the now de facto standard data-loading tools, including readr and data.table's fread. So I think the comparison is fair. Anyway, thanks for the help. I still think there should be a good, straightforward implementation of radix sort for strings in Julia.
BTW, this is the MWE in R where I can sort 100 million length-12 strings in 18.5 seconds.
@tshort is right, strings are stored using a global pool in R even when they are not factors. This is not visible to the user, but data.table probably uses that property to make sorting much more efficient. So the fair comparison is indeed with PooledArray and CategoricalArray (EDIT: or Symbol).
Well, this kind of array simply stores integer codes corresponding to each string, so you can sort them just like you would sort integer columns in order to group them. Another step may then be needed to sort the result in a meaningful order if the codes are not already sorted (for example, CategoricalArray stores the order separately from the codes).
This is the kind of specialization DataFrames currently do not implement but definitely should.
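To make the idea concrete, here is a rough sketch (it assumes the internal refs/pool fields of PooledArray and that the pool is already in sorted order; it is not what DataFrames actually does):

using PooledArrays

# counting sort on the small integer codes, then map each code back to its string
function sort_by_codes(pa::PooledArray)
    counts = zeros(Int, length(pa.pool))
    for r in pa.refs
        counts[r] += 1
    end
    out = Vector{eltype(pa)}(undef, length(pa))
    i = 1
    for code in 1:length(counts)
        v = pa.pool[code]
        for _ in 1:counts[code]
            out[i] = v
            i += 1
        end
    end
    return out
end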
Indeed, that can be problematic for sorting; I'm not sure what the best solution is. But it's not clear to me how UInt64 will help you here, given that a string's length is not always a multiple of 8 bytes.
Thanks for all your help and suggestions! I've already implemented a faster string sorting algorithm (for 8-byte strings only, at the moment). It's as fast as R's.
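Roughly, the approach looks like this (a simplified sketch, not my exact code): load each string's 8 bytes into a UInt64 key and run an LSD radix sort on those keys, carrying the strings along.

# Sketch only: valid for strings that are exactly 8 bytes long.
# ntoh makes numeric key order match lexicographic byte order on little-endian machines.
function radixsort8(svec::Vector{String})
    keys = [ntoh(unsafe_load(Ptr{UInt64}(pointer(s)))) for s in svec]
    n = length(keys)
    strs = copy(svec)
    buf_keys = similar(keys)
    buf_strs = similar(strs)
    counts = Vector{Int}(undef, 256)
    for pass in 0:7                              # LSD: least significant byte first
        shift = 8 * pass
        fill!(counts, 0)
        for k in keys
            counts[Int((k >> shift) % UInt8) + 1] += 1
        end
        for j in 2:256                           # prefix sums give bucket end positions
            counts[j] += counts[j-1]
        end
        for i in n:-1:1                          # stable scatter from the back
            d = Int((keys[i] >> shift) % UInt8) + 1
            pos = counts[d]
            counts[d] -= 1
            buf_keys[pos] = keys[i]
            buf_strs[pos] = strs[i]
        end
        keys, buf_keys = buf_keys, keys
        strs, buf_strs = buf_strs, strs
    end
    return strs                                  # after 8 passes, strs holds the sorted strings
end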
It's because you are running on little-endian hardware. You can do ntoh(unsafe_load(Ptr{UInt64}(pointer(x)))) if you want the bytes to represent the digits in big-endian order (from most to least significant).
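For example, with the string from earlier in the thread:

julia> s = "foobarasdfkjas;f"
"foobarasdfkjas;f"
julia> unsafe_load(Ptr{UInt64}(pointer(s)))          # little-endian load: the first byte ends up least significant
0x73617261626f6f66
julia> ntoh(unsafe_load(Ptr{UInt64}(pointer(s))))    # byte-swapped: 0x66 == UInt8('f') is now most significant
0x666f6f6261726173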