Convert String to Byte Array using two bytes per character

Ronis_BR · May 3, 2019, 6:43pm

Hi guys!

I need to convert a string to a byte array, but always using two bytes per character. For example, in Julia, I have:

julia> Vector{UInt8}("Teste")
5-element Array{UInt8,1}:
 0x54
 0x65
 0x73
 0x74
 0x65

However, in C++ (using QString), I get the following sequence of bytes for the same string:

00 54 00 65 00 73 00 74 00 65

Is there an easy way to do this in Julia?

longemen3000 · May 3, 2019, 7:48pm

From what appears in the documentation, the limit is the UInt8 value 0x80: from that value on, it generates two (or more) bytes per character. Look at this, for example:

julia> Vector{UInt8}("Teste±")
7-element Array{UInt8,1}:
 0x54
 0x65
 0x73
 0x74
 0x65
 0xc2
 0xb1

Sukera · May 3, 2019, 8:47pm

Just to expand on this, it’s because in Julia, strings are internally encoded as UTF-8, which is a variable-length encoding, meaning each character is encoded by a variable number of bytes. Unicode code points from 0x80 upward are the first to need two bytes, so that’s why only those show up with two bytes.

I’m not familiar with either C++ or QString, but my best bet for why those null bytes get inserted there is that the representation chosen there is actually multiple null-terminated single-character strings, which may or may not have multiple bytes themselves. If my hunch is correct, testing the string given by @longemen3000 should add 00 c2 b1 to your output.
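To make the variable-length behaviour concrete, here is a small sketch that counts the UTF-8 bytes Julia stores for a few one-character strings. The helper name `utf8_len` is made up for this illustration; `sizeof` and `codeunits` are the standard Base functions.

```julia
# Hypothetical helper: number of UTF-8 bytes used to store a string.
utf8_len(s::AbstractString) = sizeof(String(s))

utf8_len("e")      # 1 byte:  U+0065 is below 0x80
utf8_len("\u7f")   # 1 byte:  last code point that fits in one byte
utf8_len("\u80")   # 2 bytes: first code point that needs two
utf8_len("±")      # 2 bytes: U+00B1
utf8_len("€")      # 3 bytes: U+20AC
utf8_len("😀")     # 4 bytes: U+1F600

# The raw bytes of a string can be inspected with `codeunits`.
plus_minus_bytes = collect(codeunits("±"))   # UInt8[0xc2, 0xb1]
```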

stevengj · May 3, 2019, 9:15pm

That’s because QString is using the UTF-16 encoding of Unicode, whereas Julia uses UTF-8. UTF-16 is two bytes for most characters, but for some characters it is 4 bytes; wrongly assuming it is 2 bytes for every character is a common source of subtle bugs. (A 16-bit QChar is one “code unit” but is not one Unicode codepoint in general.)

You can convert a Julia string to native-endian UTF-16 using transcode(UInt16, somestring)
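As a quick illustration of both points (a sketch, not a complete treatment of UTF-16): `transcode` returns one `UInt16` code unit per character for ordinary text, but a character outside the Basic Multilingual Plane becomes a surrogate pair, i.e. two code units and four bytes.

```julia
# One 16-bit code unit per character for plain ASCII text.
utf16 = transcode(UInt16, "Teste")
# utf16 == UInt16[0x0054, 0x0065, 0x0073, 0x0074, 0x0065]

# A character above U+FFFF is encoded as a surrogate pair (two code units).
emoji = transcode(UInt16, "😀")
# emoji == UInt16[0xd83d, 0xde00]

# The conversion is reversible.
back = transcode(String, utf16)   # "Teste"
```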

Ronis_BR · May 3, 2019, 10:05pm

Thanks @stevengj!

Ronis_BR · May 4, 2019, 1:57pm

With the suggestion of @stevengj, and just for the record, the following function can be used to encode a Julia string into something that can be converted to QString very easily using QDataStream:

"""
    convert_to_byte_array(x)

Convert the data `x` into an array of bytes.
"""
function convert_to_byte_array(x::T) where T<:Union{Float16,Float32,Float64,
                                                    Signed,Unsigned,Bool}
    iob = IOBuffer()
    write(iob, x)
    seekstart(iob)
    return read(iob)
end

function convert_to_byte_array(x::T) where T<:Union{Vector{Float16},
                                                    Vector{Float32},
                                                    Vector{Float64},
                                                    Vector{Int8},
                                                    Vector{Int16},
                                                    Vector{Int32},
                                                    Vector{Int64},
                                                    Vector{UInt8},
                                                    Vector{UInt16},
                                                    Vector{UInt32},
                                                    Vector{UInt64},
                                                    Vector{Bool}}
    num_elems   = length(x)
    byte_arrays = Vector{Vector{UInt8}}(undef, num_elems)
    num_bytes   = 0

    # Convert each element to a byte array.
    @inbounds for i = 1:num_elems
        ba              = convert_to_byte_array(x[i])
        num_bytes      += length(ba)
        byte_arrays[i]  = ba
    end

    # This is similar to `vcat(byte_arrays...)`, but faster.
    byte_array = Vector{UInt8}(undef, num_bytes)
    ind        = 1

    @inbounds for i = 1:num_elems
        ba = byte_arrays[i]

        for j = 1:length(ba)
            byte_array[ind] = ba[j]
            ind += 1
        end
    end

    return byte_array
end

function convert_to_byte_array(str::String)
    # Here, we will use the same format as `QString`, which is:
    #
    #  |-----  UInt32  -----|
    #  |   Size in bytes    | String encoded using UTF-16 |

    str_utf16 = transcode(UInt16, str)
    iob = IOBuffer()
    write(iob, str_utf16)
    seekstart(iob)
    str_encoded = read(iob)
    str_size = UInt32(length(str_encoded))

    return vcat(convert_to_byte_array(str_size),str_encoded)
end

If this is transmitted through TCP to a Qt application, you will just need to do the following to read the string:

QString str;
ds >> str;

where ds is a QDataStream. Just remember to set its byte order to little endian!

stevengj · May 4, 2019, 5:41pm

First, if you want to convert to a byte array, you can just do reinterpret(UInt8, transcode(UInt16, str)).
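A minimal sketch of that one-liner, wrapped in helpers whose names (`utf16_bytes`, `utf16_bytes_be`, `utf16_bytes_le`) are made up for this example. Since `reinterpret` exposes the bytes in the host's native order, the sketch also shows how `hton` and `htol` from Base can pin the byte order explicitly, assuming the receiver expects a fixed endianness (the QString dump in the question shows the high byte first).

```julia
# Native-endian bytes: byte order depends on the machine running the code.
utf16_bytes(str) = collect(reinterpret(UInt8, transcode(UInt16, str)))

# Explicit big-endian (high byte first), independent of the host.
utf16_bytes_be(str) = collect(reinterpret(UInt8, hton.(transcode(UInt16, str))))

# Explicit little-endian (low byte first), independent of the host.
utf16_bytes_le(str) = collect(reinterpret(UInt8, htol.(transcode(UInt16, str))))

be = utf16_bytes_be("Te")   # UInt8[0x00, 0x54, 0x00, 0x65]
le = utf16_bytes_le("Te")   # UInt8[0x54, 0x00, 0x65, 0x00]
```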

Second, I’m not sure why you need to convert it to a byte array. You can use write to directly output a UInt16 array to a file or other stream, for example.
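For instance, a sketch using an in-memory `IOBuffer` as a stand-in for a file or socket: `write` accepts the `UInt16` array as is and emits its raw native-endian bytes, so no intermediate byte array is needed.

```julia
io = IOBuffer()   # stand-in for any writable stream (file, socket, ...)

# `write` returns the number of bytes written: 5 code units * 2 bytes each.
nbytes = write(io, transcode(UInt16, "Teste"))

# Only needed here to inspect what ended up in the stream.
bytes = take!(io)
```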

Ronis_BR · May 4, 2019, 7:10pm

Good! I have been using this code since about Julia 0.3 or 0.4, and that was the only way to do this at the time, IIRC. I will change to your suggestion since it is way better.

As for why I need a byte array: I need to merge it with some other things to create the protocol (checksum, headers, configuration bits, etc.).

stevengj · May 4, 2019, 7:36pm

You still don’t need to convert things individually to byte arrays first. Just write each part of the protocol in sequence to a given stream (an IOBuffer if you want to get the whole message as a buffer/array). Don’t convert to separate byte arrays and concatenate them afterwards.
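A rough sketch of that approach follows. The frame layout (one header byte, a `UInt32` payload size, the UTF-16 payload, then a one-byte additive checksum), the header value `0xA5`, and the function name `build_message` are all invented for illustration; the actual protocol in this thread is not specified.

```julia
# Hypothetical frame layout, for illustration only:
#   header (1 byte) | payload size in bytes (UInt32) | UTF-16 payload | checksum (1 byte)
function build_message(str::AbstractString; header::UInt8 = 0xA5)
    # UTF-16 code units, forced to little-endian regardless of the host.
    payload = htol.(transcode(UInt16, str))

    # Made-up checksum: sum of the payload bytes, truncated to 8 bits.
    checksum = sum(reinterpret(UInt8, payload)) % UInt8

    # Write every field in sequence to a single buffer.
    io = IOBuffer()
    write(io, header)
    write(io, htol(UInt32(sizeof(payload))))
    write(io, payload)
    write(io, checksum)

    return take!(io)   # the whole message as a Vector{UInt8}
end

msg = build_message("Te")
# msg == UInt8[0xa5, 0x04, 0x00, 0x00, 0x00, 0x54, 0x00, 0x65, 0x00, 0xb9]
```

Passing an open socket instead of creating the `IOBuffer` inside the function would send the same bytes without building the array at all.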

Note also that QString has a method to construct it from UTF-8, so you could also do the conversion on the other end.
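On the Julia side that option needs no transcoding at all, since writing a `String` to a stream emits its UTF-8 bytes directly. A sketch, assuming a made-up framing in which a little-endian `UInt32` byte count precedes the text so the receiver knows how much to read (`write_utf8_field` is a hypothetical name):

```julia
# Hypothetical framing: UInt32 byte count (little-endian), then raw UTF-8 bytes.
function write_utf8_field(io::IO, str::String)
    n  = write(io, htol(UInt32(ncodeunits(str))))   # size prefix
    n += write(io, str)                             # UTF-8 bytes, no conversion
    return n                                        # total bytes written
end

io_utf8 = IOBuffer()
written = write_utf8_field(io_utf8, "Teste±")
field   = take!(io_utf8)
# field == UInt8[0x07, 0x00, 0x00, 0x00, 0x54, 0x65, 0x73, 0x74, 0x65, 0xc2, 0xb1]
```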

Ronis_BR · May 4, 2019, 7:40pm

Excellent tip! Thanks @stevengj, you were very helpful
