Convert String to Byte Array using two bytes per character

Hi guys!

I need to convert a string to a byte array, always using two bytes per character. For example, in Julia, I have:

julia> Vector{UInt8}("Teste")
5-element Array{UInt8,1}:
 0x54
 0x65
 0x73
 0x74
 0x65

However, in C++ (using QString), I get the following sequence of bytes for the same string:

00 54 00 65 00 73 00 74 00 65

Is there an easy way to do this in Julia?

From what appears in the documentation, the limit is the UInt8 value 0x80; from there on, it generates two bytes per character. Look at this, for example:

julia> Vector{UInt8}("Teste±")
7-element Array{UInt8,1}:
 0x54
 0x65
 0x73
 0x74
 0x65
 0xc2
 0xb1

Just to expand on this: in Julia, strings are internally encoded as UTF-8, which is a variable-length encoding, meaning each character is encoded with a variable number of bytes (one to four). Unicode code points from 0x80 upward are the first to need two bytes, which is why only those show up with two bytes.
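To make the variable-length point concrete, here is a small sketch that counts the UTF-8 bytes of a few characters. The sample characters and the name `utf8_sizes` are only chosen for illustration.

```julia
# Number of UTF-8 code units (bytes) needed for a few sample characters.
# 'T' is ASCII, '±' is U+00B1, '€' is U+20AC, and '😀' is U+1F600.
sample_chars = ['T', '±', '€', '😀']
utf8_sizes   = [ncodeunits(string(c)) for c in sample_chars]

for (c, n) in zip(sample_chars, utf8_sizes)
    println(repr(c), " uses ", n, " byte(s) in UTF-8")
end
```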

I’m not familiar with either C++ or QString, but my best guess for why those null bytes get inserted is that the representation chosen there is actually multiple null-terminated single-character strings, each of which may or may not have multiple bytes itself. If my hunch is correct, testing the string given by @longemen3000 should add 00 c2 b1 to your output.

That’s because QString is using the UTF-16 encoding of Unicode, whereas Julia uses UTF-8. UTF-16 is two bytes for most characters, but for some characters it is 4 bytes; wrongly assuming it is 2 bytes for every character is a common source of subtle bugs. (A 16-bit QChar is one “code unit” but is not one Unicode codepoint in general.)

You can convert a Julia string to native-endian UTF-16 using transcode(UInt16, somestring)
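As a quick sketch of what that call returns (the variable names here are just for illustration): characters in the Basic Multilingual Plane take one 16-bit code unit each, while a character outside it, such as an emoji, takes two.

```julia
str   = "Teste±"
utf16 = transcode(UInt16, str)      # Vector{UInt16}, one code unit per character here

# A character outside the Basic Multilingual Plane needs a surrogate pair,
# i.e. two 16-bit code units (4 bytes).
emoji_utf16 = transcode(UInt16, "😀")

# The conversion is reversible.
roundtrip = transcode(String, utf16)

println(length(utf16), " code units for ", length(str), " characters")
println(length(emoji_utf16), " code units for 1 emoji")
```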

Thanks @stevengj!

Following the suggestion of @stevengj, and just for the record, the following functions can be used to encode a Julia string into something that can be converted to a QString very easily using QDataStream:

"""
    convert_to_byte_array(x)

Convert the data `x` into an array of bytes.

"""
function convert_to_byte_array(x::T) where T<:Union{Float16,Float32,Float64,
                                                    Signed,Unsigned,Bool}
    iob = IOBuffer()
    write(iob, x)
    seekstart(iob)
    return read(iob)
end

function convert_to_byte_array(x::T) where T<:Union{Vector{Float16},
                                                    Vector{Float32},
                                                    Vector{Float64},
                                                    Vector{Int8},
                                                    Vector{Int16},
                                                    Vector{Int32},
                                                    Vector{Int64},
                                                    Vector{UInt8},
                                                    Vector{UInt16},
                                                    Vector{UInt32},
                                                    Vector{UInt64},
                                                    Vector{Bool}}
    num_elems   = length(x)
    byte_arrays = Vector{Vector{UInt8}}(undef, num_elems)
    num_bytes   = 0

    # Convert each element to a byte array.
    @inbounds for i = 1:num_elems
        ba              = convert_to_byte_array(x[i])
        num_bytes      += length(ba)
        byte_arrays[i]  = ba
    end

    # This is similar to `vcat(byte_arrays...)`, but faster.
    byte_array = Vector{UInt8}(undef, num_bytes)
    ind        = 1

    @inbounds for i = 1:num_elems
        ba = byte_arrays[i]

        for j = 1:length(ba)
            byte_array[ind] = ba[j]
            ind += 1
        end
    end

    return byte_array
end

function convert_to_byte_array(str::String)
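    # Here, we will use the same format that `QDataStream` uses for `QString`:
    #
    #  |-----   UInt32   -----|
    #  |   Number of bytes    | String encoded using UTF-16 |
    #
    # Note that the prefix is the length of the encoded data in bytes, not the
    # number of characters.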

    str_utf16 = transcode(UInt16, str)
    iob = IOBuffer()
    write(iob, str_utf16)
    seekstart(iob)
    str_encoded = read(iob)
    str_size = UInt32(length(str_encoded))

    return vcat(convert_to_byte_array(str_size),str_encoded)
end

If this is transmitted over TCP to a Qt application, you just need to do the following to read the string:

QString str;
ds >> str;

where ds is a QDataStream. Just remember to set its byte order to little endian (the QDataStream default is big endian)!

First, if you want to convert to a byte array, you can just do reinterpret(UInt8, transcode(UInt16, str)).
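A minimal sketch of that one-liner follows. The `hton` step is an addition for illustration: `transcode` yields native-endian code units, so on a little-endian machine the bytes come out swapped relative to the big-endian sequence shown in the question.

```julia
str   = "Teste"
utf16 = transcode(UInt16, str)

# Native-endian bytes; the order within each pair depends on the host CPU.
native_bytes = collect(reinterpret(UInt8, utf16))

# Big-endian ("network order") bytes, independent of the host CPU.
# Use `htol` instead of `hton` if the receiver expects little endian.
be_bytes = collect(reinterpret(UInt8, hton.(utf16)))

println(be_bytes)
```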

Second, I’m not sure why you need to convert it to a byte array. You can use write to directly output a UInt16 array to a file or other stream, for example.

Good! I have been using this code since around Julia 0.3 or 0.4, and at that time it was the only way to do this, IIRC. I will change to your suggestion, since it is much better.

As for why I need a byte array: I have to merge the string with some other things to build the protocol message (checksum, headers, configuration bits, etc.).

You still don’t need to convert things individually to byte arrays first. Just write each part of the protocol in sequence to a given stream (an IOBuffer if you want to get the whole message as a buffer/array). Don’t convert to separate byte arrays and concatenate them afterwards.
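Something along these lines, as a rough sketch. The message layout, the XOR checksum, and the name `build_message` are all made up for illustration and are not the actual protocol from this thread.

```julia
# Hypothetical layout (an assumption for illustration only):
#   UInt8 header | UInt32 payload size in bytes | UTF-16 payload | UInt8 checksum
# All multi-byte values are written in little-endian order.
function build_message(header::UInt8, text::String)
    payload = transcode(UInt16, text)

    io = IOBuffer()
    write(io, header)                          # 1 byte
    write(io, htol(UInt32(sizeof(payload))))   # payload size in bytes
    write(io, htol.(payload))                  # UTF-16 code units
    msg = take!(io)                            # everything written so far

    checksum = reduce(xor, msg)                # toy checksum: XOR of all bytes
    push!(msg, checksum)
    return msg
end

msg = build_message(0x01, "Te")
println(msg)
```

Each part goes straight into the same buffer, so there are no intermediate per-field byte arrays to concatenate.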

Note also that QString has a method to construct it from UTF-8, so you could also do the conversion on the other end.
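On the Julia side that would mean sending the string's UTF-8 bytes unchanged. A minimal sketch, assuming the receiver is told the byte count by some other part of the protocol:

```julia
str = "Teste±"

# Julia strings are already UTF-8, so no transcoding is needed.
utf8_bytes = Vector{UInt8}(codeunits(str))

# Decoding a copy of the bytes gives the original string back.
decoded = String(copy(utf8_bytes))

println(length(utf8_bytes), " bytes for ", length(str), " characters")
```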

Excellent tip! Thanks @stevengj, you were very helpful 🙂