Just to expand on this, it’s because in Julia, strings are internally encoded as UTF-8, which is a variable-length encoding, meaning each character is encoded by a variable number of bytes. Unicode code points from 0x80 upward (through 0x7FF) are the first to be encoded using two bytes, so that’s why only those show up with two bytes.
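For illustration, here is a small sketch of how the UTF-8 byte count grows with the code point. The sample characters are my own choice, not taken from the original question:

```julia
# Number of UTF-8 bytes (code units) Julia stores for a few one-character strings.
ascii_len = ncodeunits("a")    # U+0061, below 0x80
latin_len = ncodeunits("±")    # U+00B1, in the range 0x80..0x7FF
euro_len  = ncodeunits("€")    # U+20AC, in the range 0x800..0xFFFF

# Raw bytes of "±" as stored in the String.
plus_minus = collect(codeunits("±"))

println((ascii_len, latin_len, euro_len))   # (1, 2, 3)
println(plus_minus)                         # UInt8[0xc2, 0xb1]
```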
I’m not familiar with either C++ or QString, but my best guess for why those null bytes get inserted is that the representation chosen there is actually multiple null-terminated single-character strings, each of which may or may not span multiple bytes. If my hunch is correct, testing the string given by @longemen3000 should add 00 c2 b1 to your output.
That’s because QString is using the UTF-16 encoding of Unicode, whereas Julia uses UTF-8. UTF-16 is two bytes for most characters, but for some characters (those outside the Basic Multilingual Plane) it is 4 bytes; wrongly assuming it is 2 bytes for every character is a common source of subtle bugs. (A 16-bit QChar is one “code unit” but is not one Unicode code point in general.)
You can convert a Julia string to native-endian UTF-16 using transcode(UInt16, somestring).
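As a rough sketch of what that call returns (the sample strings are my own, chosen so that one of them needs a surrogate pair):

```julia
# UTF-16 code units for a BMP-only string and for one character outside the BMP.
bmp    = transcode(UInt16, "a±")        # two characters, two code units
astral = transcode(UInt16, "\U1F600")   # one character (U+1F600), two code units

println(bmp)      # UInt16[0x0061, 0x00b1]
println(astral)   # UInt16[0xd83d, 0xde00]

# Round trip back to a Julia (UTF-8) string.
roundtrip = transcode(String, astral)
```

The second string is a single character to Julia (its length is 1) but takes two 16-bit code units, which is exactly the case where "2 bytes per character" breaks.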
Following the suggestion of @stevengj, and just for the record, the following function can be used to encode a Julia string into something that can be converted to a QString very easily using QDataStream:
"""
    convert_to_byte_array(x)

Convert the data `x` into an array of bytes.
"""
function convert_to_byte_array(x::T) where T<:Union{Float16,Float32,Float64,
                                                    Signed,Unsigned,Bool}
    iob = IOBuffer()
    write(iob, x)
    seekstart(iob)
    return read(iob)
end
function convert_to_byte_array(x::T) where T<:Union{Vector{Float16},
                                                    Vector{Float32},
                                                    Vector{Float64},
                                                    Vector{Int8},
                                                    Vector{Int16},
                                                    Vector{Int32},
                                                    Vector{Int64},
                                                    Vector{UInt8},
                                                    Vector{UInt16},
                                                    Vector{UInt32},
                                                    Vector{UInt64},
                                                    Vector{Bool}}
    num_elems   = length(x)
    byte_arrays = Vector{Vector{UInt8}}(undef, num_elems)
    num_bytes   = 0

    # Convert each element to a byte array.
    @inbounds for i = 1:num_elems
        ba = convert_to_byte_array(x[i])
        num_bytes += length(ba)
        byte_arrays[i] = ba
    end

    # This is similar to `vcat(byte_arrays...)`, but faster.
    byte_array = Vector{UInt8}(undef, num_bytes)
    ind = 1
    @inbounds for i = 1:num_elems
        ba = byte_arrays[i]
        for j = 1:length(ba)
            byte_array[ind] = ba[j]
            ind += 1
        end
    end

    return byte_array
end
function convert_to_byte_array(str::String)
    # Here, we will use the same format as `QString`, which is:
    #
    #   |----- UInt32 -----|
    #   | Length in bytes  | String encoded using UTF-16 |
    #
    # Note that the prefix is the size of the UTF-16 data in bytes, not the
    # number of characters.
    str_utf16 = transcode(UInt16, str)
    iob = IOBuffer()
    write(iob, str_utf16)
    seekstart(iob)
    str_encoded = read(iob)
    str_size = UInt32(length(str_encoded))
    return vcat(convert_to_byte_array(str_size), str_encoded)
end
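If this is transmitted through TCP to a Qt application, you just need to do the following to read the string:
    QString str;
    ds >> str;
in which ds is a QDataStream. Just remember to set its byte order to little endian, e.g. with ds.setByteOrder(QDataStream::LittleEndian), since QDataStream defaults to big endian!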
First, if you want to convert to a byte array, you can just do reinterpret(UInt8, transcode(UInt16, str)).
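A minimal sketch of that one-liner follows. The `htol` call is my addition: it pins the byte order to little endian regardless of the host, and is a no-op on little-endian machines.

```julia
str   = "a±"
units = transcode(UInt16, str)             # UTF-16 code units, native byte order
bytes = reinterpret(UInt8, htol.(units))   # view the code units as little-endian bytes

# `bytes` is a reinterpreted view; collect it if you need a plain Vector{UInt8}.
byte_vec = collect(bytes)
println(byte_vec)   # UInt8[0x61, 0x00, 0xb1, 0x00]
```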
Second, I’m not sure why you need to convert it to a byte array. You can use write to directly output a UInt16 array to a file or other stream, for example.
Good! I have been using this code since around 0.3 or 0.4, and that was the only way to do this at the time, IIRC. I will change to your suggestion since it is much better.
I convert to a byte array because I need to merge it with some other things to build the protocol message (checksum, headers, configuration bits, etc.).
You still don’t need to convert things individually to byte arrays first. Just write each part of the protocol in sequence to a given stream (an IOBuffer if you want to get the whole message as a buffer/array). Don’t convert to separate byte arrays and concatenate them afterwards.
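Something along these lines, as a sketch only. The message layout, the `build_message` name, the `header` value and the XOR checksum are all hypothetical, made up to illustrate the pattern; they are not the actual protocol from this thread.

```julia
# Hypothetical message layout, for illustration only:
#   UInt8 header | UInt32 payload length in bytes | UTF-16 payload | UInt8 checksum
function build_message(str::String; header::UInt8 = 0x01)
    units = transcode(UInt16, str)
    io = IOBuffer()
    write(io, header)                        # 1 byte
    write(io, htol(UInt32(sizeof(units))))   # payload length in bytes, little endian
    write(io, htol.(units))                  # UTF-16 payload, little endian
    body = take!(io)                         # everything written so far
    checksum = reduce(xor, body)             # toy XOR checksum (an assumption)
    return push!(body, checksum)
end

msg = build_message("a±")
println(msg)
```

Each part goes straight into the same IOBuffer, so there are no per-field temporary arrays and no concatenation step; the only array you end up holding is the final message.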