What is `Char`?


#1

Here’s something that came as a surprise:

A = UInt8['a', 'b', 'c']

ptr = convert(Ptr{Char}, pointer(A))

a = unsafe_load(ptr, 1)  # this is not 'a'; it is '\U636261'

So Julia for some reason decided to load a 24-bit Char (correction: no, it’s actually 32-bit, see below). I’m guessing people are going to tell me that the right way to do this is simply to load a UInt8 and convert it to Char, which is not ideal because it requires special handling in code that is designed to load generic data. I don’t even know why it stopped at 24 bits and not 16 or 32. Is there a way of loading directly to an 8-bit Char? I’m now pretty confused about what a Char even is.
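One way to keep generic loading code while still handling Char is a small dispatch-based helper. The sketch below is only an illustration: `load_as` is a hypothetical name (not a Base function), and the Char method assumes one byte per character, so bytes above 0x7f would be read as code points 0x80 to 0xff rather than decoded as UTF-8.

```julia
# Hypothetical helper: load the i-th element of type T from a byte buffer.
# Generic case: reinterpret the pointer. The caller must make sure that
# i * sizeof(T) bytes really exist behind the pointer.
load_as(::Type{T}, p::Ptr{UInt8}, i::Integer) where {T} =
    unsafe_load(convert(Ptr{T}, p), i)

# Char case: load a single byte and convert it, instead of reading 4 bytes.
load_as(::Type{Char}, p::Ptr{UInt8}, i::Integer) = Char(unsafe_load(p, i))

A = UInt8[0x61, 0x62, 0x63]   # the bytes of "abc"

# GC.@preserve keeps A alive while its raw pointer is in use.
a = GC.@preserve A load_as(Char, pointer(A), 1)
b = GC.@preserve A load_as(UInt8, pointer(A), 2)
```

With this, `a` is a Char built from the first byte only, and no bytes past the end of `A` are read.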


#2

Char represents a 32-bit Unicode code point (I think :D), so the translation from x[idx]::UInt8 is not straightforward: you might need multiple (up to 4) UInt8 elements to construct a valid Char from a UTF-8 byte array.
On 0.6 & 0.7 I get a 32-bit Char out of this, which is what you’d expect.
What you’re doing would be an out-of-bounds error if it weren’t for the unsafe, since your array only holds 24 bits :wink:
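To illustrate the point about needing several bytes per character, here is a minimal sketch that decodes a UTF-8 byte array by going through String. The byte values are the ones for '∀' quoted later in this thread; the variable names are just for illustration.

```julia
bytes = UInt8[0xe2, 0x88, 0x80]   # UTF-8 encoding of '∀' (U+2200), three bytes

# String(::Vector{UInt8}) takes ownership of its argument, so pass a copy
# to keep `bytes` usable afterwards.
s = String(copy(bytes))

c = first(s)          # the first (and only) character
n_units = ncodeunits(s)   # number of UTF-8 code units (bytes)
n_chars = length(s)       # number of characters
```

Three UInt8 values come back as a single Char, which is why a byte pointer cannot simply be indexed as a Char pointer.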


#3

Sorry, correction to the above: when I do sizeof(a) I do indeed get 4. I was confused because \U636261 looks 24-bit. Evidently you are right @sdanisch, they are always 32-bit.


#4

You are casting a Ptr{UInt8} to a Ptr{Char}, and Char is a 32-bit quantity.
That’s why you are seeing the 3 characters packed into that one 32-bit word (along with a trailing \0, which is a byte read from past the end of your 3-element array).

In master, Char is not even simply a Unicode code point (i.e. 0-0xd7ff, 0xe000-0x10ffff) anymore,
it is a rather complex method of packing a 1 to 4 byte UTF-8 encoding of a Unicode code point into a 32-bit value, and also allowing storing invalid UTF-8 1 to 4 byte sequences.
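The packing described above can be reproduced safely with reinterpret instead of a raw pointer. This is a sketch that adds an explicit fourth zero byte, so nothing is read out of bounds; `ltoh` is used so the result is the little-endian interpretation on any host.

```julia
bytes = UInt8[0x61, 0x62, 0x63, 0x00]   # 'a', 'b', 'c', plus an explicit zero byte

word = reinterpret(UInt32, bytes)[1]    # the four bytes viewed as one native-endian word
word_le = ltoh(word)                    # little-endian interpretation of those bytes

hex = string(word_le, base = 16)        # "636261", matching the '\U636261' seen in post #1
```

The first byte ends up in the lowest 8 bits, which is why the hex digits read "c b a" from left to right.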


#5

It’s not really complicated at all. Char values are now just UTF-8 bytes padded with trailing zeros:

julia> reinterpret(UInt32, '∀')
0xe2888000

julia> codeunits("∀")
3-element Base.CodeUnits{UInt8,String}:
 0xe2
 0x88
 0x80
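As a small sketch of the difference between the raw Char representation and the code point (assuming Julia 0.7 or later, where the layout shown above applies):

```julia
c = '∀'

bits = reinterpret(UInt32, c)     # raw representation: UTF-8 bytes, zero-padded on the right
cp = codepoint(c)                 # the Unicode code point as a UInt32 (U+2200)
back = reinterpret(Char, bits)    # the raw bits round-trip to the same Char

ascii_bits = reinterpret(UInt32, 'a')   # a 1-byte character sits in the top byte
ascii_cp = UInt32('a')                  # conversion (not reinterpret) gives the code point
```

So `reinterpret` exposes the padded UTF-8 bytes, while `codepoint` or `UInt32(c)` performs the decoding to a numeric code point.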

#6

And that is very complex, compared to simply having the code point stored as its numerical value.
As I showed elsewhere, the difference in the code generated to pack and unpack that UTF-8-based format is very large (going between Char and UInt32 is a no-op in v0.6 and earlier versions of Julia).
It also means that you can’t share a Vector{Char} with a C/C++ array of wchar_t (when it is 32-bit) or char32_t anymore.
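If an array of 32-bit code points is needed (for example, to hand to code expecting char32_t), one possible workaround is an explicit conversion. This is only a sketch: it copies the data rather than sharing memory, so it does not restore the zero-cost sharing described above.

```julia
chars = ['a', '∀', '😀']

# Convert each Char to its numeric code point; the result is a new Vector{UInt32}.
codes = UInt32.(chars)

# Convert back from code points to Char.
roundtrip = Char.(codes)
```

The `codes` vector holds plain code points in native byte order, at the price of one conversion pass in each direction.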


#7

Can we avoid having this discussion again please?