What is `Char`?

ExpandingMan · January 19, 2018, 4:48pm

Here’s something that came as a surprise:

A = UInt8['a', 'b', 'c']

ptr = convert(Ptr{Char}, pointer(A))

a = unsafe_load(ptr, 1)  # this is not 'a'; it is '\U636261'

so Julia for some reason decided to load a 24-bit Char (correction: no, it’s actually 32-bit, see below). I’m guessing that people are going to tell me that the right way to do this is simply to load a UInt8 and convert it to Char (not ideal, as it requires special handling in code that is designed to load generic stuff). I don’t even know why it stopped at 24 bits and not 16 or 32. Is there a way of loading directly to an 8-bit Char? I’m now pretty confused about what a Char even is.
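For reference, a minimal sketch of the "load a UInt8 and convert" route mentioned above. It assumes Julia 0.7 or later (for `GC.@preserve`); the variable names are only illustrative.

```julia
A = UInt8['a', 'b', 'c']

# Load exactly one byte through a pointer of the array's own element type,
# then convert the loaded value (not the pointer) to Char.
# GC.@preserve keeps A alive while its raw pointer is in use.
first_char = GC.@preserve A Char(unsafe_load(pointer(A), 1))

# The same conversion without pointers, applied to every byte.
# This is only meaningful when each byte is a single ASCII/Latin-1 code point.
chars = Char.(A)
```

This reads one byte per element, so nothing past the end of `A` is touched.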

sdanisch · January 19, 2018, 5:08pm

Char represents a 32-bit Unicode code point (I think :D), so the translation from x[idx]::UInt8 is not straightforward, and you might need multiple (up to 4) UInt8 elements to construct a valid Char from a UInt8 array holding a UTF-8 string.
On 0.6 & 0.7 I get a 32-bit Char out of this, which is what you would expect.
What you are doing would be an out-of-bounds error if the load weren’t unsafe, since your array only has 24 bits.
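A small sketch of the "up to 4 UInt8 elements per Char" point, assuming the bytes are valid UTF-8 and Julia 0.7 or later. The byte values are just an example: `0x61` is 'a' and `0xe2 0x88 0x80` is '∀'.

```julia
bytes = UInt8[0x61, 0xe2, 0x88, 0x80]

# String(...) takes ownership of the vector it is given, so pass a copy.
str = String(copy(bytes))

# Decoding the UTF-8 string yields one Char per code point, not per byte.
chars = collect(str)

# Number of UTF-8 bytes each character occupies.
widths = [ncodeunits(string(c)) for c in chars]
```

Four bytes go in and two characters come out, one of width 1 and one of width 3.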

ExpandingMan · January 19, 2018, 5:24pm

Sorry, correction to the above: when I do sizeof(a) I do indeed get 4. I was confused because \U636261 looks 24-bit. Evidently you are right @sdanisch, they are always 32-bit.

ScottPJones · January 19, 2018, 9:09pm

You are casting a pointer to UInt8 into Char, which is a 32-bit quantity.
That’s why you are seeing the 3 characters packed into that one 32-bit word (along with a trailing \0).
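The packing can be reproduced safely by making the fourth byte explicit. This is a sketch that assumes a little-endian host (e.g. x86); in the original snippet the fourth byte is whatever happens to sit after the array in memory, which here is simply set to zero.

```julia
# The three bytes of "abc" plus an explicit zero byte, so the read stays in bounds.
bytes = UInt8[0x61, 0x62, 0x63, 0x00]

# View the four bytes as a single 32-bit word.
# On a little-endian host the first byte becomes the least significant one.
word = reinterpret(UInt32, bytes)[1]
```

On such a host `word` is `0x00636261`, which matches the `'\U636261'` shown in the first post.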

In master, Char is not even simply a Unicode code point (i.e. 0-0xd7ff, 0xe000-0x10ffff) anymore,
it is a rather complex method of packing a 1 to 4 byte UTF-8 encoding of a Unicode code point into a 32-bit value, and also allowing storing invalid UTF-8 1 to 4 byte sequences.

StefanKarpinski · January 19, 2018, 10:30pm

It’s not really complicated at all. Char values are now just UTF-8 bytes padded with trailing zeros:

julia> reinterpret(UInt32, '∀')
0xe2888000

julia> codeunits("∀")
3-element Base.CodeUnits{UInt8,String}:
 0xe2
 0x88
 0x80
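To make the "UTF-8 bytes padded with trailing zeros" layout concrete, here is a rough sketch that rebuilds the 32-bit value by hand. `padded_utf8` is a hypothetical helper written for this illustration, not a Base function, and the comparison assumes the representation used in Julia 0.7 and later.

```julia
# Hypothetical helper: pack the UTF-8 bytes of `c` into a UInt32,
# first byte in the most significant position, zero-padded on the right.
function padded_utf8(c::Char)
    word = UInt32(0)
    for (i, b) in enumerate(codeunits(string(c)))
        word |= UInt32(b) << (32 - 8i)
    end
    return word
end

ascii_word = padded_utf8('a')   # one byte, three zero bytes of padding
forall_word = padded_utf8('∀')  # three bytes, one zero byte of padding
```

With that representation, both values should agree with `reinterpret(UInt32, c)` for the same character.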

ScottPJones · January 20, 2018, 1:30pm

And that is very complex, compared to simply having the code point stored as its numerical value.
As I showed elsewhere, the difference in the code generated to pack and unpack that UTF-8-based format is very large (in v0.6 and earlier versions of Julia, going between Char and UInt32 is a no-op).
It also means that you can’t share a Vector{Char} any more with a C/C++ array of wchar_t (when it is 32-bit) or char32_t.
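For readers on Julia 0.7 or later, a short sketch of the distinction being argued about: the numeric code point versus the raw bits of a Char. The last line is only one possible way to get a buffer of 32-bit code points (the layout a C `char32_t` array would use); it makes a copy and is an illustration, not a recommendation from the thread.

```julia
c = '∀'

# Numeric Unicode code point (U+2200); this requires decoding the stored bytes.
cp = codepoint(c)

# Raw 32-bit contents of the Char, i.e. the zero-padded UTF-8 bytes.
raw = reinterpret(UInt32, c)

# Going back from a code point to a Char re-encodes it.
back = Char(cp)

# A copied vector of code points, one UInt32 per character.
utf32 = UInt32.(collect("a∀"))
```

The two 32-bit values differ for any non-ASCII character, which is why a `Vector{Char}` can no longer be handed to C as-is.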

nalimilan · January 20, 2018, 2:47pm

Can we avoid having this discussion again please?
