Understanding `sizeof` return values on `Char` / `String`

Can someone explain this behavior of sizeof vs summarysize?

sizeof("z")
# 1
sizeof('z')
# 4
Base.summarysize('z')
# 4
Base.summarysize("z")
# 9

When I read the doc

sizeof(str::AbstractString)
Size, in bytes, of the string str. Equal to the number of code units in str multiplied by the size, in bytes, of one code unit in str.
From this I understood that sizeof and summarysize should return the same value in this case… What am I missing?

Some context: I want to convert a Vector of Strings into a Vector of some struct by splitting the strings at some separator and then converting the resulting substrings to more appropriate types (Char, Int, …) where possible.

I am on Julia 1.7.0-rc1

sizeof('z') == 4 because a Char is stored as a 32-bit value (see ?Char). This is required so any Unicode codepoint can fit in a Char.

sizeof("z") == 1 because encoding “z” in UTF-8 takes only one byte.
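To make the contrast concrete, here is a small sketch (the sample characters are my own choice, not from the question) that prints the fixed size of a Char next to the UTF-8 size of the equivalent one-character String. It only relies on sizeof, ncodeunits and length from Base.

```julia
# A Char always occupies 4 bytes, while a String stores UTF-8,
# which uses 1 to 4 bytes per character.
for c in ('z', 'é', '€', '😀')
    s = string(c)  # one-character String holding the same character
    println(repr(c),
            ": sizeof(Char) = ", sizeof(c),
            ", sizeof(String) = ", sizeof(s),
            ", ncodeunits = ", ncodeunits(s),
            ", length = ", length(s))
end
```

For a String, sizeof matches ncodeunits because each code unit is a single UInt8, so the product in the docstring reduces to the number of bytes.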

Base.summarysize('z') == 4 because a Char is a simple value type.

Base.summarysize("z") == 9 because… hmm, I’m not sure: I think this counts 8 bytes for the pointer to the region of memory that holds the string, and 1 byte for the string itself. But shouldn’t it also count some bytes for storing the length of the string?


The reason that this is 9 is that, internally, a String consists of both an array of bytes (the UTF-8 code units of the encoded string) and an internal length::Int field, and summarysize includes the size of that Int. sizeof(Int) == 8 on a 64-bit machine, and 1 + 8 == 9. (Technically, a String object may have an even bigger footprint in memory: not only may it implicitly include a 1-byte NUL terminator for ease of passing to C, but a heap-allocated Julia value can also have a preamble with a type tag and some other info.) In contrast, sizeof only gives you the size of the underlying String data, not that of the Julia wrappers around it.
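If you want to check that arithmetic yourself, something like the following should do. Note that string_footprint is a hypothetical helper I made up for illustration, and Base.summarysize is an unexported function whose accounting could differ between Julia versions, so treat the comparison as a sketch rather than a guarantee.

```julia
# Hypothetical helper: one Int for the length field plus the UTF-8 bytes,
# i.e. the figure described above for a String.
string_footprint(s::String) = sizeof(Int) + sizeof(s)

for s in ("", "z", "hello", "héllo")
    println(repr(s),
            ": sizeof = ", sizeof(s),
            ", predicted = ", string_footprint(s),
            ", Base.summarysize = ", Base.summarysize(s))
end
```

On a 64-bit machine the predicted value for "z" is 8 + 1, which is the 9 reported in the question.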
