Write - accented characters take extra byte

yoh-meyers · November 10, 2021, 5:49pm

Hello,

When calling write with a string containing accented characters, I noticed that it returns 2 (bytes written) for a single character.

E.g:

julia> write(stdout, "ñ")
ñ2

julia> write(stdout, "n")
n1

This is an issue when the receiving side expects a given length, for example in HTTP parsing that relies on Sockets.

How can this be overcome?

Checking the Julia code base, I see that a ccall with chars is being used.

Oscar_Smith · November 10, 2021, 5:50pm

This is just standard UTF-8. Not all characters are one byte.
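
To make that concrete, here is a small sketch that prints the UTF-8 bytes behind a few strings. The sample strings are just illustrations picked for this example; codeunits and ncodeunits are from Base.

```julia
# Inspect the UTF-8 code units (bytes) behind each string.
for s in ("n", "ñ", "€")
    bytes = collect(codeunits(s))   # Vector{UInt8} holding the encoded bytes
    println(repr(s), " => ", ncodeunits(s), " byte(s): ", bytes)
end
```

ASCII characters such as "n" encode to a single byte, while "ñ" needs two and "€" needs three, which is why write reports more bytes than there are characters.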

yoh-meyers · November 10, 2021, 5:53pm

Ah yes, makes sense indeed!

mbauman · November 10, 2021, 6:10pm

You can use sizeof to ask for the number of bytes required to write the string. length returns the number of characters.
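
A minimal sketch of the difference, using an IOBuffer as a stand-in for stdout or a socket (the variable names are only for illustration):

```julia
s = "ñ"

nchars = length(s)       # number of characters
nbytes = sizeof(s)       # number of bytes in the UTF-8 encoding

io = IOBuffer()          # in-memory stream standing in for stdout or a socket
written = write(io, s)   # write returns the number of bytes written

# The return value of write matches sizeof, not length.
@assert written == nbytes
println("chars = ", nchars, ", bytes = ", nbytes, ", written = ", written)
```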

yoh-meyers · November 10, 2021, 6:22pm

Ah perfect, I was using transcode(UInt8, string).
No unnecessary conversion, with your suggestion!
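
For the HTTP case, a rough sketch of what this could look like, assuming a hand-built response with a Content-Length header. The body text and header layout are made up for illustration and are not taken from any particular HTTP library.

```julia
body = "señor"   # hypothetical payload: 5 characters, 6 bytes in UTF-8

# Content-Length counts bytes, so use sizeof rather than length.
header = "HTTP/1.1 200 OK\r\nContent-Length: $(sizeof(body))\r\n\r\n"

io = IOBuffer()                  # stand-in for the socket
write(io, header)
body_bytes = write(io, body)     # number of bytes actually written for the body

# The byte count from sizeof agrees with the transcode-based approach.
@assert sizeof(body) == length(transcode(UInt8, body))
@assert body_bytes == sizeof(body)
```

Using length(body) for the header here would under-report the size by one byte, since "ñ" is a single character but two bytes.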

ptoche · November 10, 2021, 6:36pm

Many years ago my phone company charged per sms, where each sms had a fixed size. I discovered that while my phone indicated the number of characters used, the phone company charged by “byte”. I was careful not to omit accents then and typically maxed the allowed limit as indicated by my phone. As a result I exceeded the number of allowed bytes and ended up being charged for 2 sms for every one I had sent. Those were international rates too. Ah the naughties.

amrods · November 11, 2021, 10:31pm

then “boomers” complained about kids writing “hru” and similar atrocities, haha
