Why so complex representation of a char?

peters · November 2, 2018, 10:09am

Clear answer to my stupid question. Now, I can understand the underlying reason. Thanks @StefanKarpinski.

StefanKarpinski · November 2, 2018, 12:41pm

Not a stupid question at all. When @stevengj first proposed printing the extra information I was a bit thrown off too but the more I thought about it the more I realized “why not?” and couldn’t think of any good reason not to show extra info in this brave (not so) new world of Unicode.

simonbyrne · November 2, 2018, 6:11pm

Could be great as a package though.

Liso · November 2, 2018, 7:59pm

Sorry, just a little (stupid!) helper for the impatient:

julia> using HTTP, Formatting

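julia> function what(c::Char)
           # Mask off the low byte to get the 256-code-point block, formatted as four uppercase hex digits for the file name.
           # Note: the 0xff00 mask discards bits above 0xFFFF, so this only handles code points up to U+FFFF.
           cc = uppercase(Formatting.sprintf1("%04x", codepoint(c) & 0xff00))
           r = HTTP.request("GET", "https://raw.githubusercontent.com/unicode-table/unicode-table-data/master/loc/en/symbols/$(cc).txt")
           # The low byte of the code point selects the line within the downloaded block.
           s = split(String(r.body), '\n')
           c, s[1 + (codepoint(c) & 0xff)]
       end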
what (generic function with 1 method)

julia> what('\u0b17')
('ଗ', "0B17: Oriya Letter Ga")

julia> what('`')
('`', "0060: Grave Accent")

julia> what('\u3401')
('㐁', "3401: Ideograph to lick; to taste, a mat, bamboo bark CJK : tim2 : tiàn")

julia> what('ϵ')
('ϵ', "03F5: Greek Lunate Epsilon Symbol : straight epsilon")

tkf · November 2, 2018, 8:20pm

If you are offline:

julia> using PyCall

julia> pyimport("unicodedata")[:name]("ϵ")
"GREEK LUNATE EPSILON SYMBOL"

julia> pyimport("unicodedata")[:name]("ε")
"GREEK SMALL LETTER EPSILON"

KZiemian · November 2, 2018, 10:11pm

This article does an excellent job of explaining why Unicode is needed and why it is so complicated: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software
To be honest, Unicode is sometimes silly, like defining A^y (where A is a wide range of symbols) but not A_y. After you read it, I recommend the Julia documentation on strings and chars: Strings · The Julia Language

StefanKarpinski · November 2, 2018, 11:00pm

That is a great article. Highly recommended reading! Fortunately a lot of the awkwardness of the past regarding Unicode and various encodings has gotten much better and people just use UTF-8 most of the time.

Also note that since that article was written UTF-8 has been limited to 4 bytes.
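
To make the 4-byte limit concrete, here is a small Julia sketch using characters that appear elsewhere in this thread. The helper name `utf8_width` is made up for illustration; it is not part of Julia.

```julia
# Hypothetical helper: number of UTF-8 bytes (code units) needed for one Char.
# ncodeunits(string(c)) is used so that the sketch also runs on Julia 1.0.
utf8_width(c::Char) = ncodeunits(string(c))

for c in ('a', 'ε', '㐁', '𐆀')
    hex = string(codepoint(c), base = 16, pad = 4)
    println(repr(c), "  U+", uppercase(hex), "  ", utf8_width(c), " byte(s)")
end

# The largest valid code point, U+10FFFF, still fits in 4 bytes.
widest = utf8_width('\U10ffff')
```

The four sample characters need 1, 2, 3 and 4 bytes respectively.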

stevengj · November 3, 2018, 1:17am

That article seems to convey the false impression that UTF-16 is the same as UCS-2 and is a fixed-width encoding, a misunderstanding that has led to innumerable bugs.

Tamas_Papp · November 3, 2018, 8:15am

I know that article is also linked from the manual, but am puzzled why people find it so great. It basically makes the following points:

1. prior to Unicode, people used different encodings,
2. Unicode standardizes that, great,
3. codepoints are encoded into bytestreams in various ways, of which UTF-8 should be the one you care about.

All of these are valid and important, but buried in a lot of other irrelevant information like

A in a Times New Roman font is the same character as the A in a Helvetica font
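
For readers who want the "codepoints are encoded into bytestreams" point in runnable form, a minimal Julia sketch follows. It only assumes that Julia's `String` is UTF-8 encoded, which is the default.

```julia
c = 'ε'

# The abstract code point, independent of any encoding.
cp = codepoint(c)

# The same character serialized as UTF-8 bytes.
bytes = collect(codeunits(string(c)))

# Decoding the bytes recovers the character (copy, because String takes ownership of the vector).
decoded = String(copy(bytes))

println("U+", uppercase(string(cp, base = 16, pad = 4)), " -> ", bytes, " -> ", decoded)
```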

Liso · November 3, 2018, 9:35am

BTW, how can we write “bigger” Unicode chars in Julia?

For example:
https://unicode-table.com/en/10180/

julia> '\u10180'
ERROR: syntax: invalid character literal

julia> '𐆀'
'𐆀': Unicode U+010180 (category So: Symbol, other)

julia> pyimport("unicodedata")[:name]("𐆀")
"GREEK FIVE OBOLS SIGN"

StefanKarpinski · November 3, 2018, 1:32pm

I haven’t read it in a long time, and it is fairly old, although I’m pretty sure that UTF-16 was already not the same as UCS-2 at that point.

For those who are curious, UCS-2 is a fixed width two-byte encoding that was once thought to be sufficient for all the world’s languages, but it turns out that one needs more than 65k characters for that. UCS-2 is roughly a fixed-width subencoding of UTF-16.
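
The difference can be seen from Julia with `transcode`, which converts between Unicode encodings. This is only an illustrative sketch; the variable names are arbitrary.

```julia
# transcode with UInt16 as the target produces UTF-16 code units.
bmp    = transcode(UInt16, "ε")   # U+03B5, inside the Basic Multilingual Plane
astral = transcode(UInt16, "𐆀")   # U+10180, outside it

println(length(bmp), " code unit: ", bmp)        # one 16-bit unit, also expressible in UCS-2
println(length(astral), " code units: ", astral) # a surrogate pair, which UCS-2 cannot express

# Converting back gives the original string.
roundtrip = transcode(String, astral)
```

Code that assumes one 16-bit unit per character works for the first string and silently breaks on the second, which is the class of bug mentioned above.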

StefanKarpinski · November 3, 2018, 1:32pm

Use a capital U.
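
A short sketch of the difference: `\u` reads at most 4 hex digits, while `\U` reads up to 8, so code points above U+FFFF need the capital form.

```julia
# \u accepts at most 4 hex digits, so it only reaches U+FFFF.
small = '\u0b17'

# \U accepts up to 8 hex digits and covers the rest, e.g. U+10180.
big = '\U10180'

println(small, " ", big)
println(big == '𐆀')
```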

Liso · November 3, 2018, 3:14pm

Thanks!

The distinction is probably not necessary for Char literals, especially since the number of hex digits is not strict.

julia> '\U61'
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> '\u61'
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

I understand why this is important in strings:

julia> println("\u000101800")
01800

julia> println("\U000101800")
𐆀0

Although the fact that string literals do not require a strict number of hex digits surprised me too:

julia> println("\U65gg")
egg

julia> println("\U65bb")  # changing 'egg' to 'ebb' is not so simple
斻
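
One way around the “ebb” problem, sketched under the rule shown above that an escape consumes as many hex digits as it is allowed to: pad the escape to its maximum width, or end the literal before the following letters.

```julia
# \u stops after 4 hex digits and \x after 2, so padding fixes where the escape ends.
a = "\u0065bb"    # U+0065 followed by "bb"
b = "\x65bb"      # byte 0x65 ('e') followed by "bb"

# Splitting the literal also removes the ambiguity.
c = "\u65" * "bb"

println(a, " ", b, " ", c)
```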

peters · November 4, 2018, 1:15pm

I gave this issue more thought, and I still feel that printing 58 characters for one char is too verbose.

julia> 'a'
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

How about adopting a more concise representation without losing all the information? For instance,

'a': ASCII U+0061 (Ll: Letter, lowercase)

To anyone who knows UTF-8, “U+” clearly indicates that 0061 is its Unicode code point and “(Ll: Letter, lowercase)” is its Unicode category. Explicit is better than implicit, but I think concise is better than verbose.

KZiemian · November 4, 2018, 1:20pm

This is not a problem for me, since I mostly use strings, including one-char strings, to print messages (if I remember correctly, in Julia the size of a string is measured in bytes, so a one-char string may not have size one). When I want to check what one concrete char is, this verbose description is very convenient. At least that is how I look at it.
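
On the parenthetical about string size, a small sketch of how Julia separates the two counts: `length` counts characters, while `sizeof` and `ncodeunits` count bytes of the UTF-8 encoding.

```julia
s = "ε"

nchars = length(s)       # number of characters
nbytes = sizeof(s)       # number of bytes in the UTF-8 encoding
nunits = ncodeunits(s)   # the same count, expressed as code units

# Byte indices and character positions can therefore differ.
t = "εa"
last_byte_index = lastindex(t)   # index at which 'a' starts
char_count = length(t)

println(nchars, " char, ", nbytes, " bytes; lastindex(\"εa\") = ", last_byte_index)
```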

Tamas_Papp · November 4, 2018, 3:25pm

It is of course possible to condense this, but I am wondering if this is a problem in practice. I imagine that when I am printing a single Char, or a sequence of such where I care about them individually, I would not mind the extra information since most likely I am debugging a function or doing something similar.

It would be great if you told us more about the context in which you find this information overwhelming.

DNF · November 4, 2018, 6:03pm

I have very little interest in all that information, but as long as it doesn’t wrap to the next line, I don’t see the problem. Nor do I see any benefit to condensing it by some small amount.

peters · November 5, 2018, 10:45pm

@KZiemian. It is an informative article. Thanks for sharing it with me.

peters · November 14, 2018, 12:53am

Ok. It sounds like you guys are ok with printing the 50 characters for one character. As somebody said, there is no harm in the long representation, but I bet some other users will ask for a more concise one as Julia extends its user base.

stevengj · November 14, 2018, 1:10am

We already provide both the concise representation and the long description, as I explained above. You still haven’t explained the problem with this.
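
The earlier explanation referred to here is not part of this excerpt. As an assumption about what is meant, the sketch below shows three output forms Julia offers for a `Char`: `print`, two-argument `show`, and the `text/plain` `show` that the REPL uses for display.

```julia
c = 'a'

# Capture each form as a string so they can be compared side by side.
concise = sprint(print, c)                      # the bare character
literal = sprint(show, c)                       # the quoted literal form
verbose = sprint(show, MIME("text/plain"), c)   # the long REPL form quoted in this thread

println(concise)
println(literal)
println(verbose)
```

So code that prints chars inside messages gets the short forms, and only interactive display of a lone `Char` produces the long description.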
