Why such a complex representation of a Char?

Clear answer to my stupid question. Now, I can understand the underlying reason. Thanks @StefanKarpinski.

Not a stupid question at all. When @stevengj first proposed printing the extra information I was a bit thrown off too but the more I thought about it the more I realized “why not?” and couldn’t think of any good reason not to show extra info in this brave (not so) new world of Unicode.

Could be great as a package though.

Sorry, just a little (stupid!) helper for the impatient :wink:

julia> using HTTP, Formatting

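julia> function what(c::Char)
           # Data files are named after the 256-code-point block, e.g. "0B00.txt"
           cc = uppercase(Formatting.sprintf1("%04x", (codepoint(c) & 0xff00)))
           r = HTTP.request("GET", "https://raw.githubusercontent.com/unicode-table/unicode-table-data/master/loc/en/symbols/$(cc).txt")
           s = split(String(r.body), '\n')
           # The low byte of the code point selects the line (plus 1 for 1-based indexing)
           c, s[1+(codepoint(c) & 0xff)]
       end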
what (generic function with 1 method)

julia> what('\u0b17')
('ଗ', "0B17: Oriya Letter Ga")

julia> what('`')
('`', "0060: Grave Accent")

julia> what('\u3401')
('㐁', "3401: Ideograph to lick; to taste, a mat, bamboo bark CJK : tim2 : tiàn")

julia> what('ϵ')
('ϵ', "03F5: Greek Lunate Epsilon Symbol : straight epsilon")

If you are offline:

julia> using PyCall

julia> pyimport("unicodedata").name("ϵ")
"GREEK LUNATE EPSILON SYMBOL"

julia> pyimport("unicodedata").name("ε")
"GREEK SMALL LETTER EPSILON"

This article does an excellent job of explaining why Unicode is needed and why it is so complicated: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software
To be honest, Unicode is sometimes silly, like defining A^y (where A stands for a wide range of symbols) but not A_y. After you read the article, I recommend the Julia documentation on strings and chars: Strings · The Julia Language

That is a great article. Highly recommended reading! Fortunately a lot of the awkwardness of the past regarding Unicode and various encodings has gotten much better and people just use UTF-8 most of the time.

Also note that since that article was written UTF-8 has been limited to 4 bytes.
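To see the 1-to-4-byte range in practice, here is a small sketch using characters that appear in this thread. `utf8_bytes` is a helper name made up for illustration, not something Julia provides.

```julia
# Hypothetical helper: number of UTF-8 code units (bytes) needed to encode one Char.
utf8_bytes(c::Char) = ncodeunits(string(c))

# '\U10ffff' is the largest code point Unicode allows.
for c in ('a', 'ϵ', '㐁', '𐆀', '\U10ffff')
    hex = uppercase(string(codepoint(c), base = 16, pad = 4))
    println("U+", hex, " => ", utf8_bytes(c), " byte(s)")
end
# U+0061 => 1 byte(s)
# U+03F5 => 2 byte(s)
# U+3401 => 3 byte(s)
# U+10180 => 4 byte(s)
# U+10FFFF => 4 byte(s)
```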

That article seems to convey the false impression that UTF-16 is the same as UCS-2 and is a fixed-width encoding, a misunderstanding that has led to innumerable bugs.

I know that article is also linked from the manual, but am puzzled why people find it so great. It basically makes the following points:

  1. prior to Unicode, people used different encodings,
  2. Unicode standardizes that, great,
  3. codepoints are encoded into bytestreams in various ways, of which UTF-8 should be the one you care about.

All of these are valid and important, but buried in a lot of other irrelevant information like

A in a Times New Roman font is the same character as the A in a Helvetica font

BTW, how can we write “bigger” Unicode chars in Julia?

For example:
https://unicode-table.com/en/10180/

julia> '\u10180'
ERROR: syntax: invalid character literal

julia> '𐆀'
'𐆀': Unicode U+010180 (category So: Symbol, other)

julia> pyimport("unicodedata").name("𐆀")
"GREEK FIVE OBOLS SIGN"

I haven’t read it in a long time, and it is fairly old, although I’m pretty sure that UTF-16 was already not the same as UCS-2 at that point.

For those who are curious, UCS-2 is a fixed width two-byte encoding that was once thought to be sufficient for all the world’s languages, but it turns out that one needs more than 65k characters for that. UCS-2 is roughly a fixed-width subencoding of UTF-16.
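A minimal sketch of why UTF-16 is not fixed-width, using `transcode` from Base and the U+10180 character from earlier in the thread. A code point above U+FFFF needs two 16-bit code units (a surrogate pair), which UCS-2 cannot represent.

```julia
s = "𐆀"                         # U+10180, outside the 16-bit range
units = transcode(UInt16, s)    # UTF-16 code units of the string

println(length(units))          # 2: a surrogate pair, not one 16-bit unit
println(units == [0xd800, 0xdd80])

# A character inside the 16-bit range still takes a single code unit.
println(length(transcode(UInt16, "ϵ")))   # 1

# Converting back to a (UTF-8) String recovers the original.
println(transcode(String, units) == s)
```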

Use a capital U.
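For example (a short sketch; the `Char(...)` constructor is shown only as an alternative spelling):

```julia
c = '\U10180'                     # capital U accepts up to 8 hex digits
println(c == '𐆀')                 # true
println(codepoint(c) == 0x10180)  # true

# Constructing the Char from the integer code point gives the same character.
println(Char(0x10180) == c)       # true
```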

Thanks! :slight_smile:

It is probably not necessary with Char literals, especially since the number of hex digits is not strict.

julia> '\U61'
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> '\u61'
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

I understand why this is important in Strings:

julia> println("\u000101800")
01800

julia> println("\U000101800")
𐆀0

Although not requiring a strict number of hex digits in string literals surprised me too:

julia> println("\U65gg")
egg

julia> println("\U65bb")  # changing 'egg' to 'ebb' is not so simple
斻
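If the goal really is "ebb", one way around the greedy hex parsing is to give the escape its maximum number of digits (4 for `\u`, 8 for `\U`) or to end the literal right after the escape. A small sketch:

```julia
# \u reads at most 4 hex digits, so the trailing "bb" stays plain text.
println("\u0065bb")        # ebb

# \U reads at most 8 hex digits; padding to 8 has the same effect.
println("\U00000065bb")    # ebb

# Or stop the literal after the escape and concatenate the rest.
println("\U65" * "bb")     # ebb

# Without padding, the escape swallows the following hex digits.
println("\U65bb" == string(Char(0x65bb)))   # true
```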

I gave this issue more thought, and I still feel that printing 58 characters for one char is too verbose.

julia> 'a'
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

How about adopting a more concise representation that still keeps all the information? For instance,

'a': ASCII U+0061 (Ll: Letter, lowercase)

To anyone who knows UTF-8, “U+” clearly indicates that 0061 is its Unicode code point and “(Ll: Letter, lowercase)” is its Unicode category. Explicit is better than implicit, but I think concise is better than verbose.

This is not a problem for me, since I mostly use strings, including strings of one char, to print messages (if I remember correctly, in Julia the size of a string in bytes can differ from its length in characters, so a one-char string may take more than one byte). When I want to check what one concrete char is, this verbose description is very convenient. At least that is how I look at it.
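A quick check of how Julia reports this: `length` counts characters, while `sizeof` and `ncodeunits` count bytes (UTF-8 code units), so the two only agree for ASCII text.

```julia
s = "ϵ"                  # one character, U+03F5

println(length(s))       # 1: number of characters
println(sizeof(s))       # 2: number of bytes
println(ncodeunits(s))   # 2: number of UTF-8 code units

# For an ASCII-only string, the character count and byte count coincide.
println(length("a") == sizeof("a"))   # true
```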

It is of course possible to condense this, but I am wondering if this is a problem in practice. I imagine that when I am printing a single Char, or a sequence of such where I care about them individually, I would not mind the extra information since most likely I am debugging a function or doing something similar.

It would be great if you told us more about the context in which you find this information overwhelming.

I have very little interest in all that information, but as long as it doesn’t wrap to the next line, I don’t see the problem. Nor do I see any benefit to condensing it by some small amount.

@KZiemian, it is an informative article. Thanks for sharing it with me.

OK. It sounds like you guys are fine with printing 50 characters for one character :slight_smile: As somebody said, there is no harm in the long representation, but I bet some other users will ask for a more concise one as Julia extends its user base.

We already provide both the concise representation and the long description, as I explained above. You still haven’t explained the problem with this.
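Presumably this refers to the difference between the compact forms and the `text/plain` display that the REPL uses. A sketch of how each can be obtained as a string (the explanation referenced above is not part of this excerpt, so treat the mapping as an assumption):

```julia
c = 'a'

println(string(c))              # a   (what print shows)
println(repr(c))                # 'a' (compact show, as used inside containers)

# The long description is the text/plain display used at the REPL prompt.
long = repr("text/plain", c)
println(long)
println(startswith(long, "'a': ASCII/Unicode U+0061"))
```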
