Why so complex representation of a char?


#31

BTW how could we write “bigger” unicode chars in Julia?

For example:

julia> '\u10180'
ERROR: syntax: invalid character literal

julia> '𐆀'
'𐆀': Unicode U+010180 (category So: Symbol, other)

julia> pyimport("unicodedata")[:name]("𐆀")
"GREEK FIVE OBOLS SIGN"

#32

I haven’t read it in a long time, and it is fairly old, although I’m pretty sure that UTF-16 was already not the same as UCS-2 at that point.

For those who are curious, UCS-2 is a fixed width two-byte encoding that was once thought to be sufficient for all the world’s languages, but it turns out that one needs more than 65k characters for that. UCS-2 is roughly a fixed-width subencoding of UTF-16.
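
To make the difference concrete, here is a minimal sketch, assuming Julia 1.x and the `transcode` function from Base. It counts the 16-bit code units needed for a character inside the 16-bit range and for one outside it, which a fixed-width two-byte encoding cannot represent at all:

```julia
# Encode two one-character strings as UTF-16 code units (2 bytes each).
ascii_units = transcode(UInt16, "a")   # U+0061, fits in one 16-bit unit
obol_units  = transcode(UInt16, "𐆀")   # U+10180, needs a surrogate pair

println(length(ascii_units))   # number of 16-bit units for 'a'
println(length(obol_units))    # number of 16-bit units for '𐆀'
println(obol_units)            # the two surrogate code units
```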


#33

Use a capital U.


#34

Thanks! :slight_smile:

The distinction is probably not necessary for Char literals, especially since the number of hex digits is not strict.

julia> '\U61'
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> '\u61'
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

I understand why this is important in Strings:

julia> println("\u000101800")
01800

julia> println("\U000101800")
𐆀0

Although not requiring a strict number of hex digits in string literals surprised me too:

julia> println("\U65gg")
egg

julia> println("\U65bb")  # changing 'egg' to 'ebb' is not so simple
斻
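
One way around the 'ebb' problem is to pad the escape to its maximum width, so the parser cannot swallow the following hex-looking letters. A minimal sketch, assuming Julia 1.x, where `\u` reads at most 4 hex digits, `\U` at most 8, and `\x` at most 2:

```julia
# Each escape stops once it has consumed its maximum number of hex digits.
obol  = '\U10180'         # accepted, unlike '\u10180'
ebb_u = "\u0065bb"        # \u consumes "0065", then the literal "bb"
ebb_U = "\U00000065bb"    # \U consumes all 8 digits, then the literal "bb"
ebb_x = "\x65bb"          # \x consumes "65", then the literal "bb"

println(obol == '𐆀')
println(ebb_u, " ", ebb_U, " ", ebb_x)
```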

#35

I gave this issue more thought, and I still feel that printing 58 characters for one char is too verbose.

julia> 'a'
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

How about adopting a more concise representation without losing any of the information? For instance,

'a': ASCII U+0061 (Ll: Letter, lowercase)

To anyone who knows UTF-8, “U+” clearly indicates that 0061 is its Unicode code point and “(Ll: Letter, lowercase)” is its Unicode category. Explicit is better than implicit, but I think concise is better than verbose.


#36

This is not a problem for me, since I mostly use strings, including strings of one char, to print messages (if I remember correctly, in Julia the length of a string is its size in bytes, so a one-char string may not have length one). When I want to check what a particular char is, this verbose description is very convenient. At least that is how I look at it.
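
For what it's worth, the length question is easy to check at the REPL. A minimal sketch, assuming Julia 1.x, where `length` counts characters while `sizeof` and `ncodeunits` count bytes of the UTF-8 encoding:

```julia
s = "𐆀"                  # a one-character string, U+10180

println(length(s))       # number of characters
println(sizeof(s))       # number of bytes
println(ncodeunits(s))   # number of code units (bytes for a String)
```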


#37

It is of course possible to condense this, but I am wondering if this is a problem in practice. I imagine that when I am printing a single Char, or a sequence of such where I care about them individually, I would not mind the extra information since most likely I am debugging a function or doing something similar.

It would be great if you told us more about the context in which you find this information overwhelming.


#38

I have very little interest in all that information, but as long as it doesn’t wrap to the next line, I don’t see the problem. Nor do I see any benefit to condensing it by some small amount.


#39

@KZiemian. It is an informative article. Thanks for sharing it with me.


#40

OK. It sounds like you guys are fine with printing 50 characters for one character :slight_smile: As somebody said, there is no harm in the long representation, but I bet other users will ask for a more concise one as Julia's user base grows.


#41

We already provide both the concise representation and the long description, as I explained above. You still haven’t explained the problem with this.


#42

You are referring to the following question, right? If so, I did not propose to lose any information about the UTF-8 details. Instead, what I proposed is just to drop some superfluous words, e.g. “category”, from:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

I think I can ask the same kind of question you asked: what is the upside of printing the superfluous word over and over, whenever users type characters directly into the REPL? There is no need for any immediate change. Let’s see how many users support the idea of a more concise representation for a character.


#43

Whether to leave out the word category or not seems like a trivial issue.

It really doesn’t matter whether we have that word or not, and I’m not sure it’s really worth debating either.


#44

Looking at this from the perspective of someone not already familiar with Unicode, if you don’t know how ASCII works and that Unicode has letter categories, how would you ever find out what those mysterious U+xxxx and Ll mean? It’s not at all clear to me that U+xxxx in your proposed representation is not a part of ASCII: 'a': ASCII U+0061 (Ll: Letter, lowercase). Sure it’s more information for someone familiar with Unicode, but I think the increased clarity is better when someone’s not familiar with the topic.


#45

When characters are used in isolation, it is reasonable to assume that detailed information is desired. Note that when they are used in collections or to form strings, only the character is printed. E.g.:

julia> 'f'
'f': ASCII/Unicode U+0066 (category Ll: Letter, lowercase)

julia> collect("foo")
3-element Array{Char,1}:
 'f'
 'o'
 'o'

julia> "foo"
"foo"

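Both forms can also be requested explicitly. A minimal sketch, assuming Julia 1.x, where `repr(x)` goes through the compact two-argument `show` (the form used inside collections) and `repr("text/plain", x)` goes through the richer display the REPL uses for a lone value:

```julia
c = 'a'

short = repr(c)                 # compact form, as printed inside collections
long  = repr("text/plain", c)   # detailed form, as displayed at the REPL

println(short)
println(long)
```

So the concise form is always one call away, whichever one the REPL picks by default.
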
#46

I did not see any problem with the complex representation before! :wink:

It is not just when the user types a character into the REPL:

julia> a_or_b() = rand(Bool) ? 'a' : 'b';

julia> a_or_b()
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

If you are not familiar with Unicode, why would you need info about the category?

And if you need additional information about a character, why not use some special function(s)?

One could for example use @tkf’s nice hack:

julia> import PyCall

julia> unicodename(a::Char) = PyCall.pyimport("unicodedata")[:name](string(a));

#47

Yeah, there’s always quite a bit of meta information in these outputs, so it’s not a big deal to me.

But it does seem like something that could be configurable. I’ve seen some requests to mimic Matlab’s format command when printing numbers:

>> format short
>> a
a =
    0.0853

>> format long
>> a
a =
   0.085307484729089

which seems neat. Having an option like that, also for Chars, could be quite handy.


#48

Yes, nothing serious, but I was surprised at all the input I received on my small suggestion.


#49

@Liso. Thanks for being with me. Yes, I like the Python approach.
Julia doesn’t have to bother its users (particularly beginners like me) with all the details. In most cases, users may not need all the details and when they really need the details, they can just call a function to spit out all the details. Explicit is better than implicit but also concise is better than verbose.


#50

Explicit is better than implicit: I read that as saying it is better to ask for some behavior (for example, additional info about a char :wink: ) explicitly than to get it implicitly.

But as I wrote, the complex representation is not something that bothers me much… (and I also described a simple (?) way to avoid the complex info if somebody needs that)