Stupid question on Unicode

Julia 1.3.0 supports Unicode version 12.1.0. I can see how to use characters such as h-bar:
println("\u0127")

I have the rather odd notion to print some Cuneiform characters - I am interested in Babylonian history and went to see the excellent History of Writing exhibition at the British Library.
In Cuneiform a quantity is indicated by poking a round hole in the clay - I think this may be the origin of the 0 character.

Cuneiform starts at code point U+12000, yet when I print
println("\u12003")

This produces the HA character followed by 3. I imagine the secret here is changing the font, which I do not know how to do. I have installed a cuneiform font package - I use Fedora Linux.
Please tell me if I am barking mad, as usual.
And no, I am not programming in Cuneiform. I leave fancy programming such as Roman numerals, Swedish Chef, and whitespace encoding to our Perl colleagues.

2 Likes

It would help to know where you are printing these things (terminal, or which IDE, or some other output).

I have been tempted to use Old Hungarian runes when I run out of Greek letters, but so far managed to resist.

2 Likes

My bad. This is on Fedora Linux, using a GNOME terminal and the REPL.

https://help.gnome.org/users/gnome-terminal/stable/app-fonts.html.en

or gnome-tweak-tool should work.

I browsed a volume on the field equations of physics in an encyclopedia during my PhD study – I’ll dig up the author (I looked into irreversible thermodynamics)… the volume had some 700+ pages, and the author used a large number of alphabets to cover his notation :slight_smile:

(I hope you’ve seen Irving Finkel’s great presentation on YouTube… (https://youtu.be/PfYYraMgiBA))

I downloaded the Noto Sans Cuneiform font, which may be one of the very few fonts that offer these glyphs. Then:

using Luxor

@png begin
    # 500×500 canvas divided into a 15×15 grid of tiles
    tablet = Tiler(500, 500, 15, 15)
    fontface("NotoSansCuneiform-Regular")  # requires the Noto Sans Cuneiform font
    fontsize(15)
    # draw consecutive glyphs from the Cuneiform block, one per tile
    for (pos, n) in tablet
        text(string(Char(0x12000 + n)), pos)
    end
end 500 500 "/tmp/iCuneiform"

[image: "oldskool", the rendered grid of cuneiform glyphs]

6 Likes

In addition to the font issues, if your Unicode code point is more than 4 hex digits long, you should use the \U escape sequence (note the upper case “U”):

https://pkg.julialang.org/docs/julia/THl1k/1.1.1/manual/strings.html#man-characters-1

1 Like

Thank you all. The presentation on Cuneiform is a textbook example of how to deliver a good presentation. Be animated and excited by your subject. Be an expert, of course, but do not use jargon to confuse. Take the audience on a journey.

The problem is solved by @ffevotte pointing out that \U should be used. Thank you!

1 Like

Hi John. It appears that Julia handles \u like most other languages, accepting only 4 hex digits (i.e. BMP characters; the first 65,536 code points). You can see this in the output that you received. You said:

when I print
println("\u12003")
This produces the HA character followed by 3

Well, U+1200 is the “Ethiopic Syllable Ha” (i.e. \u1200), and the “3” in your output is the “3” in \u12003.
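
For anyone who wants to verify, something like the following should show the two separate characters (untested on my end, and the exact REPL display varies a bit across Julia versions):

julia> collect("\u12003")
2-element Array{Char,1}:
 'ሀ': Unicode U+1200 (category Lo: Letter, other)
 '3': ASCII/Unicode U+0033 (category Nd: Number, decimal digit)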

However, you are dealing with supplementary characters (i.e. everything from U+10000 to U+10FFFF). You will likely need to use the \U escape sequence. When using this escape sequence, you should pass in all 8 characters. For example, \U00012003 should be what you are looking for. Unfortunately, I have no way to test :crying_cat_face: .

Please also see:
Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)

@Solomon.Rutzky Thank you. This does indeed work:
println("\U00012003")
𒀃

Thanks for confirming. I forgot to mention in my previous comment that for supplementary characters, many languages allow you to specify the surrogate pair via a double-\u sequence. For U+12003, that would be: \uD808\uDC03 (same as the UTF-16BE hex notation)
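
The arithmetic behind the pair is straightforward. Here is a small hypothetical helper (the name is mine; this is not part of Julia’s Base) that computes it:

# Hypothetical helper: UTF-16 surrogate pair for a supplementary code point.
function surrogate_pair(cp::Integer)
    @assert 0x10000 <= cp <= 0x10FFFF
    offset = UInt32(cp) - 0x10000
    hi = UInt16(0xD800 + (offset >> 10))   # high (lead) surrogate
    lo = UInt16(0xDC00 + (offset & 0x3FF)) # low (trail) surrogate
    return hi, lo
end

surrogate_pair(0x12003)  # -> (0xd808, 0xdc03)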

Surrogate pairs are not Unicode characters, they are a property of the UTF-16 encoding, which Julia doesn’t use—there are no surrogate pairs in UTF-8. That they are exposed in the string escapes is a bit of a leaky abstraction in languages like JavaScript.

Hello Steven. Yes, I am aware that surrogate pairs are a UTF-16-specific construct. However, string escapes aren’t byte sequences of a particular encoding. They are somewhat arbitrary substitutions / macros. There is no inherent meaning in \t any more than there is in \x09, \u0009, or \U00000009. The only requirement is that they behave according to the language specification. If those in charge of the language decided that \T or even \q were to be the tab escape, that might be a poor choice due to the confusion it would cause, but it wouldn’t be “wrong” per se.

Along those same lines, I view the fact that many languages parse surrogate pairs specified via \u into their respective correct code points (regardless of how strings are stored internally) as a convenience that there is little to no reason to not support (especially for languages that do not, or did not in the past, support the \U, or equivalent, escape, such as JavaScript).

And, now that I think about it, I would argue that tying any of this escape sequence stuff to the encoding used internally would be a leaky abstraction.

Julia does allow you to use \uD83D\uDE3F to create a String value with two invalid surrogate code points:

julia> "\uD83D\uDE3F"
"\ud83d\ude3f"

Since the String type is UTF-8 based, that means having two separate well-formed but invalid characters with surrogate code points. Specifically, this is an instance of the WTF-8 extension of UTF-8.
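
You can see those six WTF-8 bytes directly (a REPL sketch; the two three-byte runs match the per-escape encodings discussed further down):

julia> codeunits("\ud83d\ude3f")
6-element Base.CodeUnits{UInt8,String}:
 0xed
 0xa0
 0xbd
 0xed
 0xb8
 0xbf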

This is different from what would happen if you constructed a UTF-16 string with the two (individually invalid) surrogates in the right order: you’d get a correctly encoded :crying_cat_face: character. This is similar to how you can form a correctly encoded UTF-8 character by stringing together the right sequence of code units, each of which would independently be invalid:

julia> "\xe2\x88\x80"
"∀"

But these “tricks” for writing valid characters as a sequence of invalid code units are inherently tied to a given encoding, which I suspect is what @stevengj means by a leaky abstraction. The benefit of allowing directly writing invalid strings in terms of individual bytes is that you can write arbitrary data that is mostly string-like, which is often quite useful. It’s much more useful with a UTF-8 based string type than with a UTF-16 based string type, since UTF-8 can work with individual bytes whereas the code unit of UTF-16 is a byte pair.
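
Relatedly, isvalid distinguishes the two cases (a quick REPL check):

julia> isvalid("\ud83d\ude3f")   # WTF-8: two surrogate code points
false

julia> isvalid("\xe2\x88\x80")   # well-formed UTF-8 for '∀'
true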

1 Like

Hello Stefan, and thanks for the reply.

Based on the output, I would disagree that Julia is “allowing” this. It seems to be merely passing it through as it does not know how to handle it (which is fine if that is what the designers of the Julia language want the behavior to be; I am not arguing about what should or shouldn’t be happening).

Also, thanks for confirming that Julia does not support using two \u escape sequences with surrogate code points to create supplementary characters.

I believe it would be more accurate to begin that statement with: “Since the programmers who wrote the parser for Julia did not handle the case of using two \u sequences to specify supplementary characters as surrogate pairs, that means…”

I think the fact that the surrogate pair escape sequence is also an encoding is causing confusion by making it hard to see the distinction that I am referring to.

Let’s take a step back. Prior to execution, source code is parsed for syntax, grammar, etc. Part of that parsing is unescaping any escape sequences into the intended characters in whatever encoding is used. As such, escape sequences, by their very nature, are not directly related to the final encoding, but indirectly related by way of the parser that translates \t into the byte sequence of \x09. And the parser is free to do whatever the language’s designers decided it should do. That \t equates to a tab in most languages is merely due to convention, and nothing more.

The \u sequence translates the BMP code point into UTF-8 because the parser is coded to do that. And a \U sequence, which is actually UTF-32 (equivalent to the full Unicode code point range: BMP + supplementary, certainly not UTF-8), only translates \U00012003 into the correct character because the parser is coded to do that (though not all languages have been coded to do this, and some use a slightly different syntax, such as \u{12003}). Also, \uD83D\uDE3F does produce the correct supplementary character in .NET, but not necessarily due to .NET using UTF-16, since if this truly were encoding-specific, then those sequences would be incorrect as they are code points / big-endian, and .NET is UTF-16 little-endian.
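
For what it’s worth, in Julia something like the following should confirm that \U yields the full code point:

julia> codepoint('\U12003')
0x00012003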

So, my point is that escape sequences are a feature of the language (the parser specifically), not a feature of Unicode or any particular encoding. Hence, it cannot be assumed which escape sequences are and are not supported by any given language. If Julia does not support creating supplementary characters via surrogate pairs specified using \u, then that’s fine. But, it could have been supported and that also would have been fine, so it was worth clarifying.

The endianness of the UTF-16 encoding only refers to the order in which the two bytes in each code unit appear, it does not affect the order of those code units: the high surrogate always comes first and the low surrogate always comes second (and because UTF-16 is the worst, the high surrogate is always encoded with a lower code unit than the low surrogate). So I think your surmise here is not quite right: this does work because .NET uses UTF-16 and lets you use individual surrogate code units encoded with \u to write arbitrary invalid UTF-16 data, which, in this case, when taken all together, happens to encode valid UTF-16. This is exactly analogous to what we do with UTF-8 in Julia.
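
To illustrate with Julia (a REPL sketch): transcode gives the UTF-16 code units, and their order is fixed regardless of how each UInt16 would later be serialized to bytes:

julia> transcode(UInt16, "\U1f63f")
2-element Array{UInt16,1}:
 0xd83d
 0xde3f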

The alternative behavior for Julia would be to look at a string literal like "\ud83d\ude3f" and note that this is a high surrogate followed by a low surrogate and treat it as if the user had written "\U1f63f". However, that feels too magical, less expressive, and inconsistent:

  • Too magical because it’s not what the user asked for—if they had wanted "\U1f63f" instead of two invalid code points, why didn’t they just write that?
  • Less expressive because it means that the user can no longer write WTF-8 data as a literal except by writing out the six bytes of the WTF-8 encoding, which is less convenient and less clear.
  • Less consistent because it means that "\ud83d" encodes one byte sequence—[0xed, 0xa0, 0xbd]—and "\ude3f" encodes another one—[0xed, 0xb8, 0xbf]—but "\ud83d\ude3f" would encode a totally unrelated byte sequence—[0xf0, 0x9f, 0x98, 0xbf].

This last point is really bad: if you put two complete string literals next to each other, the string they produce should be the concatenation of the strings each literal would produce independently.
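
Julia’s current behavior does satisfy that property (a quick REPL check; * is string concatenation):

julia> "\ud83d" * "\ude3f" == "\ud83d\ude3f"
true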

4 Likes

See Clifford Truesdell’s work with R. A. Toupin:
Truesdell, C., and Toupin, R. A. (1960). “The Classical Field Theories”, pp. 226-858 in S. Flügge (ed.): Handbuch der Physik, Vol. III/1, Springer Verlag, Berlin.

I enjoyed reading Truesdell’s thoughts on Rational Thermodynamics, and his critique of the Curie principle in irreversible thermodynamics. In the above-cited reference, I seem to recall they have 11 pages of nomenclature (or did they use 11 different alphabets?) – including the standard Latin alphabet, Gothic/Fraktur, Greek, Cyrillic (?), Hebrew, etc. Using all these alphabets made it rather difficult to read. I doubt using Old Hungarian runes would make it easier :wink: .

Hello Stefan (and @stevengj ). Thanks for that reply. I think, though, that you are misunderstanding me and that this discussion has gotten away from the original intent due to not taking into account the context of my statements. So, let me try to clarify:

I do not use Julia. It seems like a nice language with a good, involved community. But, it’s not something that I deal with. So, I have no desires for the Julia language, nor am I making any requests or even recommendations. I am not claiming that Julia should or shouldn’t do anything in particular or change anything.

Initially, since I had not yet seen the source code for the language or downloaded Julia, I had merely stated that it might be possible to produce a supplementary character via a UTF-16 surrogate pair expressed using two \u sequences (comment 11 above).

I was told that using two \u sequences to create a surrogate pair wasn’t possible (not just undesirable) due to Julia using UTF-8 internally.

From then on I have merely been trying to explain that escape sequences are not directly tied to any specific encoding. So, even if not desirable for the Julia language, it’s at least not impossible for such a construct to be supported. It’s merely a design choice, similar to how “\x” and octal escape sequences produce bytes while “\u” and “\U” sequences produce characters (here is that choice being made in the “unescape_string” function, plus line 448 for octal). While the underlying encoding might influence such choices, it does not constrain them technically. My point was (is) merely that, even if these escape sequences are specifically byte sequences, that’s only due to the intention of those who wrote the Julia parser and not necessitated by the underlying encoding. Hence, it was not absurd of me to suggest that it was at least possible that the surrogate pair construct via “\u” was supported in Julia. That is all.
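
To make the bytes-versus-characters choice concrete, Base’s unescape_string exhibits both behaviors (a REPL sketch):

julia> unescape_string("\\U12003")        # character escape: code point encoded as UTF-8
"𒀃"

julia> unescape_string("\\xe2\\x88\\x80") # byte escapes: raw bytes, here valid UTF-8
"∀"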

Which brings us to:

Yes. That is exactly what I had suggested might be possible. It would just require updates in places such as the following:

  • u8_read_escape_sequence in /src/support/utf8.c (lines 336 - 369)
  • unescape_string in /base/strings/io.jl (lines 409 - 465)

and maybe another place (utf8proc ?), and likely also the code that calls any of those functions since they only see one sequence at a time. Again, I’m not saying that Julia should do this, only that it could. The general consensus here being that this would be a poor choice for Julia is quite fine since I was not advocating for it in the first place.

Perhaps in the context of Julia, though I am not familiar with the history of the language. I do know that some languages, such as JavaScript and T-SQL, did not originally have the ability to specify a supplementary code point and so had to use either “\u…\u…” (for JavaScript) or “NCHAR(0x…) + NCHAR(0x…)” (for T-SQL) in order to create a supplementary character. In fact, the reason I started documenting Unicode Escape Sequences across various languages (see below) was due to the great variety in approaches and not being able to remember the various nuances between the languages.

No and yes. I mean, I wouldn’t exactly call the 4-byte sequence “unrelated”, since there is a simple mathematical formula to derive the code point from the surrogate pair ;-). And it has already been stated that surrogate code points have no meaning in UTF-8 and that ideally they wouldn’t even be encoded, so I’m not sure who would be displeased if "\ud83d\ude3f" were encoded as 0xf09f98bf, since that is meaningful, whereas "\ud83d" by itself can’t be made meaningful, so you might as well encode it as 0xeda0bd. But this distinction might still prove confusing for some. Either way, this is just discussion for the sake of discussion :upside_down_face: .
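
For completeness, that formula as a hypothetical one-liner (the name is mine, not from any library):

# Hypothetical inverse: recover the code point from a surrogate pair.
from_pair(hi, lo) = 0x10000 + (UInt32(hi - 0xD800) << 10) + (lo - 0xDC00)

from_pair(0xd83d, 0xde3f)  # -> 0x0001f63f, i.e. U+1F63F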


FWIW, with minimal poking around I was able to find, download, and start testing with Julia. I have added Julia to my list of Unicode Escape Sequences :smiley_cat:

Take care,
Solomon…

1 Like