Tab completion for umlauts, and some musings about unicode characters


#1

I’ve found umlauts (ä, ö, ü, etc.) to be missing from the tab completions in the REPL and the associated editor plugins. Other accented Latin characters are available, e.g. \aa => å or \o => ø. The need for umlauts is mostly so that I can type out “Schrödinger equation” on devices where I can’t customize my keyboard layout easily (my iPad). One might want to extend this to other diaeresis characters beyond umlauts, e.g. ë (trema), or other accented Latin characters that might be missing.

My natural inclination would be to use \"a => ä, as this is what actual LaTeX uses. I’ve tried to do this in PR #26181, but it seems that mappings containing a quote just don’t work in the REPL. So my main question would be why this is, i.e., whether the REPL completion could be implemented in such a way that \"a<tab> would work. Alternatively, what other tab completions would be suitable for umlauts and other accented Latin characters?

@ScottPJones reminded me that \ddot is available, but there are a few problems with using that:

  • In LaTeX, \ddot{a} is math-mode only, and it is semantically different from \"a. The former would likely be used for a second derivative, while the latter is an umlaut.
  • Typing a\ddot<tab> in Julia produces the unicode character a, followed by the unicode character “combining diaeresis”. This is different from the NFC-normalized single unicode-character “a with diaeresis” that would be preferred for an umlaut. My first reaction to this was that the REPL tab completions should always produce NFC normalized characters, but then I realized that this would be difficult to implement, and more importantly, the semantic difference of umlauts vs “math accent” is quite useful. While generally I think of mixed unicode normalizations as a problem, I could actually imagine having both of these in a Julia source code file!
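To make the difference between the two forms concrete, here is a minimal sketch using the `Unicode` standard library (assuming Julia 0.7 or later, where `Unicode.normalize` lives in that stdlib). The variable names are just for illustration.

```julia
using Unicode  # standard library; provides Unicode.normalize

# 'a' followed by U+0308 COMBINING DIAERESIS, i.e. what a\ddot<tab> produces
decomposed = "a\u0308"
# U+00E4 LATIN SMALL LETTER A WITH DIAERESIS, i.e. the NFC form
precomposed = "\u00e4"

println(length(decomposed))    # 2 (two code points)
println(length(precomposed))   # 1 (one code point)

# Strings are compared as-is, without normalization
println(decomposed == precomposed)                            # false

# Normalizing converts between the two forms
println(Unicode.normalize(decomposed, :NFC) == precomposed)   # true
println(Unicode.normalize(precomposed, :NFD) == decomposed)   # true
```

Both strings render the same in most fonts, which is exactly why mixing them by accident is hard to spot.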

I think one way to deal with problems of unicode normalization, or hard-to-distinguish unicode characters (mu vs micro) would be to have a defined set of “known” or “canonical” unicode characters in Julia. These would be all those that have tab completions in the REPL. For the specific case of umlaut/trema vs double-dot, both unicode forms should be available to reflect the semantic difference.

There should be a linter, and maybe even an optional compiler warning when processing source code files that complains if there are any unicode characters in the source code (including in strings) that are not in Julia’s list of canonical characters. I’ve certainly had this problem with phi, where \phi becomes ϕ (‘phi symbol’), whereas I’ve accidentally inserted the almost identical φ (‘greek small letter phi’) that I have available directly on my keyboard. In such a case, ‘phi symbol’ should be the canonical character, and I’d like to be alerted about the presence of the “unknown” ‘greek small letter phi’.

Is this something people would consider as sensible?


#2

As far as I know, though, there are no separate Unicode characters for the distinction that you are making between \ddot{a} vs. \"{a}, nor is there a distinction in the Unicode tables for LaTeX entities:

      <character id="U000C4" dec="196" mode="mixed" type="other">
         <unicodedata category="Lu" combclass="0" bidi="L" decomp="0041 0308" mirror="N" unicode1="LATIN CAPITAL LETTER A DIAERESIS" lower="00E4"/>
         <latex>\"{A}</latex>
         <mathlatex>\ddot{A}</mathlatex>
         <Wolfram>CapitalADoubleDot</Wolfram>
         <entity id="Auml" set="xhtml1-lat1" optional-semi="yes">
            <desc>latin capital letter A with diaeresis</desc>
         </entity>
         <entity id="Auml" set="8879-isolat1">
            <desc>=capital A, dieresis or umlaut mark</desc>
         </entity>
         <entity id="Auml" set="9573-2003-isolat1">
            <desc>=capital A, dieresis or umlaut mark</desc>
         </entity>
         <description unicode="1.1">LATIN CAPITAL LETTER A WITH DIAERESIS</description>
      </character>

Remember, the “LaTeX” shortcuts are mostly a convenience for the REPL and editors (I also use them in my StringLiterals package, which in turn uses the LaTeX_Entities package for “Julian” LaTeX entities support).
For my package, I allowed in the ddot{A} form as a shortcut (and others like grave{A}, hat{A}, etc.), which seemed like it would be easy enough to use (and remember).

I wouldn’t be too concerned about exactly what characters the shortcuts produce for Julia, as long as the different variants can be produced and the shortcuts are easy enough to recall.

I like the idea of a lint mode (or editor command) to quickly find all of the embedded Unicode characters in a program :slight_smile:


#3

Most operating systems already have easy ways to type Latin accents, often even without switching to a different language’s keyboard layout. It’s a good idea to learn how to do this in your OS, in addition to anything REPL-specific, in order to type words and names from non-English Latin-script languages.

On the iPad and iPhone on-screen keyboard, for example, if you simply hold down on the “o” key, you will see a pop-up menu of accents. In MacOS, you can type “ö” by “option-u o”. In Linux, with an appropriate desktop setting, you can type compose-" o. In Windows, you type alt-0246 (ugh!), but there are add-on packages to make it more Linux-like.

Note that ö typed by the above produces U+00F6, whereas o followed by tab-completion of \ddot in the REPL produces the o character followed by the U+0308 combining character. However, this combination is canonically equivalent to U+00F6 according to the Unicode standard, and in fact is treated as equivalent in Julia’s parser (because identifiers are “normalized”).
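A quick way to check the identifier normalization is to parse both spellings and compare the results. This is only a sketch, assuming Julia 0.7 or later (where `Meta.parse` is available); the variable names are illustrative.

```julia
# 'o' followed by U+0308 COMBINING DIAERESIS, as produced by o\ddot<tab>
decomposed = "o\u0308"
# U+00F6 LATIN SMALL LETTER O WITH DIAERESIS, as produced by direct input
precomposed = "\u00f6"

# As plain strings, the two spellings are different
println(decomposed == precomposed)   # false

# Parsed as identifiers, both become the same symbol
sym_decomposed = Meta.parse(decomposed)
sym_precomposed = Meta.parse(precomposed)
println(sym_decomposed == sym_precomposed)   # true
```

So a variable defined with one spelling can be referred to with the other, as long as both go through the parser.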


#4

Yeah, you’re right. In most cases, I do use direct input. The one situation where that’s still inconvenient is in iVim or Blink shell on iOS (with a Magic Keyboard), where I’d have to rely on vim digraphs to enter unicode. Since I’ve never really been able to get digraphs into muscle memory, I really like to use the latex-to-unicode plugin. That’s actually my real motivation for having a better way to type umlauts, but I think I went a bit overboard. This really doesn’t have to be in Julia; I’d just have to extend the vim plugin for personal use, which is very easy to do.

On further reflection, I also realize that it wouldn’t really make sense for Julia to warn about all unicode symbols that don’t have latex-abbreviations. People might very well have Chinese or Cyrillic letters in their strings, and those are always going to be entered directly. That being said, there would still be room for Julia to spit out warnings for things that are very likely errors. But it would have to accept all “known” symbols as well as all non-ambiguous letters from standard languages (i.e., most of unicode). So it’s probably better to have a blacklist for symbols that create warnings, which would include the “wrong” phi’s or mu’s.
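As a rough illustration of the blacklist idea, here is a minimal sketch of such a check. The table `CONFUSABLES` and the function `lint_confusables` are hypothetical names made up for this example, not anything that exists in Julia, and the two entries are just the phi and mu cases mentioned in this thread.

```julia
# Hypothetical blacklist: confusable character => suggested replacement
const CONFUSABLES = Dict(
    '\u03c6' => '\u03d5',  # GREEK SMALL LETTER PHI -> GREEK PHI SYMBOL (what \phi<tab> gives)
    '\u00b5' => '\u03bc',  # MICRO SIGN -> GREEK SMALL LETTER MU (what \mu<tab> gives)
)

# Return a (line number, found character, suggested character) tuple
# for every blacklisted character in the source text.
function lint_confusables(src::AbstractString)
    findings = Tuple{Int,Char,Char}[]
    for (lineno, line) in enumerate(split(src, '\n'))
        for c in line
            if haskey(CONFUSABLES, c)
                push!(findings, (lineno, c, CONFUSABLES[c]))
            end
        end
    end
    return findings
end

# Example input: line 1 uses U+03C6, line 3 uses U+00B5
src = "\u03c6 = 1.0\nx = 2\n\u00b5 = 3"
for (lineno, found, suggested) in lint_confusables(src)
    code = string(UInt32(found), base=16, pad=4)
    println("line $lineno: found '$found' (U+$code), consider '$suggested'")
end
```

Everything not in the table passes silently, so Chinese or Cyrillic text in strings would not trigger anything.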


#5

Yes, you’re right in that the unicode standard considers the two variants equivalent. And, for identifiers, Julia uses normalization (as @stevengj pointed out). Text editors will generally not normalize, though (which can lead to problems for search and replace), nor will normalization happen within strings.

I’m definitely doing something unsanctioned by assigning a different semantics to \ddot{u} (second derivative) vs u-umlaut, but in this case, it seems quite compelling :wink:

In any case, the current implementation does the right thing: \ddot does the combining diaeresis, and umlauts are best typed with direct input (and I can set up my editor to do that in a convenient way, without requiring that Julia adopt that way too).


#6

In Emacs, I use

(add-to-list 'write-file-functions ...)

to delete trailing whitespace and normalize Unicode (ucs-normalize-...) every time I save a file. I imagine vim has something equivalent.


#7

Sorry, I have a fever, so maybe this was already mentioned; I don’t have the energy to read the whole thread.

You could do for example this:

Base.REPLCompletions.latex_symbols["\\..a"] = "ä"

and then \..a<tab> works for you!

But this doesn’t work: Base.REPLCompletions.latex_symbols["\\\"a"] = "ä". I suppose that " is parsed specially.

You could add similar lines into your juliarc.jl (be aware it was renamed to startup.jl in master!).
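For the startup file, something like the following sketch might do. It assumes a Julia version where the completion table has moved to the `REPL` standard library (`REPL.REPLCompletions.latex_symbols`, as in Julia 1.x) and where the file lives at `~/.julia/config/startup.jl`; on older versions the `Base.REPLCompletions` path from above would be the one to use. The `\..` prefix is just my own choice of shortcut.

```julia
# Sketch for ~/.julia/config/startup.jl (assumes the Julia 1.x layout)
using REPL

# Register \..a, \..o, \..u (and capitals) as completions for the umlauts
for (letter, umlaut) in ("a" => "ä", "o" => "ö", "u" => "ü",
                         "A" => "Ä", "O" => "Ö", "U" => "Ü")
    REPL.REPLCompletions.latex_symbols["\\..$letter"] = umlaut
end
```

The entries only exist for sessions that load the startup file, so they do not affect anyone else's setup.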