I really like this analogy (and I think it occurred to me the last time I came across this character business, some time ago).
Here’s another one: in geometry, you can subtract one Point from another Point, and get the difference as a Vector pointing from one point to the other. And you can add a Vector to a Point and get another Point. Adding two points, however, doesn’t automatically make sense.
And yet, Point and Vector normally have exactly the same encoding, so it’s not really the encoding that is causing this, but the interpretation of the quantities. You could change the coordinate system, and hence, the encoding, without invalidating the relationships.
(Vector here means a geometric vector, not Julia's Vector{T} = Array{T, 1}.)
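A minimal sketch of that analogy in Julia (Point2, Vec2, and these methods are hypothetical, purely for illustration):

struct Point2
    x::Float64
    y::Float64
end

struct Vec2
    x::Float64
    y::Float64
end

Base.:-(a::Point2, b::Point2) = Vec2(a.x - b.x, a.y - b.y)      # Point - Point gives a Vector
Base.:+(p::Point2, v::Vec2) = Point2(p.x + v.x, p.y + v.y)      # Point + Vector gives a Point
# Point + Point is deliberately left undefined: it doesn't automatically make sense.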
The conversion to Int is semantically only necessary if you think of 'a' as something other than a number.
A code point, however, is a number. No conversion necessary.
That’s why the question of what a Char represents is so central. Does it represent the code point associated with a character? Or does it represent the letter we associate with a character?
For example, does 'a' represent the code point U+0061? Or does it represent the lower-case letter A?
My personal answer: Julia decides this based on context. In the context 'a' * 'b', the Chars are treated as two letters and concatenated to form a String. In the context 'a' + 1, the Char is treated as a code point which can be incremented like an integer.
The context-dependent behavior is convenient, but it also leads to lots of open questions and surprising interactions.
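For example (REPL output from a recent Julia version):

julia> 'a' * 'b'    # treated as letters: concatenated into a String
"ab"

julia> 'a' + 1      # treated as a code point: incremented like an integer
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)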
Char can hold information which isn't a Unicode codepoint. Therefore Char does not only represent a codepoint (which is why we have isvalid to check).
For your example I would say:
Char('a') represents the codepoint U+0061 which according to Unicode represents the glyph a.
^---- this notation to distinguish it from the character literal 'a'
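For instance, the two views are connected in both directions (REPL output from a recent Julia version):

julia> Char(0x61)        # construct the character from the code point U+0061
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> codepoint('a')    # and recover the code point from the character
0x00000061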
I am not sure how all these differentiations of terms will help with the OP's problem of being surprised by
After all, it's a convenience to have
@inline +(x::T, y::Integer) where {T<:AbstractChar} = T(Int32(x) + Int32(y))
@inline +(x::Integer, y::AbstractChar) = y + x
and not a mathematical obligation to have it defined.
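As a quick illustration, the second method is what lets the integer come first as well:

julia> 1 + 'a'
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)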
Exactly. Sometimes more, sometimes less, and sometimes in between, like in this case.
You see these representing byte sequences in a String that is malformed UTF-8, e.g. if you try to interpret random bytes as a string à la String(rand(UInt8, 20)). You still want to be able to look at the contents of such strings, e.g. to pass them along as cookies or to correct encoding errors.
julia> c = "\x94"[1]
'\x94': Malformed UTF-8 (category Ma: Malformed, bad data)
julia> isvalid(c)
false
julia> c + 1
ERROR: Base.InvalidCharError{Char}('\x94')
(Weirdly, I’m not sure how to easily construct such characters except via strings, e.g. '\x94' gives an error. I guess you could reinterpret raw bytes.)
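Indeed, reinterpreting the raw bits seems to work, assuming Julia's internal Char layout (the UTF-8 bytes stored left-justified in a UInt32):

julia> reinterpret(Char, 0x94000000)    # the lone byte 0x94 in the leading position
'\x94': Malformed UTF-8 (category Ma: Malformed, bad data)

julia> reinterpret(Char, 0x94000000) == "\x94"[1]
true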
I don’t think this is necessary. You could encode Char as a Float64, and then define +(c::Char, n::Int) = Char(nextfloat(Float64(c), n)), and something similar for finding the number of eps’es between floats to do character subtraction.
Inconvenient, of course. But, the point is that +(::Char, ::Int) has meaning that is separate from the encoding of Char (like the coordinate system independence of vectors and points that I brought up.)
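A hedged sketch of that thought experiment (the FChar name and the particular encoding, code point k stored as the k-th Float64 above 0.0, are my own, purely for illustration):

struct FChar
    x::Float64               # code point k encoded as nextfloat(0.0, k)
end

FChar(c::Char) = FChar(nextfloat(0.0, Int(codepoint(c))))

# "advance by n characters" becomes n nextfloat steps
Base.:+(c::FChar, n::Integer) = FChar(nextfloat(c.x, n))

# character subtraction counts the steps between the two encodings
Base.:-(a::FChar, b::FChar) = Int(a.x / nextfloat(0.0)) - Int(b.x / nextfloat(0.0))

The point stands either way: the meaning of + is attached to the abstraction, not to the particular bit pattern.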
To me the biggest surprise in this thread is that the behavior surprises anyone, XD. Of all the languages I have programmed in, I think only Haskell actually requires explicit conversion back and forth between integers and characters. Not that it isn't a good idea to get rid of it for Julia 2.0, but it is an incredibly common convenience in programming languages.
Being able to define an operation algebraically is a minimum standard. But in math there are good and bad definitions, and likewise in programming. JavaScript's "a" + 1 == "a1" can also be viewed algebraically, but I think it's a mistake for a general-purpose programming language to implement it.
Code point: Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters.
That being said, Char can represent larger values and still do arithmetic with them, so it effectively supports a superset of the Unicode codespace (in addition to invalid UTF-8 encodings):
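(A quick check; the exact REPL printing may vary across Julia versions.)

julia> c = Char(0x110000)    # one past the last Unicode code point U+10FFFF
'\U110000': Unicode U+110000 (category In: Invalid, too high)

julia> isvalid(c)
false

julia> c + 1
'\U110001': Unicode U+110001 (category In: Invalid, too high)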
This is a very interesting discussion. I especially like the mathematical explanation by @cjdoris, treating it as a group action. For me the main concern is that the group action is denoted by just an integer. For instance, in the example
Date(2022, 3, 16) + Day(1)
it is clear that Day(1) denotes a time interval/shift. In this interpretation
'a' + 1
should be understood and maybe explicitly written as
'a' + NextCharacterInTermsOfUnicodeCodepoints(1)
(I am open to suggestions for any other, shorter yet still descriptive, type name.)
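A minimal sketch of that suggestion (the name CodepointShift and these methods are hypothetical, purely for illustration):

struct CodepointShift
    n::Int
end

Base.:+(c::AbstractChar, s::CodepointShift) = Char(codepoint(c) + s.n)
Base.:+(s::CodepointShift, c::AbstractChar) = c + s

# usage: 'a' + CodepointShift(1) gives 'b', making the "shift by n code points"
# interpretation explicit, analogous to Day(1) for dates.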