1 + 'a' = 'b'

I really like this analogy (and I think it occurred to me the last time I came across this character business some time ago.)

Here’s another one: in geometry, you can subtract one Point from another Point, and get the difference as a Vector pointing from one point to the other. And you can add a Vector to a Point and get another Point. Adding two points, however, doesn’t automatically make sense.

And yet, Point and Vector normally have exactly the same encoding, so it’s not really the encoding that is causing this, but the interpretation of the quantities. You could change the coordinate system, and hence, the encoding, without invalidating the relationships.

(Vector here is different from Julia's Vector = Array{T, 1}.)
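A minimal sketch of that affine structure, with hypothetical Point and Vec types (Vec, to avoid clashing with Base's Vector):

struct Point
    x::Float64
    y::Float64
end

struct Vec
    x::Float64
    y::Float64
end

import Base: +, -

# Point - Point gives the displacement between them, as a Vec.
-(p::Point, q::Point) = Vec(p.x - q.x, p.y - q.y)

# Point + Vec translates the point. Point + Point is deliberately left undefined.
+(p::Point, v::Vec) = Point(p.x + v.x, p.y + v.y)

Both types are encoded as two Float64s, yet only the operations that make geometric sense are defined; Point(1, 2) + Point(3, 4) is a MethodError.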


The conversion to Int is semantically only necessary if you think of 'a' as something other than a number.

A code point, however, is a number. No conversion necessary.

That’s why the question of what a Char represents is so central. Does it represent the code point associated with a character? Or does it represent the letter we associate with a character?

For example, does 'a' represent the code point U+0061? Or does it represent the lower-case letter A?

My personal answer: Julia decides this based on context. In the context 'a' * 'b', the Chars are treated as two letters and concatenated to form a String. In the context 'a' + 1, the Char is treated as a code point which can be incremented like an integer.

The context-dependent behavior is convenient, but it also leads to lots of open questions and surprising interactions.
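For example, both contexts side by side in the REPL:

julia> 'a' * 'b'
"ab"

julia> 'a' + 1
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)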

I tried to relate this:
https://docs.julialang.org/en/v1.7/base/strings/#Core.Char
to your question, but I failed to produce a good answer.

Char can hold information which isn't a Unicode codepoint, so Char does not only represent a codepoint (that is why we have isvalid to check).
For your example I would say:

Char('a') represents the codepoint U+0061, which according to Unicode represents the glyph a.
^---- written this way to distinguish it from the character literal 'a'

I am not sure how all these differentiations of terms will help with the OP's problem of being surprised by the example above.

After all it’s a convenience to have

@inline +(x::T, y::Integer) where {T<:AbstractChar} = T(Int32(x) + Int32(y))
@inline +(x::Integer, y::AbstractChar) = y + x

and not a mathematical obligation to have it defined.
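The second method is exactly what makes the thread's title expression work:

julia> 1 + 'a'
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)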

Exactly. Sometimes more, sometimes less, and sometimes in between, like in this case.

I don't think this is correct. The non-assigned code points mentioned above can be represented by Char and also pass the isvalid test.

julia> Char(0x1FFFE) |> isvalid
true

Can you provide any example of a Char that is not “holding” a Unicode code point?

You see these used to represent byte sequences in a String that is malformed UTF-8, e.g. if you try to interpret random bytes as a string à la String(rand(UInt8, 20)). You still want to be able to look at the contents of such strings, e.g. to pass them along as cookies or to correct encoding errors.

julia> c = "\x94"[1]
'\x94': Malformed UTF-8 (category Ma: Malformed, bad data)

julia> isvalid(c)
false

julia> c + 1
ERROR: Base.InvalidCharError{Char}('\x94')

(Weirdly, I’m not sure how to easily construct such characters except via strings, e.g. '\x94' gives an error. I guess you could reinterpret raw bytes.)
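For what it's worth, reinterpreting raw bytes does work, relying on the implementation detail (not a documented guarantee) that Char stores the UTF-8 bytes left-aligned in a UInt32:

julia> reinterpret(Char, 0x94000000)
'\x94': Malformed UTF-8 (category Ma: Malformed, bad data)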


I don’t think this is necessary. You could encode Char as a Float64, and then define +(c::Char, n::Int) = Char(nextfloat(Float64(c), n)), and something similar for finding the number of eps’es between floats to do character subtraction.

Inconvenient, of course. But, the point is that +(::Char, ::Int) has meaning that is separate from the encoding of Char (like the coordinate system independence of vectors and points that I brought up.)

I’m very confused now. Wouldn’t your hypothetical nextfloat operation give a totally different result (from the current implementation)?

Yes, this requires/imagines a different encoding of Char. One where, for example,

'a': 97.0
'b': 97.00000000000001
'c': 97.00000000000003
'd': 97.00000000000004
'e': 97.00000000000006
 ⋮
'z': 97.00000000000036

I’m sure it would be horribly inconvenient, but it’s a way to demonstrate how +(::Char, ::Int) could make sense, independent of the encoding.
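A toy sketch of that encoding (FloatChar is a made-up name, not anything in Base):

struct FloatChar
    x::Float64   # 'a' is 97.0, 'b' is nextfloat(97.0), and so on
end

import Base: +, -

# Stepping n characters forward means stepping n floats forward.
+(c::FloatChar, n::Int) = FloatChar(nextfloat(c.x, n))

# Subtraction counts the eps-steps between the two encodings; for
# positive Float64s the bit patterns are ordered like the values,
# so the difference of the bit patterns is the number of steps.
-(a::FloatChar, b::FloatChar) = reinterpret(Int64, a.x) - reinterpret(Int64, b.x)

With this, FloatChar(97.0) + 1 lands on nextfloat(97.0), i.e. 'b' in this encoding, and + still means "the character n steps later" regardless of the bits underneath.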

To me the biggest surprise in this thread is that the behavior surprises anyone, XD. Of all the languages I have programmed in, I think only Haskell actually requires explicit conversion back and forth between integers and characters. Not that it wouldn't be a good idea to get rid of it for Julia 2.0, but it is an incredibly common convenience in programming languages.


It is already independent of the encoding: +(::Char, ::Int) is defined in terms of Unicode code points, not the UTF-8 encoding of those code points.

You can verify this by trying to increment an invalid char:

julia> c = "\x94"[1]
'\x94': Malformed UTF-8 (category Ma: Malformed, bad data)

julia> c + 1
ERROR: Base.InvalidCharError{Char}('\x94')

Since an invalid char doesn’t correspond to a code point, you can’t increment it.
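Conversely, a valid character whose UTF-8 encoding spans multiple bytes still increments by exactly one code point:

julia> 'ü' + 1   # U+00FC + 1, even though 'ü' is two bytes in UTF-8
'ý': Unicode U+00FD (category Ll: Letter, lowercase)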


Being able to define an operation algebraically is a minimum standard. But in math there are good and bad definitions, and likewise in programming. JavaScript's "a" + 1 == "a1" can also be viewed algebraically, but I think it's a mistake for a general-purpose programming language to implement it.


Noncharacters are still code points:

Code point: Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters.

That being said, Char can represent larger values and still do arithmetic with them, so it effectively supports a superset of the Unicode codespace (in addition to invalid UTF-8 encodings):

julia> Char(0x10FFFF) + 1
'\U110000': Unicode U+110000 (category In: Invalid, too high)

I should have said “makes sense”, instead of “could make sense” :slightly_smiling_face:

Am I the only one here who never uses Char?


Really? I definitely prefer them over length-1 Strings, which just seem wrong to me.

Does that mean you would write

replace("Hello", "l"=>"k")

?


Yes


Hm. It seems wrong to me; it's replacing characters, not strings, imho. It's also slower.
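For reference, the Char-pair form works directly:

julia> replace("Hello", 'l' => 'k')
"Hekko"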

And we’ve never had a thread about this error

julia> 1 + "a"
ERROR: MethodError: no method matching +(::Int64, ::String)
Closest candidates are:
  +(::Any, ::Any, ::Any, ::Any...) at C:\Users\beasont\.julia\juliaup\julia-1.7.2+0~x64\share\julia\base\operators.jl:655
  +(::T, ::T) where T<:Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8} at C:\Users\beasont\.julia\juliaup\julia-1.7.2+0~x64\share\julia\base\int.jl:87
  +(::Integer, ::Ptr) at C:\Users\beasont\.julia\juliaup\julia-1.7.2+0~x64\share\julia\base\pointer.jl:161
  ...
Stacktrace:
 [1] top-level scope
   @ REPL[1]:1

(although we have had threads about +(String, String))


To me, this is almost as bad as saying “scalars should be the same as 1x1 matrices” :grin:

I guess it’s less disruptive in practice, though.


This is a very interesting discussion. I especially like the mathematical explanation by @cjdoris, considering it as a group action. For me, the main concern seems to be that the group action is denoted by just an integer. For instance, in the example

Date(2022, 3, 16) + Day(1)

it is clear that Day(1) denotes a time interval/shift. In this interpretation

'a' + 1

should be understood and maybe explicitly written as

'a' + NextCharacterInTermsOfUnicodeCodepoints(1)

I am open to suggestions for any other shorter, yet descriptive, type name :wink:
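A minimal sketch of such a wrapper, with CodepointShift as a stand-in for whatever name wins:

struct CodepointShift
    n::Int
end

import Base: +

# Delegate to the existing code point arithmetic on AbstractChar.
+(c::AbstractChar, s::CodepointShift) = c + s.n
+(s::CodepointShift, c::AbstractChar) = c + s

With this, 'a' + CodepointShift(1) == 'b', and the bare 'a' + 1 could in principle be deprecated in its favor.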
