I really like this analogy (and I think it occurred to me the last time I came across this character business, some time ago).
Here’s another one: in geometry, you can subtract one Point from another Point, and get the difference as a Vector pointing from one point to the other. And you can add a Vector to a Point and get another Point. Adding two points, however, doesn’t automatically make sense.
And yet, Point and Vector normally have exactly the same encoding, so it’s not really the encoding that is causing this, but the interpretation of the quantities. You could change the coordinate system, and hence, the encoding, without invalidating the relationships.
(Vector here means a geometric vector, not Julia's Vector{T} = Array{T, 1}.)
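A minimal sketch of that analogy in Julia (Point2, Vec2, and these methods are hypothetical, purely for illustration):

struct Point2
    x::Float64
    y::Float64
end

struct Vec2
    x::Float64
    y::Float64
end

Base.:-(a::Point2, b::Point2) = Vec2(a.x - b.x, a.y - b.y)      # Point - Point gives a Vector
Base.:+(p::Point2, v::Vec2) = Point2(p.x + v.x, p.y + v.y)      # Point + Vector gives a Point
# Point + Point is deliberately left undefined: it doesn't automatically make sense.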
The conversion to Int is semantically only necessary if you think of 'a' as something other than a number.
A code point, however, is a number. No conversion necessary.
That’s why the question of what a Char represents is so central. Does it represent the code point associated with a character? Or does it represent the letter we associate with a character?
For example, does 'a' represent the code point U+0061? Or does it represent the lower-case letter A?
My personal answer: Julia decides this based on context. In the context 'a' * 'b', the Chars are treated as two letters and concatenated to form a String. In the context 'a' + 1, the Char is treated as a code point which can be incremented like an integer.
The context-dependent behavior is convenient, but it also leads to lots of open questions and surprising interactions.
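For example (REPL output from a recent Julia version):

julia> 'a' * 'b'    # treated as letters: concatenated into a String
"ab"

julia> 'a' + 1      # treated as a code point: incremented like an integer
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)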
Char can hold information which isn't a Unicode codepoint. Therefore Char does not only represent a codepoint (which is why we have isvalid to check).
For your example I would say:
Char('a') represents the codepoint U+0061 which according to Unicode represents the glyph a.
^---- this notation to distinguish it from the character literal 'a'
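For instance, the two views are connected in both directions (REPL output from a recent Julia version):

julia> Char(0x61)        # construct the character from the code point U+0061
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> codepoint('a')    # and recover the code point from the character
0x00000061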
I am not sure how all these differentiations of terms will help with the OP's problem of being surprised by
After all, it's a convenience to have
@inline +(x::T, y::Integer) where {T<:AbstractChar} = T(Int32(x) + Int32(y))
@inline +(x::Integer, y::AbstractChar) = y + x
and not a mathematical obligation to have it defined.
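As a quick illustration, the second method is what lets the integer come first as well:

julia> 1 + 'a'
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)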
Exactly. Sometimes more, sometimes less, and sometimes in between, like in this case.
You see these representing byte sequences in a String that is malformed UTF-8, e.g. if you try to interpret random bytes as a string à la String(rand(UInt8, 20)). You still want to be able to look at the contents of such strings, e.g. to pass them along as cookies or to correct encoding errors.
julia> c = "\x94"[1]
'\x94': Malformed UTF-8 (category Ma: Malformed, bad data)
julia> isvalid(c)
false
julia> c + 1
ERROR: Base.InvalidCharError{Char}('\x94')
(Weirdly, I’m not sure how to easily construct such characters except via strings, e.g. '\x94' gives an error. I guess you could reinterpret raw bytes.)
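Indeed, reinterpreting the raw bits seems to work, assuming Julia's internal Char layout (the UTF-8 bytes stored left-justified in a UInt32):

julia> reinterpret(Char, 0x94000000)    # the lone byte 0x94 in the leading position
'\x94': Malformed UTF-8 (category Ma: Malformed, bad data)

julia> reinterpret(Char, 0x94000000) == "\x94"[1]
true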
I don’t think this is necessary. You could encode Char as a Float64, and then define +(c::Char, n::Int) = Char(nextfloat(Float64(c), n)), and something similar for finding the number of eps’es between floats to do character subtraction.
Inconvenient, of course. But, the point is that +(::Char, ::Int) has meaning that is separate from the encoding of Char (like the coordinate system independence of vectors and points that I brought up.)
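A hedged sketch of that thought experiment (the FChar name and the particular encoding, code point k stored as the k-th Float64 above 0.0, are my own, purely for illustration):

struct FChar
    x::Float64               # code point k encoded as nextfloat(0.0, k)
end

FChar(c::Char) = FChar(nextfloat(0.0, Int(codepoint(c))))

# "advance by n characters" becomes n nextfloat steps
Base.:+(c::FChar, n::Integer) = FChar(nextfloat(c.x, n))

# character subtraction counts the steps between the two encodings
Base.:-(a::FChar, b::FChar) = Int(a.x / nextfloat(0.0)) - Int(b.x / nextfloat(0.0))

The point stands either way: the meaning of + is attached to the abstraction, not to the particular bit pattern.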
To me the biggest surprise in this thread is that the behavior surprises anyone, XD. Of all the languages I have programmed in, I think only Haskell actually requires explicit conversion back and forth between integers and characters. Not that it isn't a good idea to get rid of it for Julia 2.0, but it is an incredibly common convenience in programming languages.
Being able to define an operation algebraically is a minimum standard. But in math there are good and bad definitions, and likewise in programming. JavaScript's "a" + 1 == "a1" can also be viewed algebraically, but I think it's a mistake for a general-purpose programming language to implement it.
Code point: Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters.
That being said, Char can represent larger values and still do arithmetic with them, so it effectively supports a superset of the Unicode codespace (in addition to invalid UTF-8 encodings):
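(A quick check; the exact REPL printing may vary across Julia versions.)

julia> c = Char(0x110000)    # one past the last Unicode code point U+10FFFF
'\U110000': Unicode U+110000 (category In: Invalid, too high)

julia> isvalid(c)
false

julia> c + 1
'\U110001': Unicode U+110001 (category In: Invalid, too high)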
This is a very interesting discussion. I especially like the mathematical explanation by @cjdoris, treating it as a group action. For me the main concern is that the group action is denoted by just an integer. For instance, in the example
Date(2022, 3, 16) + Day(1)
it is clear that Day(1) denotes a time interval/shift. In this interpretation
'a' + 1
should be understood and maybe explicitly written as
'a' + NextCharacterInTermsOfUnicodeCodepoints(1)
(I am open to suggestions for any other, shorter yet still descriptive, type name.)
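A minimal sketch of that suggestion (the name CodepointShift and these methods are hypothetical, purely for illustration):

struct CodepointShift
    n::Int
end

Base.:+(c::AbstractChar, s::CodepointShift) = Char(codepoint(c) + s.n)
Base.:+(s::CodepointShift, c::AbstractChar) = c + s

# usage: 'a' + CodepointShift(1) gives 'b', making the "shift by n code points"
# interpretation explicit, analogous to Day(1) for dates.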