Yes, I fully agree.
You can call it encoding if you like. It does not change the following:
When we do 'a' + 1, we are manipulating the encoding, not the letter itself, because it does not make sense to increment a letter. It does make sense to increment an encoding.
In that perspective, you can never manipulate any object in and of itself, only its encoding. Then everything is a ‘pointer’. I don’t see how this is useful.
Anyway, my point was that ‘pointer’ and ‘encoding’ are not the same, and not really useful to conflate.
Well, ‘b’ comes after ‘a’, so why not?
BTW: I’m not arguing that 'a' + 1 ought to work; I’m not sure what I think about that.
Because this logic extends well beyond the alphabet. After '9' comes ':':
julia> '9'+1
':': ASCII/Unicode U+003A (category Po: Punctuation, other)
This cannot be explained without reference to encoding.
I have to look at the encoding to understand why I get this specific result.
I disagree. When I do
julia> "hello" * " world"
"hello world"
I am not operating on the level of the encoding of the strings. It’s completely hidden from me. I am purely operating with the values which are encoded, namely the contents of the two strings.
The implementation details of the encoding do not determine the result here.
If someone wants this to change in Julia 2.0 they should open an issue about it on GitHub.
This is consistent with the lexicographic order of characters, independent of their encoding.
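This is easy to check in a session; a minimal sketch (cmp is Base Julia’s three-way string comparison):

```julia
# Char comparison follows code-point order...
@assert 'a' < 'b' < 'c'

# ...and String comparison is lexicographic over those characters:
@assert cmp("abc", "abd") == -1   # "abc" sorts before "abd"
@assert "apple" < "banana"
```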
Anyway, as I said, I don’t care too much about whether Char + Int should or shouldn’t work; my point is that characters aren’t pointers, and aren’t like pointers.
I was addressing your conflation of ‘encoding’ and ‘pointer’. This is getting really far afield.
Sure it can - a Char is not a number, so what did you expect to happen instead? Having '10'? The choice of “next character” is arbitrary, and + 1 (through implicit casting) happens to be the choice of syntax for “give me the next character”. The choice of what that next character is doesn’t have to make any semantic sense, as there is no such thing in general.
I think having 'a' + 1 work makes sense. When I type 'a' (or any Char) at the REPL and hit enter, it displays a number, e.g., U+0061 in

julia> 'a'
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

Once I see that, it makes sense that adding an integer to a Char will change that number by the corresponding amount, e.g., U+0161 is the result of adding 0x100 to 'a':

julia> 'a' + 0x100
'š': Unicode U+0161 (category Ll: Letter, lowercase)
I think that might be @josuagrw’s point: The fact that '9' + 1 results in ':' is because 1 is added to the ASCII/Unicode encoding of '9'.
This has nothing to do with conversion between integers and characters. If 'a'+1 were promoting both to an Int, the result would be an Int, and if it were promoting to Char, then it would fail because +(::Char,::Char) is not defined.
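Both halves of that argument can be verified at the REPL; a minimal sketch:

```julia
# 'a' + 1 yields a Char, so no promotion to Int takes place:
@assert ('a' + 1) isa Char
@assert 'a' + 1 == 'b'

# ...and +(::Char, ::Char) really has no method:
threw = try
    'a' + 'b'
    false
catch e
    e isa MethodError
end
@assert threw
```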
A Char represents a Unicode character. Unicode characters form an ordered sequence (defined by the Unicode spec), part of which is ..., 'a', 'b', 'c', ..., and so 'a'+1 gets the next item in that sequence. This is completely analogous to:
- Pointers: these form a sequence ..., Ptr{Cvoid}(3), Ptr{Cvoid}(4), ..., and so Ptr{Cvoid}(3)+1 gets the next item in the sequence.
- Integers: these form a sequence ..., 5, 6, ..., and so 5+1 gets the next item in the sequence.
These types being an ordered sequence has other useful semantics, like ordering ('a'<'b', Ptr{Cvoid}(3)<Ptr{Cvoid}(4) and 5<6 are all true) and differencing ('b'-'a', Ptr{Cvoid}(4)-Ptr{Cvoid}(3) and 6-5 are all 1).
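All three analogies are directly checkable in a session; a minimal sketch (the pointers are just constructed addresses, never dereferenced):

```julia
# Ordering holds for all three sequence types:
@assert 'a' < 'b'
@assert Ptr{Cvoid}(3) < Ptr{Cvoid}(4)
@assert 5 < 6

# Differencing gives 1 in each case:
@assert 'b' - 'a' == 1
@assert Ptr{Cvoid}(4) - Ptr{Cvoid}(3) == 1
@assert 6 - 5 == 1

# And +1 advances to the next item in the sequence:
@assert 'a' + 1 == 'b'
@assert Ptr{Cvoid}(3) + 1 == Ptr{Cvoid}(4)
```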
That is exactly my point, thank you for translating.
For the mathematically inclined, +(::Char,::Int) forms a group action of the (additive) group of integers on the set of characters. Similarly, +(::Ptr{T},::Int) is a group action of integers on pointers, and +(::Int,::Int) is the usual group action of integers on themselves, namely the (additive) group operation on integers. So this is all consistent and legit from a mathematical point of view.
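The group-action laws (identity and compatibility) are directly checkable; a minimal sketch:

```julia
c = 'a'

# Identity: acting with 0 leaves the character unchanged.
@assert c + 0 == c

# Compatibility: acting with m and then n equals acting with m + n.
@assert (c + 2) + 3 == c + (2 + 3) == 'f'
```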
Yes, I’m aware of that. My point is that this has nothing to do with the encoding, as that argument breaks down as soon as you have multibyte characters (same goes for the pointer argument!). That’s why I’ve been very careful to say “next character” instead (which no one has mentioned so far, as far as I can tell).
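A concrete multibyte case makes this visible; a minimal sketch ('é' is U+00E9, which takes two code units in UTF-8):

```julia
c = 'é'                               # U+00E9
@assert ncodeunits(string(c)) == 2    # encoded as two bytes in UTF-8

# Adding 1 gives the next *code point*, not the next byte of the encoding:
@assert c + 1 == 'ê'                  # U+00EA
@assert codepoint(c + 1) == 0x000000ea
```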
Viewed from that perspective, one could also argue that iterate(Char, 1234) should give the 1234th character in UTF-8, though I think hardly anyone would consider that a sensible API either.
The point made by @kristoffer.carlsson and @StefanKarpinski above still stands - this is the API for “next character” we currently have and we can’t get rid of it until 2.0. The problem is known and issues for it exist, so arguing about it being horrible/bad for students/whatever else is futile, as we have to live with it for now.
There’s also subtraction defined between characters to give you the integer difference in code points. This can be useful, if cutesy:
julia> rot(str, n) = String((collect(str) .- 'a' .+ n) .% 26 .+ 'a')
rot (generic function with 1 method)
julia> rot("hello", 13)
"uryyb"
which doesn’t really help for the original question.
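One nicety of that definition: rotation by 13 is its own inverse over the 26 lowercase letters, so applying it twice round-trips (lowercase-only, as written above):

```julia
rot(str, n) = String((collect(str) .- 'a' .+ n) .% 26 .+ 'a')

@assert rot("hello", 13) == "uryyb"
@assert rot(rot("hello", 13), 13) == "hello"   # rot13 is its own inverse
@assert rot("hello", 26) == "hello"            # a full rotation is the identity
```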
But it generates more questions, like: should we then treat Float64 as a sequence in this mathematical sense and distinguish between 1.0 + 1 and 1.0 + 1.0? I don’t think so.
The question is about implicit casts: are they error-prone, or can they be tolerated in special circumstances?
What happens when you add 1 to a multibyte character?
This isn’t an implicit cast, nor is it promotion (like your numeric 1 vs 1.0 example). It’s just that we’ve chosen a meaning for what happens when you add an integer and a char together.
It is implicit.
Admittedly not a cast, but still implicit, and not explicitly written by the user/programmer as in Char(Int('a')+1).
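For what it’s worth, the implicit and fully explicit spellings do agree; a minimal sketch:

```julia
# The implicit form...
@assert 'a' + 1 == 'b'

# ...matches the fully spelled-out version:
@assert Char(Int('a') + 1) == 'b'
```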