No, it doesn’t: somewhere further down the sequence is ..., 'α', 'β', ..., as defined by Unicode.
I think this goes to the core of my line of argument: 'a' + 1 == 'b' is implicit only if we consider 'a' and 'b' as representations of the letters a and b. It is explicit if we consider 'a' and 'b' as representations of Unicode code points.
Types aren’t iterable. You can’t do iterate(Int, 1234).
Whatever it is, it’s not a cast or a promotion.
Yes, it does. You’re talking about the order, which is not the encoding. The OP was talking about interpreting 'a' + 1 as adding 1 to the encoding (in ASCII), which just happens to match up for some values but not all. It breaks for multibyte characters/codepoints, which lose this property at e.g. multibyte boundaries, where the next character gets an additional byte.
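For anyone who wants to check the code point versus encoding distinction at the REPL, here is a minimal sketch. It only uses Base functions (codepoint, ncodeunits); the variable names are mine. It suggests that + 1 advances the code point by exactly one even where the UTF-8 encoding grows from one byte to two.

```julia
# Code point arithmetic versus UTF-8 encoding length.
a = 'a'
@show codepoint(a)            # the code point as a UInt32
@show a + 1 == 'b'
@show 'α' + 1 == 'β'          # U+03B1 and U+03B2 are adjacent code points

# Step across the 1-byte/2-byte UTF-8 boundary.
last_one_byte = Char(0x7f)            # U+007F, encoded in one byte
first_two_byte = last_one_byte + 1    # U+0080, encoded in two bytes

@show codepoint(first_two_byte) - codepoint(last_one_byte)
@show ncodeunits(last_one_byte), ncodeunits(first_two_byte)
```

Note that Char(0x80) is used rather than a '\x80' literal, because the literal denotes a raw byte and not the code point U+0080.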
I think this is unnecessarily pedantic. But okay, from now on I will say “code points”. Let’s see when the next person finds a flaw with that word choice…
On top of all of this, there are also explicit noncharacters (see Wikipedia):
A small set of code points are guaranteed never to be used for encoding characters, although applications may make use of these code points internally if they wish. There are sixty-six of these noncharacters: U+FDD0–U+FDEF and any code point ending in the value FFFE or FFFF (i.e., U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, … U+10FFFE, U+10FFFF).
So not even the interpretation of “add 1 to the codepoint” holds.
There is no such thing as “we consider”.
We can agree on anything here or in the docs and make it explicit that way.
Still, this doesn’t solve the OP’s problem: somebody considered something else and stumbled over the facts.
The important problem here is that in some code I see:
'a' + 1
and it’s not clear from the code itself what it means. I have to consult the docs and, with good luck, it is explicitly mentioned there. In this case it is. Well done, Julia docs people!
My opinion, in this special case, is that we should have a function
nextchar('a')
for this, like we have nextfloat, because 'a' + 1 is surprising and e.g. '1' + 1 is different in different languages.
Anyhow, I don’t have any strong feelings about this, but mathematical rigor will not solve it and doesn’t help.
Stefan’s pragmatic path of opening a PR for 2.0 is the way to go.
Mathematical formalism is how good programming languages are made. In Julia:
+(::T,::T) is used for (usually commutative) group operations (e.g. 1+2 or [1,1]+[2,2])
*(::T,::T) is used for monoid operations (in particular the multiplicative operation when T forms a ring) (e.g. 1*2, [1]*[2] or "foo"*"bar")
+(::T,::S) is used for group actions (which generalise the notion of group operation) (e.g. 'a'+1, Ptr{Cvoid}(3)+2 or 1.2+3)
- and / are the corresponding inverses.
Since group actions generalise group operations, you can view 1+1.0 and 1.0+1.0 both as either a group operation or an action and you get the same conclusion.
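To make the group action reading concrete, here is a small sketch that spot-checks the usual action laws for integers acting on Char. It is only a check on a few sample values, not a proof, and it assumes we stay inside the range of code points a Char can hold.

```julia
# Integers acting on Char by shifting the code point.
c = 'a'
m, n = 3, 4

@assert c + 0 == c                    # the identity element leaves c unchanged
@assert (c + m) + n == c + (m + n)    # acting twice equals acting by the sum
@assert (c + n) - n == c              # - undoes +
@assert (c + n) - c == n              # Char - Char recovers the integer
println("action laws hold for c = ", c, ", m = ", m, ", n = ", n)
```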
Sukera, no. Even “non characters” are Chars:
julia> '\UFDD0'
'\ufdd0': Unicode U+FDD0 (category Cn: Other, not assigned)
Julia’s Char is not UTF8. It’s a unicode code point. That’s why it prints out the unicode code point.
Whatever it is, it’s not a cast or a promotion.
This is not important here.
That it is implicit is the reason for the argument.
The OP was talking about interpreting 'a' + 1 as adding 1 to the encoding (in ASCII)
The OP never mentioned encodings or ASCII.
Since group actions generalise group operations, you can view 1+1.0 and 1.0+1.0 both as either a group operation or an action and you get the same conclusion.
I don’t know, I probably don’t understand it the right way, but then my conclusion is that
1.0 + 1.0 = 2.0
(the group operation) and
1.0 + 1 = nextfloat(1.0)
(the group action for Float64) are not the same. It would be like
'a' + 1 = nextchar('a') = 'b'
(or next_codepoint instead of nextchar).
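For reference, here is what current Julia does with these expressions. This is a small sketch using only Base functions: the mixed addition promotes the Int to Float64 and gives 2.0, while nextfloat steps to the adjacent representable Float64, so the two really are different operations.

```julia
# Mixed Float64/Int addition goes through promotion.
@assert promote(1.0, 1) === (1.0, 1.0)
@assert 1.0 + 1 === 2.0
@assert 1.0 + 1.0 === 2.0

# nextfloat moves to the adjacent representable Float64 instead.
@assert nextfloat(1.0) == 1.0 + eps(1.0)
@assert nextfloat(1.0) != 1.0 + 1
println(1.0 + 1, " vs ", nextfloat(1.0))
```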
I assume everyone here is ok with Date(2022, 3, 16) + Day(1) returning Date(2022, 3, 17)?
This is an addition where the operands are of different types and there is no promotion.
The reason it makes sense is because there is a sensible notion of subtraction between different dates, and we can measure it in days - the difference between tomorrow and today is one day.
There is no particular need for the difference between two things to be of the same type as the thing itself. Sometimes they are the same: for example, the difference between two integers is also an integer.
Oftentimes they are different. The difference between dates is measured in days. The difference between cities is measured in miles. In Julia, the difference between characters is an integer.
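The parallel between dates and characters can be sketched in a few lines. This uses the Dates standard library; the variable name d is mine. In both cases subtraction returns a "difference" type, and adding that difference back returns the original type.

```julia
using Dates

d = Date(2022, 3, 16)

# Date + Day gives a Date; Date - Date gives a Day.
@assert d + Day(1) == Date(2022, 3, 17)
@assert (d + Day(1)) isa Date
@assert Date(2022, 3, 17) - d == Day(1)

# Char + Int gives a Char; Char - Char gives an Int.
@assert 'a' + 1 == 'b'
@assert ('a' + 1) isa Char
@assert 'b' - 'a' == 1
@assert ('b' - 'a') isa Int
println("Date/Day and Char/Int behave alike")
```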
Implicit what? It cannot be just ‘implicit’.
Julia’s Char is not UTF8. It’s a unicode code point. That’s why it prints out the unicode code point.
Yes, I’m aware! However, the section I’ve quoted is not from the UTF-8 spec - Unicode itself says these are noncharacters. The very next part of the text even acknowledges that this is often ignored:
Like surrogates, the rule that these cannot be used is often ignored, although the operation of the byte order mark assumes that U+FFFE will never be the first code point in a text.
As such, while Julia can represent them, you can’t (shouldn’t?) expect to be able to do anything with them and (depending on how closely you want to follow the spec) can’t necessarily assume that they exist at all.
To be quite clear, I agree with all of you that the + 1 API is horrible and what is “meant” is “give me the next codepoint I can reasonably do something with”, even though the result may not be possible to actually encode in a valid unicode string (however encoded).
As mentioned above though, we’re stuck with it for now.
I agree with all of you that the + 1 API is horrible
I don’t think it is horrible. When we think of Chars as code points, the operation makes perfect sense.
Julia’s Char can explicitly, as a design goal, hold arbitrary data, including potentially invalid Unicode code points.
julia> Char(0x11FFFF)
'\U11ffff': Unicode U+11FFFF (category In: Invalid, too high)
Doing anything different removes your ability to programmatically fix encoding issues in your data.
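As a rough illustration of what "programmatically fix encoding issues" can look like, here is a sketch with a made-up string containing a stray 0xff byte. It relies only on Base functions (isvalid, collect, filter, map); the variable names and the choice of U+FFFD as a replacement are my assumptions, not something prescribed by Julia.

```julia
# A string with one invalid byte between two ASCII letters.
s = "a\xffb"
@assert !isvalid(s)

# Iteration still works and yields the bad byte as an invalid Char.
chars = collect(s)
@assert length(chars) == 3
@assert !isvalid(chars[2])

# Either drop the invalid data or substitute a replacement character.
cleaned = filter(isvalid, s)
replaced = map(c -> isvalid(c) ? c : '\ufffd', s)
@assert isvalid(cleaned) && isvalid(replaced)
println(cleaned, " ", replaced)

# An out-of-range code point is representable but reported as invalid.
@assert !isvalid(Char(0x11FFFF))
```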
Yeah, I don’t know if it’s exactly horrible. One should be able to express both something like nextchar and measuring the distance between characters. +(::Char, ::Int) and -(::Char, ::Char) are a symmetric pair that is both convenient and respects algebraic manipulation.
Implicit in contrast to explicit.
In
'a' + 1
some implicit magic happens, which is hidden from the programmer who types or reads this code snippet. He cannot be sure what the result is without trying it or reading the docs. In Julia '1' + 1 is '2', but not in other languages.
This is what is meant by implicit.
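For the '1' + 1 case specifically, a quick sketch of what Julia gives, next to the two other readings someone might expect. For comparison, in C the same expression is the int 50, and in JavaScript it is the string "11".

```julia
# Julia: shift the code point, stay a Char.
@assert '1' + 1 == '2'
@assert '9' + 1 == ':'               # U+0039 + 1 is U+003A, not a digit

# The "numeric value of the digit" reading has to be spelled out.
@assert parse(Int, '1') + 1 == 2

# The "integer code" reading (what C does) also has to be spelled out.
@assert Int('1') + 1 == 50
println('1' + 1, " ", parse(Int, '1') + 1, " ", Int('1') + 1)
```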
The explicit version that gives the same result is:
Char(Int('a')+1)
where everything is explicitly written out and you can guess the outcome easily, although you still can’t be sure.
I say “guess” and “result” on purpose, because it’s not the explicit code version of 'a' + 1. It just produces the same result in the case of 'a' and it may fail for other characters.
The equivalent explicit code version would be:
Char(Int32('a') + Int32(1))
So, for the problem it’s not important whether it is a cast, a promotion or a special function. The source of the problem here is that some hidden magic happens implicitly.
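Here is a small sketch comparing the implicit form with a few explicit spellings for the one character discussed. It only shows that they agree for 'a'; it makes no claim about how Base actually defines the method. In the REPL, `@which 'a' + 1` shows which method is really called.

```julia
c = 'a'

implicit = c + 1

# Explicit spellings that round-trip through an integer code point.
via_int = Char(Int(c) + 1)
via_int32 = Char(Int32(c) + Int32(1))
via_codepoint = Char(codepoint(c) + 1)

@assert implicit == via_int == via_int32 == via_codepoint == 'b'
println(implicit)
```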
This is often not a problem at all, as for example in
Date(2022, 3, 16) + Day(1)
since, from the types, it’s clear enough that the result is probably the next day as a Date.
For a Julia beginner this
"1" * "2"
can be a problem and he may have to check the docs.
The operator * is concatenation for String in Julia; one has to know it.
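A short sketch of the string case, using only Base functions: * concatenates, string(...) is the more explicit spelling of the same thing, and numeric multiplication needs an explicit parse.

```julia
# * on strings is concatenation.
@assert "1" * "2" == "12"

# string(...) says the same thing without an operator.
@assert string("1", "2") == "12"

# ^ is repeated concatenation, consistent with * as the "product".
@assert "ab" ^ 3 == "ababab"

# Multiplying the numbers requires parsing first.
@assert parse(Int, "1") * parse(Int, "2") == 2
println("1" * "2")
```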
Now
'1' + 1
is what?
It’s up to the individual how to interpret it. For some it’s clear, for others not.
Hence my suggestion would be
nextchar('1')
and to remove '1' + 1. My argument is one of taste in this very special case of Char.
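To show what such a function could look like, here is a minimal sketch. nextchar, prevchar and nextvalidchar are hypothetical names; none of them exists in Base. The third variant is one possible reading of "the next codepoint I can reasonably do something with", using isvalid as the test, which is my assumption.

```julia
# Hypothetical helpers, not part of Base: explicit names for c + 1 and c - 1.
nextchar(c::Char) = c + 1
prevchar(c::Char) = c - 1

# Hypothetical variant that skips code points for which isvalid is false,
# such as the surrogate range U+D800 to U+DFFF.
# It will throw once it runs past the largest code point a Char can hold.
function nextvalidchar(c::Char)
    n = c + 1
    while !isvalid(n)
        n += 1
    end
    return n
end

println(nextchar('a'), " ", nextchar('1'), " ", prevchar('b'))
```

With these definitions nextchar('1') returns '2', and nextvalidchar(Char(0xd7ff)) jumps over the surrogates to U+E000.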