As far as I understand, the reason for this is to be able to support invalid UTF-8 in String as well. It’s quite common to have some corrupted data that still has to be treated as a string, and being able to do that is usually seen as a strength. I wouldn’t call it an “idiosyncrasy”.
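To make that concrete, here is a minimal sketch of how a String holding a malformed byte behaves in a recent Julia 1.x REPL. The byte values are just an illustration I picked; 0xff can never occur in well-formed UTF-8.

```julia
# Build a String from raw bytes: 'a', an invalid byte, 'b'.
s = String(UInt8[0x61, 0xff, 0x62])

isvalid(s)        # false: the string as a whole is not valid UTF-8
ncodeunits(s)     # 3: every byte is kept, nothing is replaced or dropped
length(s)         # 3: 'a', one malformed character, 'b'

chars = collect(s)
isvalid.(chars)   # true, false, true: only the middle Char is malformed

# The original bytes survive a round trip through String.
codeunits(s) == UInt8[0x61, 0xff, 0x62]   # true
```

Nothing throws here, and the corrupted data can be passed around, written back out unchanged, or checked with isvalid only where it matters.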
You may also be interested in some of these previous discussions about various parts of the String type in Julia:
Java uses UTF-16, right? The same problems mentioned in the two links above should apply there as well, since UTF-16 is a variable-length encoding like UTF-8. I don’t know how Java treats those bad encodings, though. I think Java works around this problem by just not letting strings decompose into an iterator of char easily; such an iterator can get quite hairy to implement in a performant way (I can’t find the links to previous discussions about that though, sorry).
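For comparison, this is roughly what the variable-length encoding means on the Julia side, where iterating a String does yield Char values. It is only a small sketch with an example string of my own choosing, written with escapes so the byte counts are unambiguous.

```julia
# 'a' is 1 byte, U+00E9 is 2 bytes, U+20AC is 3 bytes in UTF-8.
s = "a\u00e9\u20ac"

ncodeunits(s)           # 6 bytes in total
length(s)               # 3 characters

# Valid indices are byte offsets of character starts, not 1, 2, 3.
idx = collect(eachindex(s))   # [1, 2, 4]

# nextind steps from one character start to the next.
nextind(s, 2)           # 4

# Iteration decodes one Char at a time.
cs = [c for c in s]     # 3 Chars

# Indexing into the middle of a character is an error.
threw = try
    s[3]
    false
catch err
    err isa StringIndexError
end
```

Here s[3] lands inside the two-byte character, so it throws a StringIndexError instead of returning half a character.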