This thread is starting to diverge somewhat, but I’m game…
Java chars are 16-bit, and Strings are encoded internally with UTF-16, but the String class is (unsurprisingly) implemented as an array of bytes. Before 32-bit characters became a thing, that meant you could do a lot of stuff in O(1). Now, I think they have the same problem as Julia. From the javadoc:
> Index values refer to `char` code units, so a supplementary character uses two positions in a `String`.
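You can actually see the surrogate pair if you poke at a single character. A quick sketch you can paste into jshell (expected results are in the comments; `tile` is just my own variable name):

```java
// One supplementary character is stored as a surrogate pair: two char code units.
String tile = "🩢";
tile.length()                              // 2
Character.isHighSurrogate(tile.charAt(0))  // true
Character.isLowSurrogate(tile.charAt(1))   // true
Character.charCount(tile.codePointAt(0))   // 2
```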
Which means, as you said, they have the same problem I posted here, except the indexing is in 16-bit units instead of 8-bit ones:
```
jshell> "🩢🩣🩤".substring(0,1)
$119 ==> "?"
jshell> "🩢🩣🩤".substring(0,2)
$120 ==> "🩢"
jshell> "🩢🩣🩤".length()
$122 ==> 6
```
This is worse than I thought… Julia’s functions which work on codepoints get it right:
```
julia> length("🩢🩣🩤")
3
```
And some of the newer Java functions work with variable-length encoding:
```
jshell> "🩢🩣🩤".codePointCount(0,6)
$137 ==> 3
jshell> "🩢🩣🩤".codePoints().forEach(System.out::println)
129634
129635
129636
jshell> "🩢🩣🩤".codePoints().forEach(c -> System.out.println(Character.toString(c)))
🩢
🩣
🩤
```
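So with the stream API you can build a codepoint view and index into that, at the cost of materializing an int array first. Again just a sketch for jshell, nothing official, and `tiles`/`cps` are my own names:

```java
// One int per code point, so indexing is by "character" rather than by UTF-16 code unit.
String tiles = "🩢🩣🩤";
int[] cps = tiles.codePoints().toArray();

Character.toString(cps[1])   // "🩣"  (Java 11+ overload that takes a code point)
new String(cps, 1, 2)        // "🩣🩤" (the String(int[], offset, count) constructor)
```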
Basically it is a mix of new code that works on codepoints and old code that works on 16-bit indexes. A total mess, in other words. I think Julia can do better if it is consistent about using either codepoints or byte indexes. I can see that, like Java, you won't give up O(1) indexing or break backwards compatibility to do this, though.
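For completeness, the old API does let you translate a codepoint offset into a char offset, but it has to walk the string, which is exactly the O(1)-vs-O(n) tradeoff I mean. A sketch for jshell (same hypothetical `tiles` variable as above):

```java
// Translate a codepoint offset into a char (UTF-16 code unit) offset, then substring.
// offsetByCodePoints has to scan from the start, so this is O(n), not O(1).
String tiles = "🩢🩣🩤";
int end = tiles.offsetByCodePoints(0, 1);   // 2: char index just past the first code point
tiles.substring(0, end)                     // "🩢"
```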