This thread is starting to diverge somewhat, but I’m game…
Java chars are 16-bit, and Strings are encoded internally with UTF-16, but the String class is (unsurprisingly) implemented as an array of bytes. Before 32-bit characters became a thing, that meant you could do a lot of stuff in O(1). Now, I think they have the same problem as Julia. From the javadoc:
> Index values refer to `char` code units, so a supplementary character uses two positions in a `String`.
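You can actually see the surrogate pair if you poke at a single character. A quick sketch you can paste into jshell (expected results are in the comments; `tile` is just my own variable name):

```java
// One supplementary character is stored as a surrogate pair: two char code units.
String tile = "🩢";
tile.length()                              // 2
Character.isHighSurrogate(tile.charAt(0))  // true
Character.isLowSurrogate(tile.charAt(1))   // true
Character.charCount(tile.codePointAt(0))   // 2
```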
Which means, as you said, they have the same problem I posted here, except the indexing is in 16-bit units instead of 8-bit ones:
```
jshell> "🩢🩣🩤".substring(0,1)
$119 ==> "?"
jshell> "🩢🩣🩤".substring(0,2)
$120 ==> "🩢"
jshell> "🩢🩣🩤".length()
$122 ==> 6
```
This is worse than I thought… Julia’s functions which work on codepoints get it right:
```
julia> length("🩢🩣🩤")
3
```
And some of the newer Java functions work with variable-length encoding:
```
jshell> "🩢🩣🩤".codePointCount(0,6)
$137 ==> 3
jshell> "🩢🩣🩤".codePoints().forEach(System.out::println)
129634
129635
129636
jshell> "🩢🩣🩤".codePoints().forEach(c -> System.out.println(Character.toString(c)))
🩢
🩣
🩤
```
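So with the stream API you can build a codepoint view and index into that, at the cost of materializing an int array first. Again just a sketch for jshell, nothing official, and `tiles`/`cps` are my own names:

```java
// One int per code point, so indexing is by "character" rather than by UTF-16 code unit.
String tiles = "🩢🩣🩤";
int[] cps = tiles.codePoints().toArray();

Character.toString(cps[1])   // "🩣"  (Java 11+ overload that takes a code point)
new String(cps, 1, 2)        // "🩣🩤" (the String(int[], offset, count) constructor)
```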
Basically it is a mix of new code that works on codepoints and old code that works on 16-bit indexes. A total mess, in other words. I think Julia can do better if it is consistent about using either codepoints or byte indexes. I can see that, like Java, you won't give up O(1) indexing or break backwards compatibility to do this, though.
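For completeness, the old API does let you translate a codepoint offset into a char offset, but it has to walk the string, which is exactly the O(1)-vs-O(n) tradeoff I mean. A sketch for jshell (same hypothetical `tiles` variable as above):

```java
// Translate a codepoint offset into a char (UTF-16 code unit) offset, then substring.
// offsetByCodePoints has to scan from the start, so this is O(n), not O(1).
String tiles = "🩢🩣🩤";
int end = tiles.offsetByCodePoints(0, 1);   // 2: char index just past the first code point
tiles.substring(0, end)                     // "🩢"
```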