I am really trying to understand this, but sorry, what I see is just a declaration and not an explanation.
Yes, that was just my summary of the conclusion. The explanation is above in this thread.
I could try (with my poor English, so sorry, it won’t be optimal) to explain why I think that AbstractString and encodings have to allow separation from Unicode.
There is nothing about the string interface that dictates Unicode except that an AbstractString is an encoding of a sequence of code points, and the meanings of code points are given by Unicode. This is not really a limitation, however, since Unicode is a superset of all the other character sets you might care to use. You can implement string types for any encoding you want to.
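To make that concrete, here is a minimal sketch of a single-byte string type. It uses Latin-1 rather than CP-1250 only because Latin-1 byte values equal Unicode code points, so no decode table is needed; a CP-1250 type would look the same with a 256-entry lookup in `iterate`. The name `Latin1String` is hypothetical, and the methods shown are the ones I understand the AbstractString interface to require.

```julia
# Hypothetical single-byte string type: every character is exactly one byte.
struct Latin1String <: AbstractString
    data::Vector{UInt8}
end

# Code-unit level interface: one code unit (a UInt8) per character.
Base.ncodeunits(s::Latin1String) = length(s.data)
Base.codeunit(s::Latin1String) = UInt8
Base.codeunit(s::Latin1String, i::Integer) = s.data[i]

# Every index in bounds starts a character, so the n-th character is at index n.
Base.isvalid(s::Latin1String, i::Integer) = 1 <= i <= length(s.data)

# Decoding: a Latin-1 byte value is the Unicode code point of its character.
function Base.iterate(s::Latin1String, i::Integer = 1)
    i > length(s.data) && return nothing
    return (Char(s.data[i]), i + 1)
end
```

With this in place, `s[n]` is a direct array lookup, and generic string code (including the standard `uppercase`) works on the type because it only sees code points.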
I see more reasons, but we could focus on performance here. If I have CP-1250, for example, then all chars are single bytes, so an O(N) algorithm is not necessary to find the n-th character. Uppercase could be implemented as a simple translation table, i.e. a function from UInt8 to UInt8 (or a simple 256-element Vector{UInt8} where the index is the input byte and the element is the result). No locale dependencies have to be checked. I see a reason not to slow this function down with Unicode complications.
You can absolutely implement that transformation. But then you are, by definition, not implementing the Unicode.uppercase function. You can call the function that implements this CP1250.uppercase or something like that. If you know that you have CP-1250 strings and you really want to have very fast case transformations and are ok with the non-standard case mapping it implements, then you can use this function. But that’s a lot of "if"s. When someone else writes generic string code and asks for Unicode.uppercase, it would not be ok if the result they got did not do the expected standard Unicode case transformation. That is why the Unicode.uppercase and CP1250.uppercase functions should be separate.
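A rough sketch of what such a separate, table-driven function could look like. Everything here is an assumption made for illustration: `CP1250Bytes` and `cp1250_uppercase` are hypothetical names standing in for `CP1250.String` and `CP1250.uppercase`, and the table only fills in ASCII plus three CP-1250 letter pairs, not the full code page.

```julia
# Hypothetical wrapper for bytes that are assumed to hold CP-1250 text.
struct CP1250Bytes
    data::Vector{UInt8}
end

# Build the 256-entry table: entry b + 1 holds the uppercase form of byte b.
function build_upper_table()
    table = UInt8[UInt8(b) for b in 0:255]   # start from the identity mapping
    for b in UInt8('a'):UInt8('z')
        table[Int(b) + 1] = b - 0x20          # ASCII a-z -> A-Z
    end
    # A few CP-1250 specific pairs, for illustration only (not the full mapping).
    table[0x9a + 1] = 0x8a   # š -> Š
    table[0x9e + 1] = 0x8e   # ž -> Ž
    table[0xe1 + 1] = 0xc1   # á -> Á
    return table
end

const UPPER_TABLE = build_upper_table()

# Byte-for-byte case mapping: one table lookup per character, no locale checks.
cp1250_uppercase(s::CP1250Bytes) =
    CP1250Bytes(UInt8[UPPER_TABLE[Int(b) + 1] for b in s.data])
```

Note that this is its own function with its own name; it does not add a method to the standard Unicode uppercase, so generic code that asks for the Unicode mapping still gets it.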
If you want a function that does some kind of uppercasing and you’re not too fussy about exactly how, so long as the result looks reasonably uppercaseish, then you can define something like this in your code:
uppercase(s::CP1250.String) = CP1250.uppercase(s)
uppercase(s::AbstractString) = Unicode.uppercase(s)
Now you have an uppercase function that is fast for CP-1250 and has a reasonable fallback for other kinds of strings. You can add as many methods to this function as you want to for specialized ways of uppercasing particular string types.
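Here is a self-contained version of that dispatch pattern, as a sketch under a few assumptions. `CP1250Text`, `cp1250_uppercase` and `loose_uppercase` are hypothetical names; the generic function is deliberately not called `uppercase` so it cannot clash with the one Base exports (which, as far as I know, is where the Unicode case mapping lives in Julia 1.x). The CP-1250 stand-in only handles ASCII letters to keep it short.

```julia
# Hypothetical stand-in for the thread's CP1250.String: a wrapper around raw bytes.
struct CP1250Text
    data::Vector{UInt8}
end

# Stand-in for CP1250.uppercase; ASCII letters only, to keep the sketch short.
function cp1250_uppercase(s::CP1250Text)
    return CP1250Text(UInt8[0x61 <= b <= 0x7a ? b - 0x20 : b for b in s.data])
end

# One generic function with one method per kind of string.
loose_uppercase(s::CP1250Text) = cp1250_uppercase(s)       # fast byte-wise path
loose_uppercase(s::AbstractString) = Base.uppercase(s)     # standard Unicode fallback
```

Calling `loose_uppercase` on a `CP1250Text` takes the byte-wise path, while any ordinary string falls through to the Unicode mapping. If the CP-1250 type were itself a subtype of AbstractString, the more specific method would still be the one selected.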
I suppose I could be missing something, but what is the reason to replace this simplicity with ambiguous and slow Unicode functions?
The Unicode functions are the opposite of ambiguous: they are carefully specified and standardized. They’re also not especially slow, and I have yet to see a situation where string case transformations are performance-critical. But if you find yourself in such a case, you can always use the CP1250.uppercase function.
IMHO, if there is a need for a Unicode “nonviolent” design, then why not have an AbstractUnicodeString in the string type hierarchy?
There is no need for this since AbstractString already supports all kinds of encodings.