If it’s a trivial function, isn’t that a good reason to (update to 1.0 and) add this capability to the standard library? And if it’s a non-trivial one (a can of worms for security), isn’t that an even better argument?
http://unicode.org/reports/tr36/
3.2 Text Comparison (Sorting, Searching, Matching)
The UTF-8 exploit is a special case of a general problem. Security problems may arise where a user and a system (or two systems) compare text differently. For example, this happens where text does not compare as users expect. […]
There are some other areas to watch for. Where these are overlooked, it may leave a system open to the text comparison security problems.
- Normalization is context dependent; do not assume NFC(x + y) = NFC(x) + NFC(y).
- There are two binary Unicode orders: code point/UTF-8/UTF-32 and UTF-16 order. […]
- Avoid using non-Unicode charsets where possible. […]
- When converting charsets, never simply omit characters that cannot be converted; at least substitute U+FFFD (when converting to Unicode) or 0x1A (when converting to bytes) to reduce security problems. See also [UTS22].
- Regular expression engines use […]
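A few of the points in the list above are easy to reproduce. The following is only a minimal Python sketch, not part of TR36: the sample strings and the `substitute_sub` handler name are my own assumptions, chosen for illustration.

```python
import codecs
import unicodedata


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


# 1. Normalization is not closed under concatenation.
x, y = "a", "\u0301"  # LATIN SMALL LETTER A, COMBINING ACUTE ACCENT
joined_then_normalized = nfc(x + y)       # composes to the single code point U+00E1
normalized_then_joined = nfc(x) + nfc(y)  # stays as two code points

# 2. Code point order and UTF-16 code unit order disagree once
#    supplementary characters are involved.
bmp_char = "\uff5e"       # U+FF5E, one UTF-16 code unit: 0xFF5E
supp_char = "\U00010000"  # U+10000, UTF-16 surrogate pair: 0xD800 0xDC00
by_code_point = sorted([supp_char, bmp_char])  # Python compares str by code point
by_utf16 = sorted([bmp_char, supp_char], key=lambda s: s.encode("utf-16-be"))

# 3. Substitute rather than silently drop data that cannot be converted.
raw = b"caf\xe9"  # Latin-1 bytes, not valid ASCII
dropped = raw.decode("ascii", errors="ignore")    # loses a character
replaced = raw.decode("ascii", errors="replace")  # inserts U+FFFD


def _substitute_sub(exc: UnicodeError):
    """Hypothetical handler: emit 0x1A for each character that cannot be encoded."""
    if isinstance(exc, UnicodeEncodeError):
        return ("\x1a" * (exc.end - exc.start), exc.end)
    raise exc


codecs.register_error("substitute_sub", _substitute_sub)
to_bytes = "caf\u00e9".encode("ascii", errors="substitute_sub")
```

Here `joined_then_normalized` and `normalized_then_joined` differ, the two sorts return the same pair in opposite orders, and `dropped` is one character shorter than the input, which is exactly the kind of silent change the report warns about.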
Transitivity is crucial to correct functioning of sorting algorithms. Transitivity means that if a < b and b < c then a < c. It means that there cannot be any cycles: a < b < c < a.
A lack of transitivity in string comparisons may cause security problems, including denial-of-service attacks. As an example of a failure of transitivity, consider the following pseudocode:
int compare(a,b) {
    if (isNumber(a) && isNumber(b)) {
        return numberComparison(a,b);
    } else {
        return textComparison(a,b);
    }
}
The code seems straightforward, but produces the following non-transitive result:
“12” < “12a” < “2” < “12”
For the first two comparisons, one of the values is not a number, so both values are compared as text. For the last comparison, both values are numbers, so they are compared numerically. This breaks transitivity because a cycle is introduced.
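To see the cycle concretely, here is a rough Python transcription of the flawed pseudocode. It assumes that "number" means a non-empty run of ASCII digits and that text comparison is plain code point comparison; the report does not define either helper, so both are stand-ins.

```python
def is_number(s: str) -> bool:
    # Assumption: a "number" is a non-empty string of ASCII digits.
    return s.isascii() and s.isdigit()


def number_comparison(a: str, b: str) -> int:
    x, y = int(a), int(b)
    return (x > y) - (x < y)


def text_comparison(a: str, b: str) -> int:
    # Assumption: plain code point comparison stands in for text comparison.
    return (a > b) - (a < b)


def compare(a: str, b: str) -> int:
    """The flawed comparison: numeric only when both sides are numbers."""
    if is_number(a) and is_number(b):
        return number_comparison(a, b)
    return text_comparison(a, b)


# Walk the chain "12" < "12a" < "2" < "12" one adjacent pair at a time.
chain = ["12", "12a", "2", "12"]
results = [compare(p, q) for p, q in zip(chain, chain[1:])]
```

Every entry in `results` is negative, so each element compares as less than the next and the chain ends where it started.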
The following pseudocode illustrates one way to repair the code, by sorting all numbers before all non-numbers:
int compare(a,b) {
    if (isNumber(a)) {
        if (isNumber(b)) {
            return numberComparison(a,b);
        } else {
            return -1; // a is less than b, since a is a number and b isn't
        }
    } else if (isNumber(b)) {
        return 1; // b is less than a, since b is a number and a isn't
    } else {
        return textComparison(a,b);
    }
}
Therefore, for complex comparisons, such as language-sensitive comparison, it is important to test for transitivity thoroughly.
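One way to test for transitivity is to brute-force every ordered triple from a set of awkward sample strings. The sketch below is my own, not from the report: `find_violations`, the sample list, and the digit-only definition of "number" are all assumptions. It runs both the naive comparison and the repaired one (numbers sort before non-numbers) through the same check.

```python
from itertools import permutations


def is_number(s: str) -> bool:
    # Assumption: a "number" is a non-empty string of ASCII digits.
    return s.isascii() and s.isdigit()


def cmp(x, y) -> int:
    return (x > y) - (x < y)


def naive_compare(a: str, b: str) -> int:
    if is_number(a) and is_number(b):
        return cmp(int(a), int(b))
    return cmp(a, b)


def repaired_compare(a: str, b: str) -> int:
    """Numbers sort before all non-numbers, as in the repaired pseudocode."""
    if is_number(a):
        return cmp(int(a), int(b)) if is_number(b) else -1
    if is_number(b):
        return 1
    return cmp(a, b)


def find_violations(compare, samples):
    """Return every (a, b, c) where a < b and b < c but not a < c."""
    return [
        (a, b, c)
        for a, b, c in permutations(samples, 3)
        if compare(a, b) < 0 and compare(b, c) < 0 and not compare(a, c) < 0
    ]


samples = ["12", "12a", "2", "10", "a", ""]
naive_violations = find_violations(naive_compare, samples)
repaired_violations = find_violations(repaired_compare, samples)
```

On these samples the naive comparison yields violations, including the triple ("12", "12a", "2"), while the repaired one yields none. The check is cubic in the number of samples, so it suits a small, hand-picked set of edge cases rather than bulk data.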
Some programmers may rely on limitations that are true of ASCII or Latin-1, but fail with general Unicode text. These can cause failures such as buffer overruns if the length of text grows. In particular:
- Strings may expand in casing: Fluß → FLUSS → fluss. The expansion factor may change depending on the UTF as well. […]
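The casing expansion is easy to check in Python, whose `str.upper()` applies the full case mappings. This is only an illustrative sketch; the `sizes` helper is a name I made up to report lengths in code points and in UTF-8 and UTF-16 bytes.

```python
def sizes(s: str) -> dict:
    """Length of s in code points, UTF-8 bytes, and UTF-16 bytes (no BOM)."""
    return {
        "code_points": len(s),
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_bytes": len(s.encode("utf-16-le")),
    }


original = "Flu\u00df"   # "Fluß"
upper = original.upper()  # ß has no single-character uppercase here, so it becomes "SS"
lower = upper.lower()     # lowercasing does not restore the ß

before = sizes(original)
after = sizes(upper)
```

The string grows from 4 to 5 code points. In UTF-8 the byte length stays at 5, because ß already took two bytes, while in UTF-16 it grows from 8 to 10 bytes. A buffer sized from the original length is therefore safe in one encoding and too small in the other.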