Sorting strings containing numbers, so that "A2" < "A10"?


#1

I need to sort strings containing numbers. Is there a simple way to sort them, such that, for example, "A2" < "A10"? That is, I want a lexicographic compare, such that numbers in the string are compared by value.

I guess this should be a version compare. Is it implemented in Julia, maybe in a package?


#2

You could try it along this way:

julia> list(s::AbstractString) = zip(split(s, r"[\d]+"), parse.(split(s, r"[^\d]+")))
list (generic function with 1 method)
julia> lex2(s::AbstractString) = map(x -> x[2] == nothing ? x[1] : x[2], list(s))
lex2 (generic function with 1 method)
julia> islesscustom(a,b) = begin
             for x in (zip(lex2(a), lex2(b)))
                 x[1] < x[2] && return true
             end;
             false
          end
islesscustom (generic function with 1 method)
julia> islesscustom("A20", "A3")
false

julia> islesscustom("A3", "A20")
true

#3

Depending on the flexibility needed, here is another method that is a bit faster:

function custom_cmp(x::String)
   number_idx = findfirst(isdigit, x)
   str, num = SubString(x, 1, number_idx-1), SubString(x, number_idx, length(x))
   return str, parse(Int, num)
end

l = ["A5", "A10", "A200", "A20"]

sort(l, by = custom_cmp)
@btime sort(l, lt = islesscustom)
  405.706 μs (577 allocations: 25.81 KiB)

@btime sort(l, by = custom_cmp)
  2.292 μs (33 allocations: 1.31 KiB)

#4

What if there are multiple numbers, “A10B3cd”? Ideally we should have a lexicographic comparison, that detects al the numbers and compares them by value.


#5

As I implied, you need to more clearly define exactly what you expect to put into the comparison. How should “5” vs “A” compare? How should “A53” vs “53A” etc.


#6
julia> function naturalsort(x::Vector{String})
           f = text -> all(isnumber, text) ? Char(parse(Int, text)) : text
           sorter = key -> join(f(c) for c in matchall(r"[0-9]+|[^0-9]+", key))
           sort(x, by=sorter)
       end
naturalsort (generic function with 1 method)

julia> l = ["A5a", "A10b", "A10a", "A200", "A20"];

julia> @btime naturalsort($l)
  10.196 μs (233 allocations: 6.67 KiB)
5-element Array{String,1}:
 "A5a" 
 "A10a"
 "A10b"
 "A20" 
 "A200"

with inspiration from here.


#7

I don’t know if this will help you now, but if you are the one generating the strings, a good technique is to fill the number with zeros.

The following function is useful

numstring(x, n=3) = string(x + 10^n)[2:end]

With this function,

julia> x = "A" .* numstring.([rand(1:500) for i in 1:10], 5)
10-element Array{String,1}:
 "A00077"
 "A00137"
 "A00347"
 "A00050"
 "A00386"
 "A00347"
 "A00272"
 "A00178"
 "A00169"
 "A00160"

julia> sort(x)
10-element Array{String,1}:
 "A00050"
 "A00077"
 "A00137"
 "A00160"
 "A00169"
 "A00178"
 "A00272"
 "A00347"
 "A00347"
 "A00386"

This has the added benefit when naming files, they will appear in order in the file system.


#8

You have a trick, but ls, e.g. ls -lv (can’t speak for Windows…) already has what you need to avoid naming like this:

-v natural sort of (version) numbers within text


#9

I did not know that, cool. Thanks. But is still nice to left pad the numbers with zeros

Paulo


#10

there is also https://github.com/simonster/NaturalSort.jl, which sadly does not seem to be maintained.


#11

It’s a single 72-line function; should be easy enough to get working on 1.0 if someone has the interest. The main difficulty should be that the iteration protocol changed.


#12

If it’s a trivial function then isn’t it a good reason to (update to 1.0 and) add this capability to the standard library? And if it’s a non-trivial one (a can of worms for security), an even better argument?

http://unicode.org/reports/tr36/

3.2 Text Comparison (Sorting, Searching, Matching)

The UTF-8 exploit is a special case of a general problem. Security problems may arise where a user and a system (or two systems) compare text differently. For example, this happens where text does not compare as users expect. […]

There are some other areas to watch for. Where these are overlooked, it may leave a system open to the text comparison security problems.

  1. Normalization is context dependent; do not assume NFC(x + y) = NFC(x) + NFC(y).
  2. There are two binary Unicode orders: code point/UTF-8/UTF-32 and UTF-16 order. […]
  3. Avoid using non-Unicode charsets where possible.[…]
  4. When converting charsets, never simply omit characters that cannot be converted; at least substitute U+FFFD (when converting to Unicode) or 0x1A (when converting to bytes) to reduce security problems. See also [UTS22].
  5. Regular expression engines use […]

Transitivity is crucial to correct functioning of sorting algorithms. Transitivity means that if a < b and b < c then a < c. It means that there cannot be any cycles: a < b < c < a.

A lack of transitivity in string comparisons may cause security problems, including denial-of-service attacks. As an example of a failure of transitivity, consider the following pseudocode:

int compare(a,b) {
  if (isNumber(a) && isNumber(b)) {
    return numberComparison(a,b);
  } else {
    return textComparison(a,b);
  }
}

The code seems straightforward, but produces the following non-transitive result:

"12" < “12a” < “2” < "12"

For the first two comparisons, one of the values is not a number, therefore both values are compared as text. For the last two, both are numbers, and compared numerically. This breaks transitivity because a cycle is introduced.

The following pseudocode illustrates one way to repair the code, by sorting all numbers before all non-numbers:

int compare(a,b) {
  if (isNumber(a)) { if (isNumber(b)) {
    return numberComparison(a,b);
  } else {
    return -1; // a is less than b, since a is a number and b isn't }
  } else if (isNumber(b)) {
    return 1; // b is less than a, since b is a number and a isn't } else {
  return textComparison(a,b);
  }
}

Therefore, for complex comparisons, such as language-sensitive comparison, it is important to test for transitivity thoroughly.

3.3 Buffer Overflows

Some programmers may rely on limitations that are true of ASCII or Latin-1, but fail with general Unicode text. These can cause failures such as buffer overruns if the length of text grows. In particular:

  1. Strings may expand in casing: Fluß → FLUSS → fluss. The expansion factor may change depending on the UTF as well. […]

#13

Neither triviality nor non-triviality are good reasons to put something into in the base language.


#14

The pull request to fix the issue seems to exist. I didn’t test if it actually fixes the package.


#15

I’ve opened an issue to move the package to the JuliaStrings org. Then it won’t be on Simon to maintain anymore.


#16

This pull request is flawed. E.g. providing an empty strings already breaks it due to incorrect tuple decomposition of the iterate() result at the beginning.


#17

Great!

PHP is the language to beat, with natural sorting built in (and thus discoverable):
http://us3.php.net/manual/en/function.strnatcmp.php

With trivial function I meant at least it should be very discoverable, if not in the standard library, then possibly the docs should point to the relevant package.

https://blog.codinghorror.com/sorting-for-humans-natural-sort-order/

The default sort functions in almost every programming language are poorly suited for human consumption.

Most languages including made for text processing Perl and Python only have in libraries:


#18

Let’s not get carried away. “Natural sorting” is a nice feature that belongs in a package, like the zillion other specialized functions. It is rather trivial to implement, and has nothing to do with comparing languages.

Incidentally, while I like to read Atwood’s essays, it is apparent that his thinking is biased by his (excellent) work on user interfaces. This kind of sorting may be useful when displaying strings to users, but is not universally preferable to other sorting orders. Eg when comparing keys in a sorted lookup table (Base.searchsortedfirst), the actual order may be irrelevant, I just want it to be fast. Arguing that those who don’t prefer this comparison function to others are not “sane” is just needless hyperbole.


#19

To whom it may concern: for your consideration.

function natural(x, y)
    k(x) = [occursin(r"\d+", s) ? parse(Int, s) : s 
            for s in split(replace(x, r"\d+" => s->" $s "))]
    A = k(x); B= k(y)    
    for (a, b) in zip(A, B)
        if !isequal(a, b)
            return typeof(a) <: typeof(b) ? isless(a, b) :
                   isa(a,Int) ? true : false
        end
    end
    return length(A) < length(B)
end
julia> sort(["a1", "a2", "a10"])
3-element Array{String,1}:
 "a1"
 "a10"
 "a2"

julia> sort(["a1", "a2", "a10"], lt=natural)
3-element Array{String,1}:
 "a1"
 "a2"
 "a10"