Sorting strings containing numbers, so that "A2" < "A10"?

e3c6 · August 14, 2017, 9:37am

I need to sort strings containing numbers. Is there a simple way to sort them, such that, for example, "A2" < "A10"? That is, I want a lexicographic compare, such that numbers in the string are compared by value.

I guess this should be a version compare. Is it implemented in Julia, maybe in a package?

klacru · August 14, 2017, 10:21am

You could try it along this way:

julia> list(s::AbstractString) = zip(split(s, r"[\d]+"), parse.(split(s, r"[^\d]+")))
list (generic function with 1 method)
julia> lex2(s::AbstractString) = map(x -> x[2] == nothing ? x[1] : x[2], list(s))
lex2 (generic function with 1 method)
julia> islesscustom(a,b) = begin
             for x in (zip(lex2(a), lex2(b)))
                 x[1] < x[2] && return true
             end;
             false
          end
islesscustom (generic function with 1 method)
julia> islesscustom("A20", "A3")
false

julia> islesscustom("A3", "A20")
true

kristoffer.carlsson · August 14, 2017, 12:33pm

Depending on the flexibility needed, here is another method that is a bit faster:

function custom_cmp(x::String)
   number_idx = findfirst(isdigit, x)
   str, num = SubString(x, 1, number_idx-1), SubString(x, number_idx, length(x))
   return str, parse(Int, num)
end

l = ["A5", "A10", "A200", "A20"]

sort(l, by = custom_cmp)

@btime sort(l, lt = islesscustom)
  405.706 μs (577 allocations: 25.81 KiB)

@btime sort(l, by = custom_cmp)
  2.292 μs (33 allocations: 1.31 KiB)

e3c6 · August 14, 2017, 1:11pm

What if there are multiple numbers, “A10B3cd”? Ideally we should have a lexicographic comparison, that detects al the numbers and compares them by value.

kristoffer.carlsson · August 14, 2017, 1:23pm

As I implied, you need to more clearly define exactly what you expect to put into the comparison. How should “5” vs “A” compare? How should “A53” vs “53A” etc.

pfitzseb · August 14, 2017, 1:42pm

julia> function naturalsort(x::Vector{String})
           f = text -> all(isnumber, text) ? Char(parse(Int, text)) : text
           sorter = key -> join(f(c) for c in matchall(r"[0-9]+|[^0-9]+", key))
           sort(x, by=sorter)
       end
naturalsort (generic function with 1 method)

julia> l = ["A5a", "A10b", "A10a", "A200", "A20"];

julia> @btime naturalsort($l)
  10.196 μs (233 allocations: 6.67 KiB)
5-element Array{String,1}:
 "A5a" 
 "A10a"
 "A10b"
 "A20" 
 "A200"

with inspiration from here.

Paulo_Jabardo · August 14, 2017, 2:26pm

I don’t know if this will help you now, but if you are the one generating the strings, a good technique is to fill the number with zeros.

The following function is useful

numstring(x, n=3) = string(x + 10^n)[2:end]

With this function,

julia> x = "A" .* numstring.([rand(1:500) for i in 1:10], 5)
10-element Array{String,1}:
 "A00077"
 "A00137"
 "A00347"
 "A00050"
 "A00386"
 "A00347"
 "A00272"
 "A00178"
 "A00169"
 "A00160"

julia> sort(x)
10-element Array{String,1}:
 "A00050"
 "A00077"
 "A00137"
 "A00160"
 "A00169"
 "A00178"
 "A00272"
 "A00347"
 "A00347"
 "A00386"

This has the added benefit when naming files, they will appear in order in the file system.

Palli · August 15, 2017, 7:48pm

You have a trick, but ls, e.g. ls -lv (can’t speak for Windows…) already has what you need to avoid naming like this:

-v natural sort of (version) numbers within text

Paulo_Jabardo · August 16, 2017, 5:54pm

I did not know that, cool. Thanks. But is still nice to left pad the numbers with zeros

Paulo

bjarthur · February 5, 2019, 6:50pm

there is also https://github.com/simonster/NaturalSort.jl, which sadly does not seem to be maintained.

StefanKarpinski · February 5, 2019, 11:19pm

It’s a single 72-line function; should be easy enough to get working on 1.0 if someone has the interest. The main difficulty should be that the iteration protocol changed.

Palli · February 6, 2019, 1:35pm

If it’s a trivial function then isn’t it a good reason to (update to 1.0 and) add this capability to the standard library? And if it’s a non-trivial one (a can of worms for security), an even better argument?

http://unicode.org/reports/tr36/

3.2 Text Comparison (Sorting, Searching, Matching)

The UTF-8 exploit is a special case of a general problem. Security problems may arise where a user and a system (or two systems) compare text differently. For example, this happens where text does not compare as users expect. […]

There are some other areas to watch for. Where these are overlooked, it may leave a system open to the text comparison security problems.

Normalization is context dependent; do not assume NFC(x + y) = NFC(x) + NFC(y).

There are two binary Unicode orders: code point/UTF-8/UTF-32 and UTF-16 order. […]

Avoid using non-Unicode charsets where possible.[…]

When converting charsets, never simply omit characters that cannot be converted; at least substitute U+FFFD (when converting to Unicode) or 0x1A (when converting to bytes) to reduce security problems. See also [UTS22].

Regular expression engines use […]

Transitivity is crucial to correct functioning of sorting algorithms. Transitivity means that if a < b and b < c then a < c. It means that there cannot be any cycles: a < b < c < a.

A lack of transitivity in string comparisons may cause security problems, including denial-of-service attacks. As an example of a failure of transitivity, consider the following pseudocode:
int compare(a,b) {
  if (isNumber(a) && isNumber(b)) {
    return numberComparison(a,b);
  } else {
    return textComparison(a,b);
  }
}
The code seems straightforward, but produces the following non-transitive result:

“12” < “12a” < “2” < “12”

For the first two comparisons, one of the values is not a number, therefore both values are compared as text. For the last two, both are numbers, and compared numerically. This breaks transitivity because a cycle is introduced.

The following pseudocode illustrates one way to repair the code, by sorting all numbers before all non-numbers:
int compare(a,b) {
  if (isNumber(a)) { if (isNumber(b)) {
    return numberComparison(a,b);
  } else {
    return -1; // a is less than b, since a is a number and b isn't }
  } else if (isNumber(b)) {
    return 1; // b is less than a, since b is a number and a isn't } else {
  return textComparison(a,b);
  }
}
Therefore, for complex comparisons, such as language-sensitive comparison, it is important to test for transitivity thoroughly.

3.3 Buffer Overflows

Some programmers may rely on limitations that are true of ASCII or Latin-1, but fail with general Unicode text. These can cause failures such as buffer overruns if the length of text grows. In particular:

Strings may expand in casing: Fluß → FLUSS → fluss. The expansion factor may change depending on the UTF as well. […]

StefanKarpinski · February 6, 2019, 2:13pm

Neither triviality nor non-triviality are good reasons to put something into in the base language.

Tero_Frondelius · February 6, 2019, 2:53pm

The pull request to fix the issue seems to exist. I didn’t test if it actually fixes the package.

StefanKarpinski · February 6, 2019, 3:17pm

I’ve opened an issue to move the package to the JuliaStrings org. Then it won’t be on Simon to maintain anymore.

laborg · February 6, 2019, 4:01pm

This pull request is flawed. E.g. providing an empty strings already breaks it due to incorrect tuple decomposition of the iterate() result at the beginning.

Palli · February 10, 2019, 3:59pm

Great!

PHP is the language to beat, with natural sorting built in (and thus discoverable):
http://us3.php.net/manual/en/function.strnatcmp.php

With trivial function I meant at least it should be very discoverable, if not in the standard library, then possibly the docs should point to the relevant package.

The default sort functions in almost every programming language are poorly suited for human consumption.

Most languages including made for text processing Perl and Python only have in libraries:

Tamas_Papp · February 10, 2019, 4:27pm

Let’s not get carried away. “Natural sorting” is a nice feature that belongs in a package, like the zillion other specialized functions. It is rather trivial to implement, and has nothing to do with comparing languages.

Incidentally, while I like to read Atwood’s essays, it is apparent that his thinking is biased by his (excellent) work on user interfaces. This kind of sorting may be useful when displaying strings to users, but is not universally preferable to other sorting orders. Eg when comparing keys in a sorted lookup table (Base.searchsortedfirst), the actual order may be irrelevant, I just want it to be fast. Arguing that those who don’t prefer this comparison function to others are not “sane” is just needless hyperbole.

jandehaan · February 10, 2019, 6:34pm

To whom it may concern: for your consideration.

function natural(x, y)
    k(x) = [occursin(r"\d+", s) ? parse(Int, s) : s 
            for s in split(replace(x, r"\d+" => s->" $s "))]
    A = k(x); B= k(y)    
    for (a, b) in zip(A, B)
        if !isequal(a, b)
            return typeof(a) <: typeof(b) ? isless(a, b) :
                   isa(a,Int) ? true : false
        end
    end
    return length(A) < length(B)
end

julia> sort(["a1", "a2", "a10"])
3-element Array{String,1}:
 "a1"
 "a10"
 "a2"

julia> sort(["a1", "a2", "a10"], lt=natural)
3-element Array{String,1}:
 "a1"
 "a2"
 "a10"

evad · July 11, 2020, 7:11am

Tried out, worked like magic.
Can you explain what does A = k(x) mean here?

Topic		Replies	Views
What's wrong with this ordering function? General Usage sort , sorting	5	277	August 24, 2023
WIP: faster string sort Internals & Design strings , sort	92	12410	February 8, 2018
Meaning of `isless`, `<`, `cmp`, `isequal`, `==`, total order, partial order, unordered Internals & Design	24	2320	January 7, 2018
Understanding issorted's lt keyword New to Julia	88	2827	May 14, 2021
Problem processing non utf8 string New to Julia	17	2170	June 1, 2018

Sorting strings containing numbers, so that "A2" < "A10"?

3.2 Text Comparison (Sorting, Searching, Matching)

3.3 Buffer Overflows

Related topics