Sorting strings containing numbers, so that "A2" < "A10"?


#1

I need to sort strings containing numbers. Is there a simple way to sort them, such that, for example, "A2" < "A10"? That is, I want a lexicographic compare, such that numbers in the string are compared by value.

I guess this should be a version compare. Is it implemented in Julia, maybe in a package?


#2

You could try it along this way:

julia> list(s::AbstractString) = zip(split(s, r"[\d]+"), parse.(split(s, r"[^\d]+")))
list (generic function with 1 method)
julia> lex2(s::AbstractString) = map(x -> x[2] == nothing ? x[1] : x[2], list(s))
lex2 (generic function with 1 method)
julia> islesscustom(a,b) = begin
             for x in (zip(lex2(a), lex2(b)))
                 x[1] < x[2] && return true
             end;
             false
          end
islesscustom (generic function with 1 method)
julia> islesscustom("A20", "A3")
false

julia> islesscustom("A3", "A20")
true

#3

Depending on the flexibility needed, here is another method that is a bit faster:

function custom_cmp(x::String)
   number_idx = findfirst(isdigit, x)
   str, num = SubString(x, 1, number_idx-1), SubString(x, number_idx, length(x))
   return str, parse(Int, num)
end

l = ["A5", "A10", "A200", "A20"]

sort(l, by = custom_cmp)
@btime sort(l, lt = islesscustom)
  405.706 μs (577 allocations: 25.81 KiB)

@btime sort(l, by = custom_cmp)
  2.292 μs (33 allocations: 1.31 KiB)

#4

What if there are multiple numbers, “A10B3cd”? Ideally we should have a lexicographic comparison, that detects al the numbers and compares them by value.


#5

As I implied, you need to more clearly define exactly what you expect to put into the comparison. How should “5” vs “A” compare? How should “A53” vs “53A” etc.


#6
julia> function naturalsort(x::Vector{String})
           f = text -> all(isnumber, text) ? Char(parse(Int, text)) : text
           sorter = key -> join(f(c) for c in matchall(r"[0-9]+|[^0-9]+", key))
           sort(x, by=sorter)
       end
naturalsort (generic function with 1 method)

julia> l = ["A5a", "A10b", "A10a", "A200", "A20"];

julia> @btime naturalsort($l)
  10.196 μs (233 allocations: 6.67 KiB)
5-element Array{String,1}:
 "A5a" 
 "A10a"
 "A10b"
 "A20" 
 "A200"

with inspiration from here.


#7

I don’t know if this will help you now, but if you are the one generating the strings, a good technique is to fill the number with zeros.

The following function is useful

numstring(x, n=3) = string(x + 10^n)[2:end]

With this function,

julia> x = "A" .* numstring.([rand(1:500) for i in 1:10], 5)
10-element Array{String,1}:
 "A00077"
 "A00137"
 "A00347"
 "A00050"
 "A00386"
 "A00347"
 "A00272"
 "A00178"
 "A00169"
 "A00160"

julia> sort(x)
10-element Array{String,1}:
 "A00050"
 "A00077"
 "A00137"
 "A00160"
 "A00169"
 "A00178"
 "A00272"
 "A00347"
 "A00347"
 "A00386"

This has the added benefit when naming files, they will appear in order in the file system.


#8

You have a trick, but ls, e.g. ls -lv (can’t speak for Windows…) already has what you need to avoid naming like this:

-v natural sort of (version) numbers within text


#9

I did not know that, cool. Thanks. But is still nice to left pad the numbers with zeros

Paulo