Get shortest/longest string in array

What is the best way to retrieve the shortest or longest string in an array? If there are multiple candidates having the same minimum/maximum length, it’s okay to return just the first one.

This can be solved with the snippet below. Is there a better way to do it, though? Thanks in advance.

arr = ["hello", "beautiful", "world", "!"]
minl, maxl = extrema(length, arr)

shortest = filter(x -> length(x) == minl, arr) |> first
longest = filter(x -> length(x) == maxl, arr) |> first

# or even...
shortest = arr[findfirst(x -> length(x) == minl, arr)]
longest = arr[findfirst(x -> length(x) == maxl, arr)]
1 Like

I would certainly use findfirst rather than filter here (since the latter creates a new array unnecessarily). But you could also easily write a function to do this in a single pass:

function extrema_elements(f, itr)
    i = iterate(itr)
    i === nothing && throw(ArgumentError("iterator must be nonempty"))
    xmin = xmax = i[1]
    vmin = vmax = f(xmin)
    while true
        i = iterate(itr, i[2])
        i === nothing && return (xmin, xmax)
        x = i[1]; v = f(x)
        if v < vmin
            xmin, vmin = x, v
        elseif v > vmax
            xmax, vmax = x, v
        end
    end
end

giving:

julia> extrema_elements(length, arr)
("!", "beautiful")
3 Likes

I have a feeling this can be done elegantly with Transducers.jl based on this example

1 Like

Thank you for the prompt reply, Professor. The custom function you suggest is great, especially for other collections besides arrays. For example, if instead of an array of strings our input was a set of strings, the filtering from my original post would work, and so would the findfirst, but it would require to first collect the set to an array; however, your function would work regardless and without unnecessary copies! :slight_smile:

Note also that there is some ambiguity in the definition of “length”:

  • length(s) is the number of Unicode characters, which might not be what you expect if the text can contain combining characters like diacritical marks, e.g. length("naïve") == 5 but length("naïve") == 6. (If this confuses you, compare collect("naïve") to collect("naïve") and read about Unicode equivalence.)

  • sizeof(s) is quicker to compute than length (it is O(1) rather than O(n) where n is the length of s) and gives the size in bytes. (This equals the length for ASCII strings, but not for other characters; for example, sizeof("🐨") == 4 because the UTF-8 encoding of this emoji character requires 4 bytes. And sizeof("naïve") == 6 while sizeof("naïve") == 7!)

  • textwidth(s) gives the length of the string as measured in terminal character cells (roughly), so that textwidth("naïve") == 5. However, textwidth("🐨") == 2 because emoji are typically displayed as taking up 2 terminal cells.

  • length(Unicode.graphemes(s)) gives the number of Unicode “graphemes” in s, which more closely corresponds to what readers would typically consider “characters”. e.g. length(Unicode.graphemes("naïve")) == 5 and length(Unicode.graphemes("🐨")) == 1. But see the next point:

  • The number of on-screen symbols (“glyphs”) is ultimately determined by the font and text-rendering system. For example, the FiraCode font renders the ASCII string "->" (2 characters, 2 bytes, 2 graphemes, textwidth 2) as a single glyph via a ligature.

7 Likes

You could also use foldl:

function extrema_elements(f, itr)
    x = first(itr); v = f(x)
    val = foldl(itr, init=(x,v,x,v)) do minmax, x
        v = f(x)
        xmin, vmin, xmax, vmax = minmax
        if v < vmin
            xmin, vmin = x, v
        elseif v > vmax
            xmax, vmax = x, v
        end
        (xmin, vmin, xmax, vmax)
    end
    return val[1], val[3]
end

I’m not sure whether this is clearer than an explicit loop, however.

Another way (for the non ambiguous strings):
PS: replaced length by sizeof for speed

arr = ["hello", "beautiful", "world", "!"]
arr[sortperm(sizeof.(arr))][1:length(arr)-1:end]
2-element Vector{String}:
 "!"
 "beautiful"

This variant allocates a bit less:

sortperm(sizeof.(arr)) |> z -> (arr[z[1]], arr[z[end]])
("!", "beautiful")
1 Like

You can reduce allocation even further with

sortperm(arr, by = length)
1 Like

Why introduce O(n log n) sorting to solve a fundamentally O(n) problem?

3 Likes

The naive implementation is actually hard to beat:

function minmax_elements(arr)
    minl = maxl = arr[1]
    smin = smax = sizeof(minl)
    for a in arr
        sa = sizeof(a)
        if sa > smax
            maxl, smax = a, sa
        elseif sa < smin
            minl, smin = a, sa
        end
    end
    minl, maxl
end 

arr = ["hello", "beautiful", "world", "!"];
@btime minmax_elements($arr)
  5.700 ns (0 allocations: 0 bytes)
("!", "beautiful")
2 Likes

@Seif_Shebl, you have activated the Julia turbo mode!

2 Likes

@Jeff_Emanuel, maybe for the pleasure of learning Julia?
PS: and thanks for the insight.

2 Likes

What about:

julia> a = ["!","beautiful"]
2-element Vector{String}:
 "!"
 "beautiful"

julia> findmax(length,a)
(9, "beautiful")

julia> a = ["!","beautiful", "beautifil"]
3-element Vector{String}:
 "!"
 "beautiful"
 "beautifil"

julia> findmax(length,a)
(9, "beautiful")
4 Likes

This is actually slightly suboptimal, as it reads the first element twice.

1 Like

@aramirezreyes, I liked your solution before trying it. It issues error in Win10 Julia 1.6.0-beta1.0:

julia> findmax(length,a)
ERROR: MethodError: no method matching findmax(::typeof(length), ::Vector{String})

Any clues?

I thought this was coming in 1.6, but actually it’s not here until 1.7: https://github.com/JuliaLang/julia/blob/master/NEWS.md#new-library-functions

2 Likes

Not a big deal in exchange for elegance. This minimal implementation is still 10X faster than findmin , findmax though.

The strange thing is that your implementation is faster than @stevengj’s when using the sizeof function. I would have thought those two implementations to be identical, except that the latter only evaluates the first element once, so should be marginally faster, but it’s the opposite.

This one does not look bad:

sizeof.(arr) |> z -> arr[argmin(z)], arr[argmax(z)]
1 Like

In Julia 1.5, mine is about 13% slower on my machine. However, in Julia 1.6, mine is 13% faster for the original 4-element array, and 24% faster for a longer array arr2 = repeat(arr, 100). I’m not sure why, though; I suspect it gets into the weeds of when the compiler is inlining things etc.

2 Likes