Get shortest/longest string in array

ferrolho · January 21, 2021, 6:26pm

What is the best way to retrieve the shortest or longest string in an array? If there are multiple candidates having the same minimum/maximum length, it’s okay to return just the first one.

This can be solved with the snippet below. Is there a better way to do it, though? Thanks in advance.

arr = ["hello", "beautiful", "world", "!"]
minl, maxl = extrema(length, arr)

shortest = filter(x -> length(x) == minl, arr) |> first
longest = filter(x -> length(x) == maxl, arr) |> first

# or even...
shortest = arr[findfirst(x -> length(x) == minl, arr)]
longest = arr[findfirst(x -> length(x) == maxl, arr)]

stevengj · January 21, 2021, 6:38pm

I would certainly use findfirst rather than filter here (since the latter creates a new array unnecessarily). But you could also easily write a function to do this in a single pass:

function extrema_elements(f, itr)
    i = iterate(itr)
    i === nothing && throw(ArgumentError("iterator must be nonempty"))
    xmin = xmax = i[1]
    vmin = vmax = f(xmin)
    while true
        i = iterate(itr, i[2])
        i === nothing && return (xmin, xmax)
        x = i[1]; v = f(x)
        if v < vmin
            xmin, vmin = x, v
        elseif v > vmax
            xmax, vmax = x, v
        end
    end
end

giving:

julia> extrema_elements(length, arr)
("!", "beautiful")

jling · January 21, 2021, 6:48pm

I have a feeling this can be done elegantly with Transducers.jl based on this example

ferrolho · January 21, 2021, 6:50pm

Thank you for the prompt reply, Professor. The custom function you suggest is great, especially for other collections besides arrays. For example, if instead of an array of strings our input was a set of strings, the filtering from my original post would work, and so would the findfirst, but it would require to first collect the set to an array; however, your function would work regardless and without unnecessary copies!

stevengj · January 21, 2021, 6:53pm

Note also that there is some ambiguity in the definition of “length”:

length(s) is the number of Unicode characters, which might not be what you expect if the text can contain combining characters like diacritical marks, e.g. length("naïve") == 5 but length("naïve") == 6. (If this confuses you, compare collect("naïve") to collect("naïve") and read about Unicode equivalence.)
sizeof(s) is quicker to compute than length (it is O(1) rather than O(n) where n is the length of s) and gives the size in bytes. (This equals the length for ASCII strings, but not for other characters; for example, sizeof("🐨") == 4 because the UTF-8 encoding of this emoji character requires 4 bytes. And sizeof("naïve") == 6 while sizeof("naïve") == 7!)
textwidth(s) gives the length of the string as measured in terminal character cells (roughly), so that textwidth("naïve") == 5. However, textwidth("🐨") == 2 because emoji are typically displayed as taking up 2 terminal cells.
length(Unicode.graphemes(s)) gives the number of Unicode “graphemes” in s, which more closely corresponds to what readers would typically consider “characters”. e.g. length(Unicode.graphemes("naïve")) == 5 and length(Unicode.graphemes("🐨")) == 1. But see the next point:
The number of on-screen symbols (“glyphs”) is ultimately determined by the font and text-rendering system. For example, the FiraCode font renders the ASCII string "->" (2 characters, 2 bytes, 2 graphemes, textwidth 2) as a single → glyph via a ligature.

stevengj · January 21, 2021, 7:02pm

You could also use foldl:

function extrema_elements(f, itr)
    x = first(itr); v = f(x)
    val = foldl(itr, init=(x,v,x,v)) do minmax, x
        v = f(x)
        xmin, vmin, xmax, vmax = minmax
        if v < vmin
            xmin, vmin = x, v
        elseif v > vmax
            xmax, vmax = x, v
        end
        (xmin, vmin, xmax, vmax)
    end
    return val[1], val[3]
end

I’m not sure whether this is clearer than an explicit loop, however.

rafael.guerra · January 21, 2021, 9:13pm

Another way (for the non ambiguous strings):
PS: replaced length by sizeof for speed

arr = ["hello", "beautiful", "world", "!"]
arr[sortperm(sizeof.(arr))][1:length(arr)-1:end]
2-element Vector{String}:
 "!"
 "beautiful"

This variant allocates a bit less:

sortperm(sizeof.(arr)) |> z -> (arr[z[1]], arr[z[end]])
("!", "beautiful")

pabloferz · January 21, 2021, 11:39pm

You can reduce allocation even further with

sortperm(arr, by = length)

Jeff_Emanuel · January 21, 2021, 11:47pm

Why introduce O(n log n) sorting to solve a fundamentally O(n) problem?

Seif_Shebl · January 21, 2021, 11:49pm

The naive implementation is actually hard to beat:

function minmax_elements(arr)
    minl = maxl = arr[1]
    smin = smax = sizeof(minl)
    for a in arr
        sa = sizeof(a)
        if sa > smax
            maxl, smax = a, sa
        elseif sa < smin
            minl, smin = a, sa
        end
    end
    minl, maxl
end 

arr = ["hello", "beautiful", "world", "!"];
@btime minmax_elements($arr)
  5.700 ns (0 allocations: 0 bytes)
("!", "beautiful")

rafael.guerra · January 22, 2021, 12:07am

@Seif_Shebl, you have activated the Julia turbo mode!

rafael.guerra · January 22, 2021, 12:09am

@Jeff_Emanuel, maybe for the pleasure of learning Julia?
PS: and thanks for the insight.

aramirezreyes · January 22, 2021, 12:24am

What about:

julia> a = ["!","beautiful"]
2-element Vector{String}:
 "!"
 "beautiful"

julia> findmax(length,a)
(9, "beautiful")

julia> a = ["!","beautiful", "beautifil"]
3-element Vector{String}:
 "!"
 "beautiful"
 "beautifil"

julia> findmax(length,a)
(9, "beautiful")

DNF · January 22, 2021, 5:57am

This is actually slightly suboptimal, as it reads the first element twice.

rafael.guerra · January 22, 2021, 7:38am

@aramirezreyes, I liked your solution before trying it. It issues error in Win10 Julia 1.6.0-beta1.0:

julia> findmax(length,a)
ERROR: MethodError: no method matching findmax(::typeof(length), ::Vector{String})

Any clues?

DNF · January 22, 2021, 7:49am

I thought this was coming in 1.6, but actually it’s not here until 1.7: https://github.com/JuliaLang/julia/blob/master/NEWS.md#new-library-functions

Seif_Shebl · January 22, 2021, 10:44am

Not a big deal in exchange for elegance. This minimal implementation is still 10X faster than findmin , findmax though.

DNF · January 22, 2021, 11:17am

The strange thing is that your implementation is faster than @stevengj’s when using the sizeof function. I would have thought those two implementations to be identical, except that the latter only evaluates the first element once, so should be marginally faster, but it’s the opposite.

rafael.guerra · January 22, 2021, 12:14pm

This one does not look bad:

sizeof.(arr) |> z -> arr[argmin(z)], arr[argmax(z)]

stevengj · January 22, 2021, 1:06pm

In Julia 1.5, mine is about 13% slower on my machine. However, in Julia 1.6, mine is 13% faster for the original 4-element array, and 24% faster for a longer array arr2 = repeat(arr, 100). I’m not sure why, though; I suspect it gets into the weeds of when the compiler is inlining things etc.

Topic		Replies	Views
Standard way to get first element of iterator satisfying predicate New to Julia	12	4940	November 12, 2020
`findmax` and friends: confusing behaviour to be introduced in 1.7 Internals & Design	96	5969	July 6, 2024
What's the best way to compare items in an array to every other item? General Usage performance , strings , arrays	11	3487	July 26, 2019
Find index of maximum value of array New to Julia question	1	12165	December 9, 2020
String indexing New to Julia indexing	11	3250	April 11, 2020

Get shortest/longest string in array

Related topics