What is the best way to retrieve the shortest or longest string in an array? If there are multiple candidates having the same minimum/maximum length, it’s okay to return just the first one.
This can be solved with the snippet below. Is there a better way to do it, though? Thanks in advance.
I would certainly use findfirst rather than filter here (since the latter creates a new array unnecessarily). But you could also easily write a function to do this in a single pass:
function extrema_elements(f, itr)
i = iterate(itr)
i === nothing && throw(ArgumentError("iterator must be nonempty"))
xmin = xmax = i[1]
vmin = vmax = f(xmin)
while true
i = iterate(itr, i[2])
i === nothing && return (xmin, xmax)
x = i[1]; v = f(x)
if v < vmin
xmin, vmin = x, v
elseif v > vmax
xmax, vmax = x, v
end
end
end
Thank you for the prompt reply, Professor. The custom function you suggest is great, especially for other collections besides arrays. For example, if instead of an array of strings our input was a set of strings, the filtering from my original post would work, and so would the findfirst, but it would require to first collect the set to an array; however, your function would work regardless and without unnecessary copies!
Note also that there is some ambiguity in the definition of “length”:
length(s) is the number of Unicode characters, which might not be what you expect if the text can contain combining characters like diacritical marks, e.g. length("naïve") == 5 but length("naïve") == 6. (If this confuses you, compare collect("naïve") to collect("naïve") and read about Unicode equivalence.)
sizeof(s) is quicker to compute than length (it is O(1) rather than O(n) where n is the length of s) and gives the size in bytes. (This equals the length for ASCII strings, but not for other characters; for example, sizeof("🐨") == 4 because the UTF-8 encoding of this emoji character requires 4 bytes. And sizeof("naïve") == 6 while sizeof("naïve") == 7!)
textwidth(s) gives the length of the string as measured in terminal character cells (roughly), so that textwidth("naïve") == 5. However, textwidth("🐨") == 2 because emoji are typically displayed as taking up 2 terminal cells.
length(Unicode.graphemes(s)) gives the number of Unicode “graphemes” in s, which more closely corresponds to what readers would typically consider “characters”. e.g. length(Unicode.graphemes("naïve")) == 5 and length(Unicode.graphemes("🐨")) == 1. But see the next point:
The number of on-screen symbols (“glyphs”) is ultimately determined by the font and text-rendering system. For example, the FiraCode font renders the ASCII string "->" (2 characters, 2 bytes, 2 graphemes, textwidth 2) as a single → glyph via a ligature.
function extrema_elements(f, itr)
x = first(itr); v = f(x)
val = foldl(itr, init=(x,v,x,v)) do minmax, x
v = f(x)
xmin, vmin, xmax, vmax = minmax
if v < vmin
xmin, vmin = x, v
elseif v > vmax
xmax, vmax = x, v
end
(xmin, vmin, xmax, vmax)
end
return val[1], val[3]
end
I’m not sure whether this is clearer than an explicit loop, however.
The naive implementation is actually hard to beat:
function minmax_elements(arr)
minl = maxl = arr[1]
smin = smax = sizeof(minl)
for a in arr
sa = sizeof(a)
if sa > smax
maxl, smax = a, sa
elseif sa < smin
minl, smin = a, sa
end
end
minl, maxl
end
arr = ["hello", "beautiful", "world", "!"];
@btime minmax_elements($arr)
5.700 ns (0 allocations: 0 bytes)
("!", "beautiful")
The strange thing is that your implementation is faster than @stevengj’s when using the sizeof function. I would have thought those two implementations to be identical, except that the latter only evaluates the first element once, so should be marginally faster, but it’s the opposite.
In Julia 1.5, mine is about 13% slower on my machine. However, in Julia 1.6, mine is 13% faster for the original 4-element array, and 24% faster for a longer array arr2 = repeat(arr, 100). I’m not sure why, though; I suspect it gets into the weeds of when the compiler is inlining things etc.