Just to note that this issue (unexpected and awkward behavior of by
) is a very old issue (4-digit issue number!) that is still open:
Searching in a collection of element type X
for an element of type Y
is not something, which should be magically work or which needs its own function parameter to handle it.
It would be sufficient to expand the documentation a bit on the by=
parameter and perhaps add an example with types X
and Y
together with other more straight forward examples.
You could do that in a matter of about an hour, which is less time than people spent here, talking about it.
I just got a sign:
Soā¦ sorry for not always being nice.
Hehe, I didnāt badge you, but what I was thinking was: If documentations and books were enough for people to learn, teachers and courses would not be necessary. There is a fallacy in the āimprove documentationā argument, if that was enough forums would become unnecessary with time, something I donāt believe will ever happen.
But back to the topic: it is not that searching, when using by
-style keywords should work for different types, is that when one is applying a transformation to the collection, it is arguably more natural (or notā¦) that the user would be searching for the elements that satisfy the result of the transformation.
The issue in particular is when the transformation does not have a simple inverse function, like:
julia> f(x) = exp(sin(x))
f (generic function with 1 method)
julia> x = collect(0:0.1:100);
julia> sort!(x,by=f);
julia> searchsortedfirst(x, 1., by=f)
817
julia> x[817]
1.0
What I would expect is that the first element for which the transformed value is 1.0, thus the result of
julia> y = f.(x);
julia> searchsortedfirst(y, 1.)
497
julia> x[497]
0.0
There might be good reasons for the behavior be the first one, and since it is trivial to write a custom function for that, all that is fine. I donāt even think that anything can be improved in the documentation, but it is ok to acknowledge that from time to time someone will be surprised by how the function works.
But I think in this case it would, because the docs are very short here.
I think an additional argument for by default not applying the by
function to the element to compare is that, if this was the default behavior, it would be trivial to get the current one: just apply manually by
to the given argument. On the other hand, with the current behavior, the only way to get the other behavior is to invert the by
function, which is not always possible.
I have to agree with @Jean_Michel here. I have already lost some hours of my time because of the current behavior, and have reached the same conclusion.
First, it is, at least to me, very intuitive that if you use the by
argument then you should give the kind of object returned when by
is applied. I say it is intuitive because it was what I have done, multiple times, even after finding the behavior was different. So my argument for intuitiveness is empirical. Of course, other people may have distinct intuitions, but for me, it appeared to be the most natural design and I wrote the code without thinking twice.
Second, in complement to Jeanās argument, sometimes the inverse
is really clunky. If I have an Vector
of struct
s and it is ordered by some field of the struct, and I need to find the struct with a specific value in that field, I will have to create an entirely mock struct
just to have an struct
with the same field set to the searched value (especially if this value is obtained by a getter and not directly available as a property, with properties you can mock a NamedTuple
). Or yet, I will have to materialize a Vector
just of the values of these fields, to search in it instead of the struct
Vector
; so now the O(log(n))
operation needs an additional O(n)
first time setup and memory. Sincerely, this behavior of searchsortedfirst
is one of the things that pushed me to use struct
s of Vector
s (instead of Vector
of struct
s) or even do not use struct
s anyway (just pass tuples of Vector
s around).
Third, this design puts a minor performance concern in the mind of the user. Is by
applied to the element being searched multiple times? If by
is not applied to the element then the answer is immediate (i.e., it is applied zero times), but if it is, then it is applied log(n)
times or it is applied a single time and stored in cache.
Itās great you share such large experience and surprising that so much baggage could be associated with this seemingly harmless functionā¦
If you could point to some sample code illustrating your point, it would be very nice.
Something I whipped up in 5 minutes.
julia> struct S; id :: Int; name :: String; end # a simple record type
julia> as = [S(i, s) for (i, s) in zip([10, 20, 30], ["John", "Mary", "Stuart"])]
3-element Array{S,1}:
S(10, "John")
S(20, "Mary")
S(30, "Stuart")
julia> searchsortedfirst(as, 20; by = s -> s.id) # what I find intuitive
ERROR: type Int64 has no field id
Stacktrace:
[1] getproperty(::Int64, ::Symbol) at ./Base.jl:33
[2] (::var"#7#8")(::Int64) at ./REPL[5]:1
[3] lt(::Base.Order.By{var"#7#8",Base.Order.ForwardOrdering}, ::S, ::Int64) at ./ordering.jl:59
[4] searchsortedfirst at ./sort.jl:183 [inlined]
[5] searchsortedfirst at ./sort.jl:324 [inlined]
[6] #searchsortedfirst#4 at ./sort.jl:326 [inlined]
[7] top-level scope at REPL[5]:1
julia> searchsortedfirst(as, (; id = 20); by = s -> s.id) # the easy workaround
2
julia> get_id(s) = s.id # but what if the access is not direct
get_id (generic function with 1 method)
julia> searchsortedfirst(as, S(20, ""); by = s -> s.id) # the ugly workaround
2
My problem is: S
here is super simple, but what if is a type that does not have an empty constructor and you need to pass many parameters which need to be valid combined to each other? And if this struct allocates but you needed to write some fast code that called searchsortedfirst
multiple times without any allocation? Do you define a new type struct MyIdWrapper; id :: Int; end
and extend get_id
with a method get_id(iw :: MyIdWrapper) = iw.id
just so you can use searchsortedfirst
in peace?
I will have some free time before the year ends. With some hope I can try my hand in that open issue to solve the searchsorted*
problem.
I have just submitted a PR that I hope solves the problem completely: https://github.com/JuliaLang/julia/pull/43509
ok, but why not just apply_by
for the keyword?
I thought the name apply_by_to_key
would be more descriptive. When this new keyword is set to false
, by
is still applied to the vector being searched but not the key. However, perhaps apply_by
is better for other reasons. Could I request that you re-post your suggestion in the discussion of the pull request?
Right now the pull request is failing all CI checks. If I understand the CI log correctly, the PR is failing because Val
is undefined. This didnāt happen when I built it locally, so I will have to look into what is causing the problem.