Problem with searchsortedfirst

Just to note that this issue (unexpected and awkward behavior of by) is a very old issue (4-digit issue number!) that is still open:

https://github.com/JuliaLang/julia/issues/9429

Searching in a collection of element type X for an element of type Y is not something, which should be magically work or which needs its own function parameter to handle it.

It would be sufficient to expand the documentation a bit on the by= parameter and perhaps add an example with types X and Y together with other more straight forward examples.

You could do that in a matter of about an hour, which is less time than people spent here, talking about it.

I just got a sign:
image
Soā€¦ sorry for not always being nice.

Hehe, I didnā€™t badge you, but what I was thinking was: If documentations and books were enough for people to learn, teachers and courses would not be necessary. There is a fallacy in the ā€œimprove documentationā€ argument, if that was enough forums would become unnecessary with time, something I donā€™t believe will ever happen.

But back to the topic: it is not that searching, when using by-style keywords should work for different types, is that when one is applying a transformation to the collection, it is arguably more natural (or notā€¦) that the user would be searching for the elements that satisfy the result of the transformation.

The issue in particular is when the transformation does not have a simple inverse function, like:

julia> f(x) = exp(sin(x))
f (generic function with 1 method)

julia> x = collect(0:0.1:100);

julia> sort!(x,by=f);

julia> searchsortedfirst(x, 1., by=f)
817

julia> x[817]
1.0

What I would expect is that the first element for which the transformed value is 1.0, thus the result of

julia> y = f.(x);

julia> searchsortedfirst(y, 1.)
497

julia> x[497]
0.0

There might be good reasons for the behavior be the first one, and since it is trivial to write a custom function for that, all that is fine. I donā€™t even think that anything can be improved in the documentation, but it is ok to acknowledge that from time to time someone will be surprised by how the function works.

1 Like

But I think in this case it would, because the docs are very short here.

3 Likes

I think an additional argument for by default not applying the by function to the element to compare is that, if this was the default behavior, it would be trivial to get the current one: just apply manually by to the given argument. On the other hand, with the current behavior, the only way to get the other behavior is to invert the by function, which is not always possible.

2 Likes

I have to agree with @Jean_Michel here. I have already lost some hours of my time because of the current behavior, and have reached the same conclusion.

First, it is, at least to me, very intuitive that if you use the by argument then you should give the kind of object returned when by is applied. I say it is intuitive because it was what I have done, multiple times, even after finding the behavior was different. So my argument for intuitiveness is empirical. Of course, other people may have distinct intuitions, but for me, it appeared to be the most natural design and I wrote the code without thinking twice.

Second, in complement to Jeanā€™s argument, sometimes the inverse is really clunky. If I have an Vector of structs and it is ordered by some field of the struct, and I need to find the struct with a specific value in that field, I will have to create an entirely mock struct just to have an struct with the same field set to the searched value (especially if this value is obtained by a getter and not directly available as a property, with properties you can mock a NamedTuple). Or yet, I will have to materialize a Vector just of the values of these fields, to search in it instead of the struct Vector; so now the O(log(n)) operation needs an additional O(n) first time setup and memory. Sincerely, this behavior of searchsortedfirst is one of the things that pushed me to use structs of Vectors (instead of Vector of structs) or even do not use structs anyway (just pass tuples of Vectors around).

Third, this design puts a minor performance concern in the mind of the user. Is by applied to the element being searched multiple times? If by is not applied to the element then the answer is immediate (i.e., it is applied zero times), but if it is, then it is applied log(n) times or it is applied a single time and stored in cache.

4 Likes

Itā€™s great you share such large experience and surprising that so much baggage could be associated with this seemingly harmless functionā€¦

If you could point to some sample code illustrating your point, it would be very nice.

Something I whipped up in 5 minutes.

julia> struct S; id :: Int; name :: String; end # a simple record type

julia> as = [S(i, s) for (i, s) in zip([10, 20, 30], ["John", "Mary", "Stuart"])]
3-element Array{S,1}:
 S(10, "John")
 S(20, "Mary")
 S(30, "Stuart")

julia> searchsortedfirst(as, 20; by = s -> s.id) # what I find intuitive
ERROR: type Int64 has no field id
Stacktrace:
 [1] getproperty(::Int64, ::Symbol) at ./Base.jl:33
 [2] (::var"#7#8")(::Int64) at ./REPL[5]:1
 [3] lt(::Base.Order.By{var"#7#8",Base.Order.ForwardOrdering}, ::S, ::Int64) at ./ordering.jl:59
 [4] searchsortedfirst at ./sort.jl:183 [inlined]
 [5] searchsortedfirst at ./sort.jl:324 [inlined]
 [6] #searchsortedfirst#4 at ./sort.jl:326 [inlined]
 [7] top-level scope at REPL[5]:1


julia> searchsortedfirst(as, (; id = 20); by = s -> s.id) # the easy workaround
2

julia> get_id(s) = s.id # but what if the access is not direct
get_id (generic function with 1 method)

julia> searchsortedfirst(as, S(20, ""); by = s -> s.id) # the ugly workaround
2

My problem is: S here is super simple, but what if is a type that does not have an empty constructor and you need to pass many parameters which need to be valid combined to each other? And if this struct allocates but you needed to write some fast code that called searchsortedfirst multiple times without any allocation? Do you define a new type struct MyIdWrapper; id :: Int; end and extend get_id with a method get_id(iw :: MyIdWrapper) = iw.id just so you can use searchsortedfirst in peace?

I will have some free time before the year ends. With some hope I can try my hand in that open issue to solve the searchsorted* problem.

3 Likes

I have just submitted a PR that I hope solves the problem completely: https://github.com/JuliaLang/julia/pull/43509

3 Likes

ok, but why not just apply_by for the keyword?

I thought the name apply_by_to_key would be more descriptive. When this new keyword is set to false, by is still applied to the vector being searched but not the key. However, perhaps apply_by is better for other reasons. Could I request that you re-post your suggestion in the discussion of the pull request?

Right now the pull request is failing all CI checks. If I understand the CI log correctly, the PR is failing because Val is undefined. This didnā€™t happen when I built it locally, so I will have to look into what is causing the problem.