How to searchsortedfirst when each element requires expensive transformation

MFairley · September 16, 2020, 5:31pm

I have a problem where I have an expensive function expensive() and I want to find the first integer i in 0:n such that expensive(i) > a where a is some constant and we know that expensive(j) > a for j >= i. I want to minimize the number of calls to expensive(). The inbuilt function searchsortedfirst does something similar in that it finds the index of a sorted array the first element that is greater than or equal to a threshold. Is there a way I can use that function but only evaluate expensive() as needed?

lmiq · September 16, 2020, 5:40pm

I think searchsortedfirst does what you want, just providing by what function you want the elements to be evaluated.

julia> x = sort(rand(10))
10-element Array{Float64,1}:
 0.007505240568856264
 0.170164020425573
 0.22180879545782117
 0.2423166237740766
 0.4174836106540334
 0.4778024570655919
 0.5637894682887481
 0.7442152133727271
 0.9259268508181902
 0.9533116492072238

julia> expensive(x) = x #any function

julia> searchsortedfirst(x,0.5,by=x->expensive(x))
7

(That this calls the least number of times the “expensive” function is something that I would trust in the implementation of this search)

edit: this is not working as I expected, and I don’t know why, as described below.

Henrique_Becker · September 16, 2020, 5:52pm

I am not sure why @lmiq has deleted his answer, but I think it is correct. You should call searchsortedfirst passing expensive to the by keyword parameter. Obviously, this assumes the property you mentioned, that the vector/range has the guarantee to be sorted by this expensive function even if you do not really know which exact values it would return.

The code for that is a little obfuscated inside the source (because it is generated by metaprogramming) but you can check this behaviour by putting print statements inside the expensive function.

lmiq · September 16, 2020, 5:53pm

My original answer was that using a transformation in searcshortedfirst would do what he wanted. However, I do not understand how these transformations are being interpreted in the searchsortedfirst function:

julia> x = [1, 2, 3]
3-element Array{Int64,1}:
 1
 2
 3

julia> searchsortedfirst(x,2)
2

julia> searchsortedfirst(x,2,by=x->x-1)
2

Shouldn’t the searchsortedfirst(x,2,by=x->x-1) return 3?

lucas711642 · September 16, 2020, 6:13pm

searchsortedfirst also applies the by function to the input 2. For example:

julia> x = [1, 2, 3]
3-element Array{Int64,1}:
 1
 2
 3

julia> searchsortedfirst(x, 1.5; by = x -> (println(x); x));
2
1.5
1
1.5

So I don’t think using the by keyword is the answer here (even because the expensive function would be called several times for the same input). Maybe wrapping 0:n into a new AbstractVector type that does the expensive computation when calling getindex is the best answer.

lmiq · September 16, 2020, 6:16pm

I suppose the searchsortedfirst evaluates the function a minimal number of times (or at least it is a reasonable implementation of that) to find the element, doesn’t it?

(Concerning the fact that the function is applied also to the condition, that is unnatural to me, but thanks for clarifying that).

Taking this into account, you would need to know one x such that expensive(x) is the threshold you want to follow the code I provided there.

lucas711642 · September 16, 2020, 6:20pm

My example shows that 1.5 is printed two times, so by is applied to 1.5 two times. I didn’t dig into the source code, but I suppose by will be called on the same input for each comparison using isless (or the lt keyword).

lmiq · September 16, 2020, 6:22pm

Sure, but that is a particular case in which the search implies at least two evaluations:

julia> x = collect(1:100);

julia> searchsortedfirst(x, 20.5; by = x -> (println(x); x));
50
20.5
25
20.5
12
20.5
18
20.5
21
20.5
19
20.5
20
20.5

(evaluated 7 times)

lucas711642 · September 16, 2020, 6:24pm

Of course. That’s what I meant. Note that by was evaluated 14 times (two for each of the 7 comparisons).

lucas711642 · September 16, 2020, 6:38pm

Just to make it clearer:

struct NewType{T, V, F} <: AbstractVector{T}
	v::V
	f::F
	NewType{T}(v::V, f::F) where {T, V, F} = new{T, V, F}(v, f)
end
Base.size(x::NewType) = size(x.v)
Base.axes(x::NewType) = axes(x.v)
Base.@propagate_inbounds Base.getindex(x::NewType, I...) = x.f(x.v[I...])

x = 1:100
input = NewType{Int}(x, x -> (println(x); x))

searchsortedfirst(input, 20.5);

output:

MFairley · September 16, 2020, 7:24pm

@lucas711642 @lmiq thank you so much! I used the NewType method suggested by @lucas711642. I think it would be more intuitive if this was the default behavior of the inbuilt function.

lucas711642 · September 16, 2020, 7:36pm

I think the reason for the current behavior is that searchsortedfirst(v, x) is supposed to give you the position at which x must be inserted in v so that the resulting vector is still sorted. In this case, by must be applied to x too.

lmiq · September 16, 2020, 8:05pm

To be sincere, I don’t understand what you did there. Doesn’t this line evaluate the function for every element of the list?

lucas711642 · September 16, 2020, 8:22pm

I’m just creating an instance of NewType, according to the constructor defined in:

The parameter T is used just to give the compiler the eltype of NewType. If this optimization isn’t necessary, we could just define the following:

struct NewType{V, F} <: AbstractVector{Any}
	v::V
	f::F
end

But this has type-instability problems.

I’m assuming that T is the type the type of each element. Maybe a better definition of getindex is given by:

Base.@propagate_inbounds Base.getindex(x::NewType{T}, I...) where T = convert(T, x.f(x.v[I...]))::T

The ::T guarantees that x[I...] isa eltype(x).

To sum up:

struct NewType{T, V, F} <: AbstractVector{T}
	v::V
	f::F
	NewType{T}(v::V, f::F) where {T, V, F} = new{T, V, F}(v, f)
end
Base.size(x::NewType) = size(x.v)
Base.axes(x::NewType) = axes(x.v)
Base.@propagate_inbounds Base.getindex(x::NewType{T}, I...) where T = convert(T, x.f(x.v[I...]))::T

input1 = NewType{Int}(1:100, x -> x ^ 2)
eltype(input1) # Int
input1[5] # outputs 25 (Int)

input2 = NewType{Float64}(1:100, x -> x ^ 2)
eltype(input2) # Float64
input2[5] # outputs 25. (Float64)

lucas711642 · September 16, 2020, 8:37pm

I think you mean in the REPL… Of course, if you execute this on the terminal you need the semicolon. Otherwise, it will call display which in turn will call getindex and execute the function unnecessarily.

lmiq · September 16, 2020, 8:41pm

Perhaps my lack of understanding comes from the fact that, when I put this line in the REPL, I get as a result, instantly, the complete vector of x^2:

julia> input1 = NewType{Int}(1:100, x -> x ^ 2)
100-element NewType{Int64,UnitRange{Int64},var"#3#4"}:
     1
     4
     9
    16
    25
     ⋮
  9216
  9409
  9604
  9801
 10000

If, instead, I suppress the show function, should I understand that the the same computation was not performed? (sorry if my question is too stupid. To be truth, it easier to me to understand the Base.searchsortedfirst function and implement it with modifications - the function is quite straightforward - than to get used to these fancy manipulation of types).

lmiq · September 16, 2020, 8:41pm

Great, that was it.

lmiq · September 17, 2020, 12:55am

Let me just finish here by saying that what I love about Julia is that we can write the code we are more familiar with, and get nice performance. Here I have just implemented a simple my_firstsortedfirst function to solve the problem above. Doing this is much more natural to me than using the elegant solution of @lucas711642, which I think involves understanding much deeper things about the workings of the type system:

using Test
using BenchmarkTools

function my_searchsortedfirst(v,x;by=x->x)
  lo = 0
  hi = length(v)
  @inbounds while hi > lo + 1
    m = trunc(Int64,(hi+lo)/2)
    if x > by(v[m])
      lo = m
    else
      hi = m
    end
  end
  return hi
end

# Expensive function that does not change the order
function f(x)
  s = 0.
  for i in 1:1_000_000
    s += sqrt(x/i)
  end
  s
end

# lucas way
struct NewType{T, V, F} <: AbstractVector{T}
  v::V
  f::F
  NewType{T}(v::V, f::F) where {T, V, F} = new{T, V, F}(v, f)
end
Base.size(x::NewType) = size(x.v)
Base.axes(x::NewType) = axes(x.v)
Base.@propagate_inbounds Base.getindex(x::NewType, I...) = x.f(x.v[I...])

# input
x = f.(rand(1:100,10))
v = collect(1:100)

# test
t1 = [ my_searchsortedfirst(v,i,by=x->f(x)) for i in x ]
input = NewType{Int}(v, x -> f(x))
t2 = [ searchsortedfirst(input,i) for i in x ]
@test t1 == t2

# benchmark
x = f(20)
@btime my_searchsortedfirst($v,$x,by=x->f(x))
@btime searchsortedfirst($input,$x)

Result:

julia> include("./my_searchsortedfirst.jl")
  10.527 ms (0 allocations: 0 bytes)
  10.530 ms (0 allocations: 0 bytes)

So cool

lucas711642 · September 17, 2020, 2:59am

Cool! One minor thing that might be worth mentioning is that it is a good practice to write just f instead of x -> f(x). In this way you do not define a new function, but rather just reference to an existing one.

For example, there is an identity function that implements exactly x -> x, so your definition of my_searchsortedfirst could be written as my_searchsortedfirst(v, x; by = identity).

jondeuce · September 17, 2020, 5:35pm

By the way - NewType is essentially a simplified version of MappedArray from MappedArrays.jl, a function + array wrapper which lazily applies the transformation. For example:

A = mappedarray(f, 0:n)
searchsortedfirst(A, x)

Topic		Replies	Views
Problem with searchsortedfirst General Usage	30	1237	December 21, 2021
Alternative for `searchsortedfirst` for repeatedly searching the same array with concentrated values Performance	6	465	November 13, 2020
Fastest way to fetch an index General Usage	29	812	March 19, 2024
Findfirst on sorted array General Usage question	14	4117	June 23, 2020
Searchsorted by=<transform> keyword question General Usage question	2	327	May 20, 2020

How to searchsortedfirst when each element requires expensive transformation

Related topics