How to searchsortedfirst when each element requires expensive transformation

I have a problem where I have an expensive function expensive() and I want to find the first integer i in 0:n such that expensive(i) > a where a is some constant and we know that expensive(j) > a for j >= i. I want to minimize the number of calls to expensive(). The inbuilt function searchsortedfirst does something similar in that it finds the index of a sorted array the first element that is greater than or equal to a threshold. Is there a way I can use that function but only evaluate expensive() as needed?

I think searchsortedfirst does what you want, just providing by what function you want the elements to be evaluated.

julia> x = sort(rand(10))
10-element Array{Float64,1}:
 0.007505240568856264
 0.170164020425573
 0.22180879545782117
 0.2423166237740766
 0.4174836106540334
 0.4778024570655919
 0.5637894682887481
 0.7442152133727271
 0.9259268508181902
 0.9533116492072238

julia> expensive(x) = x #any function

julia> searchsortedfirst(x,0.5,by=x->expensive(x))
7

(That this calls the least number of times the “expensive” function is something that I would trust in the implementation of this search)

edit: this is not working as I expected, and I don’t know why, as described below.

1 Like

I am not sure why @lmiq has deleted his answer, but I think it is correct. You should call searchsortedfirst passing expensive to the by keyword parameter. Obviously, this assumes the property you mentioned, that the vector/range has the guarantee to be sorted by this expensive function even if you do not really know which exact values it would return.

The code for that is a little obfuscated inside the source (because it is generated by metaprogramming) but you can check this behaviour by putting print statements inside the expensive function.

My original answer was that using a transformation in searcshortedfirst would do what he wanted. However, I do not understand how these transformations are being interpreted in the searchsortedfirst function:

julia> x = [1, 2, 3]
3-element Array{Int64,1}:
 1
 2
 3

julia> searchsortedfirst(x,2)
2

julia> searchsortedfirst(x,2,by=x->x-1)
2

Shouldn’t the searchsortedfirst(x,2,by=x->x-1) return 3?

searchsortedfirst also applies the by function to the input 2. For example:

julia> x = [1, 2, 3]
3-element Array{Int64,1}:
 1
 2
 3

julia> searchsortedfirst(x, 1.5; by = x -> (println(x); x));
2
1.5
1
1.5

So I don’t think using the by keyword is the answer here (even because the expensive function would be called several times for the same input). Maybe wrapping 0:n into a new AbstractVector type that does the expensive computation when calling getindex is the best answer.

1 Like

I suppose the searchsortedfirst evaluates the function a minimal number of times (or at least it is a reasonable implementation of that) to find the element, doesn’t it?

(Concerning the fact that the function is applied also to the condition, that is unnatural to me, but thanks for clarifying that).

Taking this into account, you would need to know one x such that expensive(x) is the threshold you want to follow the code I provided there.

My example shows that 1.5 is printed two times, so by is applied to 1.5 two times. I didn’t dig into the source code, but I suppose by will be called on the same input for each comparison using isless (or the lt keyword).

Sure, but that is a particular case in which the search implies at least two evaluations:

julia> x = collect(1:100);

julia> searchsortedfirst(x, 20.5; by = x -> (println(x); x));
50
20.5
25
20.5
12
20.5
18
20.5
21
20.5
19
20.5
20
20.5

(evaluated 7 times)

1 Like

Of course. That’s what I meant. Note that by was evaluated 14 times (two for each of the 7 comparisons).

2 Likes

Just to make it clearer:

struct NewType{T, V, F} <: AbstractVector{T}
	v::V
	f::F
	NewType{T}(v::V, f::F) where {T, V, F} = new{T, V, F}(v, f)
end
Base.size(x::NewType) = size(x.v)
Base.axes(x::NewType) = axes(x.v)
Base.@propagate_inbounds Base.getindex(x::NewType, I...) = x.f(x.v[I...])

x = 1:100
input = NewType{Int}(x, x -> (println(x); x))

searchsortedfirst(input, 20.5);

output:

50
25
12
18
21
19
20
1 Like

@lucas711642 @lmiq thank you so much! I used the NewType method suggested by @lucas711642. I think it would be more intuitive if this was the default behavior of the inbuilt function.

I think the reason for the current behavior is that searchsortedfirst(v, x) is supposed to give you the position at which x must be inserted in v so that the resulting vector is still sorted. In this case, by must be applied to x too.

To be sincere, I don’t understand what you did there. Doesn’t this line evaluate the function for every element of the list?

I’m just creating an instance of NewType, according to the constructor defined in:

The parameter T is used just to give the compiler the eltype of NewType. If this optimization isn’t necessary, we could just define the following:

struct NewType{V, F} <: AbstractVector{Any}
	v::V
	f::F
end

But this has type-instability problems.

I’m assuming that T is the type the type of each element. Maybe a better definition of getindex is given by:

Base.@propagate_inbounds Base.getindex(x::NewType{T}, I...) where T = convert(T, x.f(x.v[I...]))::T

The ::T guarantees that x[I...] isa eltype(x).

To sum up:

struct NewType{T, V, F} <: AbstractVector{T}
	v::V
	f::F
	NewType{T}(v::V, f::F) where {T, V, F} = new{T, V, F}(v, f)
end
Base.size(x::NewType) = size(x.v)
Base.axes(x::NewType) = axes(x.v)
Base.@propagate_inbounds Base.getindex(x::NewType{T}, I...) where T = convert(T, x.f(x.v[I...]))::T

input1 = NewType{Int}(1:100, x -> x ^ 2)
eltype(input1) # Int
input1[5] # outputs 25 (Int)

input2 = NewType{Float64}(1:100, x -> x ^ 2)
eltype(input2) # Float64
input2[5] # outputs 25. (Float64)
4 Likes

I think you mean in the REPL… Of course, if you execute this on the terminal you need the semicolon. Otherwise, it will call display which in turn will call getindex and execute the function unnecessarily.

1 Like

Perhaps my lack of understanding comes from the fact that, when I put this line in the REPL, I get as a result, instantly, the complete vector of x^2:

julia> input1 = NewType{Int}(1:100, x -> x ^ 2)
100-element NewType{Int64,UnitRange{Int64},var"#3#4"}:
     1
     4
     9
    16
    25
     â‹®
  9216
  9409
  9604
  9801
 10000

If, instead, I suppress the show function, should I understand that the the same computation was not performed? (sorry if my question is too stupid. To be truth, it easier to me to understand the Base.searchsortedfirst function and implement it with modifications - the function is quite straightforward - than to get used to these fancy manipulation of types).

Great, that was it.

1 Like

Let me just finish here by saying that what I love about Julia is that we can write the code we are more familiar with, and get nice performance. Here I have just implemented a simple my_firstsortedfirst function to solve the problem above. Doing this is much more natural to me than using the elegant solution of @lucas711642, which I think involves understanding much deeper things about the workings of the type system:

using Test
using BenchmarkTools

function my_searchsortedfirst(v,x;by=x->x)
  lo = 0
  hi = length(v)
  @inbounds while hi > lo + 1
    m = trunc(Int64,(hi+lo)/2)
    if x > by(v[m])
      lo = m
    else
      hi = m
    end
  end
  return hi
end

# Expensive function that does not change the order
function f(x)
  s = 0.
  for i in 1:1_000_000
    s += sqrt(x/i)
  end
  s
end

# lucas way
struct NewType{T, V, F} <: AbstractVector{T}
  v::V
  f::F
  NewType{T}(v::V, f::F) where {T, V, F} = new{T, V, F}(v, f)
end
Base.size(x::NewType) = size(x.v)
Base.axes(x::NewType) = axes(x.v)
Base.@propagate_inbounds Base.getindex(x::NewType, I...) = x.f(x.v[I...])

# input
x = f.(rand(1:100,10))
v = collect(1:100)

# test
t1 = [ my_searchsortedfirst(v,i,by=x->f(x)) for i in x ]
input = NewType{Int}(v, x -> f(x))
t2 = [ searchsortedfirst(input,i) for i in x ]
@test t1 == t2

# benchmark
x = f(20)
@btime my_searchsortedfirst($v,$x,by=x->f(x))
@btime searchsortedfirst($input,$x)


Result:

julia> include("./my_searchsortedfirst.jl")
  10.527 ms (0 allocations: 0 bytes)
  10.530 ms (0 allocations: 0 bytes)

So cool :slightly_smiling_face:

2 Likes

Cool! One minor thing that might be worth mentioning is that it is a good practice to write just f instead of x -> f(x). In this way you do not define a new function, but rather just reference to an existing one.

For example, there is an identity function that implements exactly x -> x, so your definition of my_searchsortedfirst could be written as my_searchsortedfirst(v, x; by = identity).

2 Likes

By the way - NewType is essentially a simplified version of MappedArray from MappedArrays.jl, a function + array wrapper which lazily applies the transformation. For example:

A = mappedarray(f, 0:n)
searchsortedfirst(A, x)
3 Likes