Find filtered min/max index

Every now and then I need to find the index of the min/max element of a filtered data structure.
I was wondering: how do you normally tackle this fairly common problem?

As a first reaction I think of findmin(o, filter(f, data)), but filtering doesn’t preserve the original indices.
I ended up using findmin(o, skipmissing([f(d) ? d : missing for d in data])), where I replace the elements that should be filtered out with missing.

MWE:

julia> using Random

julia> rng = MersenneTwister(0);

julia> data = [(rand(rng), rand(rng)) for _ in 1:10];

julia> dataobjective(d) = d[2];

julia> filterdata(d) = d[1] > 0.5;

julia> findmin(dataobjective, skipmissing([filterdata(d) ? d : missing for d in data]))
(0.5392892841426182, 6)

It’s very convenient that skipmissing propagates the original indices. Is there something like skipfilter that would do exactly what skipmissing does, but for elements satisfying a predicate?
And how would you solve the above problem?

Check out @aplavin’s Skipper.jl, which provides an arbitrary-predicate skip that should totally be in Base imo.
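
Applied to the OP’s data, that could look something like the sketch below. This is an assumption on my part: I’m taking it that skip supports the same two-argument findmin(f, itr) form that skipmissing does in the OP, since Skipper models itself on skipmissing.

using Skipper

# Sketch, untested here: skip the elements failing the OP's predicate,
# then minimize the OP's objective over what remains. Assumes
# findmin(f, skip(pred, itr)) behaves like findmin(f, skipmissing(itr)).
findmin(dataobjective, skip(!filterdata, data))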

minimum(t for t in zip(last.(data), eachindex(data)) if t[1] > 0.5)

The elegant @lmiq solution applied to the (modified :smile:) OP case:

findmin(x -> x[2] > 0.5 ? x[2] : +Inf, data) 

You can apply the predicate inside the function passed to findmin, mapping the filtered-out elements to +Inf, for example:

julia> data = collect(1:10)
10-element Vector{Int64}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

julia> findmin(x -> x > 3 ? x : +Inf, data)
(4, 4)

or, similar to the OP example:

julia> data = [(rand(),rand()) for _ in 1:10];

julia> findmin(d -> d[1] > 0.5 ? d[2] : +Inf, data)
(0.42413123524471796, 8)

Okay, I read the OP more carefully:

minimum((t[1][2], t[2]) for t in zip(data, eachindex(data)) if t[1][1] > 0.5)

Link?

I also had trouble finding it with Google/Bing. It seems they don’t index GitLab as well as GitHub.


How about:

julia> withmissing = [filterdata(d) ? d : missing for d in data];

julia> findmin(map(withmissing) do x
           ismissing(x) ? (true, 0.0) : (false, dataobjective(x))
       end)
((false, 0.5392892841426182), 6)

This generalizes well to other cases, but needs a tiny addition to access the value. More importantly, it handles the all-missing case, which could otherwise cause trouble.
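
For instance, with a toy vector of my own where everything is filtered out (note that findmin(skipmissing(allmissing)) would simply throw here, since nothing is left to reduce over):

julia> allmissing = [missing, missing];

julia> findmin(map(x -> ismissing(x) ? (true, 0.0) : (false, x), allmissing))
((true, 0.0), 1)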

Or, the same as rocco_sprmnt21’s, but perhaps more readable:

minimum([(dataobjective(x), i) for (i, x) in enumerate(withmissing) if !ismissing(x)])

That should definitely go into Base. Why isn’t it there?

Just checked it out:

julia> using Skipper, BenchmarkTools  # BenchmarkTools for @btime below

julia> data = collect(1:10);

julia> findmin(skip(x -> x < 4, data))
(4, 4)

julia> @btime findmin(skip(x -> x < 4, $data))
  21.599 ns (0 allocations: 0 bytes)
(4, 4)

Aside from inverting the predicate, what is the difference between skip and Iterators.filter?

The intent of filter is to construct an entirely new collection excluding the filtered elements, and Iterators.filter is a lazy variant.

The intent of skipmissing (and skip) is different: we intend to continue working with the same collection, but we’re specifying that for this operation certain elements won’t be under consideration.

The example here is a good one for demonstrating the difference; because filter returns a brand-new collection, the indices will be entirely different from those of the original collection. Reflecting this, Iterators.filter isn’t even indexable. skip, on the other hand, does what we want here.

julia> using Skipper
       data = collect(1:10)
       findmin(skip(<(4), data))
(4, 4)

julia> findmin(Iterators.filter(!<(4), data))
ERROR: MethodError: no method matching keys(::Base.Iterators.Filter{ComposedFunction{typeof(!), Base.Fix2{typeof(<), Int64}}, Vector{Int64}})

In other words, skip makes a view while filter and Iterators.filter make a copy?

Sort of. I’d say skip creates a view of the original collection. In contrast, filter creates an entirely new (copied and modified) collection, and Iterators.filter creates a view of what the new collection will look like without actually collecting it.

When I think about it though, aside from the fact that we want generators to resemble comprehensions, I can’t think of any case where you’d use Iterators.filter but couldn’t use skip, whereas I can think of plenty of cases where you’d use skip but couldn’t use Iterators.filter. Maybe a lazy filter was never quite the right idea to begin with?

Either that, or maybe the wrong idea is actually the eager filter: by implicitly creating a new collection, it has trained us to believe that a filtered collection is a new collection. Maybe Base.filter should actually be lazy and act like skip with its predicate inverted, and you’d have to explicitly call collect for the indices to change. That would be more efficient when building filter->map->reduce chains, and by preserving indices it would also make skip unnecessary.
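
To make that concrete, here is a rough sketch of my own (not Skipper.jl’s implementation; the Keep name is purely illustrative) of what an index-preserving lazy filter over an array could look like:

# Illustrative sketch only: a lazy, index-preserving "keep".
# Iterating yields just the elements satisfying pred, while keys/getindex
# still refer to the parent array, so findmin and friends report original indices.
struct Keep{P,A<:AbstractArray}
    pred::P
    parent::A
end

Base.IteratorSize(::Type{<:Keep}) = Base.SizeUnknown()
Base.eltype(::Type{Keep{P,A}}) where {P,A} = eltype(A)

function Base.iterate(k::Keep, i = firstindex(k.parent))
    while i <= lastindex(k.parent)
        x = k.parent[i]
        k.pred(x) && return (x, i + 1)
        i += 1
    end
    return nothing
end

# Keys are the parent's indices at which the predicate holds.
Base.keys(k::Keep) = Iterators.filter(i -> k.pred(k.parent[i]), keys(k.parent))
Base.getindex(k::Keep, i) = k.parent[i]
Base.pairs(k::Keep) = (i => k.parent[i] for i in keys(k))

With that, findmin(Keep(>(3), collect(1:10))) should report the original index, i.e. (4, 4), just like the skip example above.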

Definitely brings to mind @jakobnissen’s complaints about Julia’s FP primitives.

I don’t think anybody has opened an issue or PR.


I’ll just mention that the opposite of skip could be keep. I know some people find the name “filter” confusing, since it sounds like it’s filtering elements out when it’s actually the opposite.


Would it be possible to transition Julia’s generator and comprehension syntax to lower to a Keep type instead of a Filter type?

If anything could justify a Julia 2.0, it might be the Base FP primitives. They’re FUBAR and guide people into all sorts of performance traps. It’s hard to claim your language is performant if the base primitives guide the ecosystem into writing slow code, and for a language whose raison d’être is performance, that’s a matter of existential import.

As an aside, collect is FUBAR too. Its current role should be satisfied by the Array constructor, and then collect could be used for collecting all sorts of other collection types.


If that happens, I’d also consider changing the Base functions to enable error handling and possibly passing memory allocators.

To find only the index, there is argmin.
Using the functions in the OP:

# 22 ns (0 allocs: 0 bytes)
argmin(filterdata(d) ? dataobjective(d) : +Inf for (_,d) in pairs(data))

Iterators.filter is also lazy. The difference is that filters don’t keep the original indices, while skipmissing in Base and skip in Skipper do. filter cannot keep the original indices by design.
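
A quick Base-only illustration of that difference, with a toy vector of my own:

julia> v = [1, 10, 2, 20];

julia> findmin(filter(>(5), v))    # index 1 points into the new, filtered vector
(10, 1)

julia> findmin(skipmissing([x > 5 ? x : missing for x in v]))    # index 2 is the original index
(10, 2)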


Regarding performance (timings below via BenchmarkTools’ @btime), the following version

# state s = (number of elements seen, best objective value so far, index of the best element)
last(reduce((s,c)->(c[1]>.5 && c[2]<s[2]) ? (s[1]+1,c[2],s[1]+1) : (s[1]+1, s[2],s[3]), data, init=(0,Inf,0)))

seems to rival even an explicit for loop:



julia> @btime last(reduce((s,c)->(c[1]>.5 && c[2]<s[2]) ? (s[1]+1,c[2],s[1]+1) : (s[1]+1, s[2],s[3]), $data, init=(0,Inf,0)))
  10.000 ns (0 allocations: 0 bytes)
3

julia> @btime argmin(i -> filterdata($data[i]) ? dataobjective($data[i]) : +Inf, axes($data,1))
  23.370 ns (0 allocations: 0 bytes)
3

julia> data = [(rand(rng), rand(rng)) for _ in 1:10^6];    

julia> @btime argmin(i -> first($data[i])>0.5 ? last($data[i]) : +Inf, axes($data,1))
  4.612 ms (0 allocations: 0 bytes)
669781

julia> @btime last(reduce((s,c)->(c[1]>.5 && c[2]<s[2]) ? (s[1]+1,c[2],s[1]+1) : (s[1]+1, s[2],s[3]), $data, init=(0,Inf,0)))
  1.784 ms (0 allocations: 0 bytes)
669781

julia> @btime argmin(i -> filterdata($data[i]) ? dataobjective($data[i]) : +Inf, axes($data,1))
  4.668 ms (0 allocations: 0 bytes)
669781

where the for-loop version is:

function ffm(arr)
    i, im, mn = 0, 0, Inf   # position, index of the best element, best objective so far
    for e in arr
        i += 1
        if e[1] > 0.5 && e[2] < mn
            im, mn = i, e[2]
        end
    end
    return im
end

julia> @btime ffm($data)
  1.804 ms (0 allocations: 0 bytes)
669781

Is there a running list of things that should be fixed in 2.0? And is there a Julia fork that would allow package devs to start porting their code over to it?

If you were designing this interface from scratch, starting with the lazy methods instead of the eager methods, would you make filter keep the original indices?