Filter of skipmissing fails

tlienart · February 27, 2019, 6:54am

MWE:

julia> x = [missing, 1, 2, -1, -2];
julia> filter(e->(e>0), skipmissing(x))
ERROR: MethodError: no method matching filter(::getfield(Main, Symbol("##23#24")), ::Base.SkipMissing{Array{Union{Missing, Int64},1}})
julia> filter(e->(e>0), collect(skipmissing(x)))
2-element Array{Int64,1}:
 1
 2

Wouldn’t it be desirable to have the first filter call work? Potentially it could amount to the second one though it’s probably not ideal as per small benchmark below, likely better can be done:

julia> λ(e) = (e>0);
julia> filter_skipmissing(λ, x) = filter(e->(!ismissing(e) && λ(e)), x)
julia> x = [(rand()<0.3) ? missing : randn() for i = 1:500_000];
julia> @btime filter($λ, collect(skipmissing($x)));
  8.071 ms (38 allocations: 8.00 MiB)
julia> @btime filter_skipmissing($λ, $x);
  5.984 ms (18 allocations: 3.38 MiB)

I guess more generally something like this could be used:

julia> filter_itr(λ, itr) = [e for e ∈ itr if λ(e)];
julia> @btime filter_itr($λ, skipmissing($x));
  5.672 ms (22 allocations: 3.00 MiB)

bennedich · February 27, 2019, 7:40am

Good question!

filter is for filtering arrays. Since skipmissing is an iterable, you can instead use Iterators.filter. An advantage of this approach is that it keeps the lazy property of the iterable. And if you do need to collect the output, call collect on the result:

julia> x = [missing, 1, 2, -1, -2];

julia> itr = Iterators.filter(e -> e > 0, skipmissing(x))
Base.Iterators.Filter{getfield(Main, Symbol("##22#23")),Base.SkipMissing{Array{Union{Missing, Int64},1}}}(getfield(Main, Symbol("##22#23"))(), Base.SkipMissing{Array{Union{Missing, Int64},1}}(Union{Missing, Int64}[missing, 1, 2, -1, -2]))

julia> collect(itr)
2-element Array{Int64,1}:
 1
 2

When collecting in the end, performance is about the same as your fastest version:

julia> λ(e) = (e>0);

julia> x = [(rand()<0.3) ? missing : randn() for i = 1:500_000];

julia> filter_itr(λ, itr) = [e for e ∈ itr if λ(e)];

julia> @btime filter_itr($λ, skipmissing($x)); # your fastest version
  6.010 ms (22 allocations: 3.00 MiB)

julia> @btime collect(Iterators.filter($λ, skipmissing($x)));
  5.959 ms (20 allocations: 3.00 MiB)

Tamas_Papp · February 27, 2019, 7:49am

I don’t think that there is anything that in principle would preclude Base.filter from working on iterables.

IMO the distinction between Iterators.* and Base.* methods should be that the latter materialize the result while the former return an iterator.

tlienart · February 27, 2019, 10:30am

Ok thanks that’s great and I see the ambiguity in trying to provide a default: should it be the iterator or should it be the materalised version.

I’m unfamiliar with Iterators so that will be motivation to look into it a bit more!

juliohm · February 27, 2019, 11:02am

As a side note, I hardly find Base.filter useful in my use cases. Is there a good example that motivates the existence of non-lazy versions in Base or it is just user-friendliness to avoid explicit calls to collect?

Tamas_Papp · February 27, 2019, 1:21pm

I think the interface accumulated historically, also there are some tricky questions (eg should filter narrow type like collect — currently it doesn’t, and it would be come type unstable).

pdeffebach · February 27, 2019, 3:59pm

This is also going to cause problems if you initially write code for a subset of data that doesn’t have missing values. Then when you realize the data actually does have missing values, you can’t just sprinkle skipmissing to all your function calls.

Edit: I posted an issue on github about this.

bennedich · February 27, 2019, 7:44pm

In addition to what Tamas mentioned, there’s also filter! which can be very useful. It’d be confusing I think if filter returned an iterable while filter! modified an array.

Topic		Replies	Views
Two questions about filtering General Usage question	3	749	February 18, 2018
Find filtered min/max index General Usage question	22	1031	June 4, 2023
Skipnothing is missing General Usage	8	2576	October 6, 2021
Improve type handling in skipmissing Internals & Design	8	605	January 19, 2020
Test on skipmissing failing but on collect(skipmissing) passing General Usage	1	159	August 22, 2022

Filter of skipmissing fails

Related topics