Filter of skipmissing fails

MWE:

julia> x = [missing, 1, 2, -1, -2];
julia> filter(e->(e>0), skipmissing(x))
ERROR: MethodError: no method matching filter(::getfield(Main, Symbol("##23#24")), ::Base.SkipMissing{Array{Union{Missing, Int64},1}})
julia> filter(e->(e>0), collect(skipmissing(x)))
2-element Array{Int64,1}:
 1
 2

Wouldn’t it be desirable to have the first filter call work? Potentially it could amount to the second one though it’s probably not ideal as per small benchmark below, likely better can be done:

julia> λ(e) = (e>0);
julia> filter_skipmissing(λ, x) = filter(e->(!ismissing(e) && λ(e)), x)
julia> x = [(rand()<0.3) ? missing : randn() for i = 1:500_000];
julia> @btime filter($λ, collect(skipmissing($x)));
  8.071 ms (38 allocations: 8.00 MiB)
julia> @btime filter_skipmissing($λ, $x);
  5.984 ms (18 allocations: 3.38 MiB)

I guess more generally something like this could be used:

julia> filter_itr(λ, itr) = [e for e ∈ itr if λ(e)];
julia> @btime filter_itr($λ, skipmissing($x));
  5.672 ms (22 allocations: 3.00 MiB)

Good question!

filter is for filtering arrays. Since skipmissing is an iterable, you can instead use Iterators.filter. An advantage of this approach is that it keeps the lazy property of the iterable. And if you do need to collect the output, call collect on the result:

julia> x = [missing, 1, 2, -1, -2];

julia> itr = Iterators.filter(e -> e > 0, skipmissing(x))
Base.Iterators.Filter{getfield(Main, Symbol("##22#23")),Base.SkipMissing{Array{Union{Missing, Int64},1}}}(getfield(Main, Symbol("##22#23"))(), Base.SkipMissing{Array{Union{Missing, Int64},1}}(Union{Missing, Int64}[missing, 1, 2, -1, -2]))

julia> collect(itr)
2-element Array{Int64,1}:
 1
 2

When collecting in the end, performance is about the same as your fastest version:

julia> λ(e) = (e>0);

julia> x = [(rand()<0.3) ? missing : randn() for i = 1:500_000];

julia> filter_itr(λ, itr) = [e for e ∈ itr if λ(e)];

julia> @btime filter_itr($λ, skipmissing($x)); # your fastest version
  6.010 ms (22 allocations: 3.00 MiB)

julia> @btime collect(Iterators.filter($λ, skipmissing($x)));
  5.959 ms (20 allocations: 3.00 MiB)
2 Likes

I don’t think that there is anything that in principle would preclude Base.filter from working on iterables.

IMO the distinction between Iterators.* and Base.* methods should be that the latter materialize the result while the former return an iterator.

4 Likes

Ok thanks that’s great and I see the ambiguity in trying to provide a default: should it be the iterator or should it be the materalised version.

I’m unfamiliar with Iterators so that will be motivation to look into it a bit more!

As a side note, I hardly find Base.filter useful in my use cases. Is there a good example that motivates the existence of non-lazy versions in Base or it is just user-friendliness to avoid explicit calls to collect?

I think the interface accumulated historically, also there are some tricky questions (eg should filter narrow type like collect — currently it doesn’t, and it would be come type unstable).

1 Like

This is also going to cause problems if you initially write code for a subset of data that doesn’t have missing values. Then when you realize the data actually does have missing values, you can’t just sprinkle skipmissing to all your function calls.

Edit: I posted an issue on github about this.

In addition to what Tamas mentioned, there’s also filter! which can be very useful. It’d be confusing I think if filter returned an iterable while filter! modified an array.

1 Like