Two questions about filtering

The first question is about filtering to remove missing data. There seems to be a major (roughly 500x) performance difference between Iterators.filter(!ismissing, v) and skipmissing, in favor of the latter. I would like to understand why that is, whether it can be fixed, or whether the recommendation is simply to stick with skipmissing.

julia> v = missings(Float64, 1000000);

julia> for i in 2:2:1000000
           v[i] = 1.0
       end

julia> @benchmark collect(skipmissing($v))
  memory estimate:  5.00 MiB
  allocs estimate:  20
  minimum time:     5.857 ms (0.00% GC)
  median time:      6.155 ms (0.00% GC)
  mean time:        6.198 ms (0.66% GC)
  maximum time:     13.369 ms (53.76% GC)
  samples:          798
  evals/sample:     1

julia> @benchmark collect(Iterators.filter(!ismissing, $v))
  memory estimate:  97.17 MiB
  allocs estimate:  3999764
  minimum time:     2.963 s (0.17% GC)
  median time:      3.056 s (0.73% GC)
  mean time:        3.056 s (0.73% GC)
  maximum time:     3.150 s (1.27% GC)
  samples:          2
  evals/sample:     1

Also, the filter version returns a Vector{Union{Missing, Float64}}. I wonder whether the element type should be tightened when collecting a filtered iterator over an Array of unions, or whether it should stay the same as the original type.
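
To make the element-type question concrete, here is a minimal sketch (assuming Julia 1.3 or later, where nonmissingtype is exported from Base) that compares what each approach returns and shows one way to request the tighter type explicitly. The variable names are illustrative only, and this is not a claim about what collect should do by default.

```julia
v = Union{Missing, Float64}[missing, 1.0, missing, 2.0]

f = Iterators.filter(!ismissing, v)

# collect honors the eltype reported by the Filter iterator,
# which is the eltype of the wrapped array.
loose = collect(f)

# Ask for the narrower element type explicitly.
T = nonmissingtype(eltype(v))   # Float64
tight = collect(T, f)

# skipmissing already reports the narrower eltype on its own.
viaskip = collect(skipmissing(v))

println(typeof(loose))
println(typeof(tight))
println(typeof(viaskip))
```

Passing the element type to collect keeps the lazy filter but narrows the result, at the cost of a conversion error if a missing value ever slips through the predicate.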

The second question is about checking if there are any values in a filtered iterator (before applying some function). The naive way would be to check if it is empty and then apply the function. For example:

julia> v = Iterators.filter(i -> (println(i); isfinite(i)), [Inf, Inf, 1.5])
Base.Iterators.Filter{getfield(Main, Symbol("##17#18")),Array{Float64,1}}(getfield(Main, Symbol("##17#18"))(), [Inf, Inf, 1.5])

julia> isempty(v) || println("mean is $(mean(v))")
mean is 1.5

What I wanted to know is the following: is there a way to avoid going through the iterator twice? Could some variant of isempty give me an iterator that wouldn’t check the filtering function on the initial part of the array?

I tried playing a bit with start, next, and done, but got a bit lost. I think I want to precompute s = start(v) (which advances to the first accepted element) and check done(v, s). If that returns true, the iterator is empty and I’m done. Otherwise, I’d like to return an iterator for which the value of start is already known, so that the first part is not traversed again. Is there a way to do that, i.e. to “resume” iterating through an iterator? As a brute-force approach, one could create a custom iterator type with the same next and done as the initial iterator but accepting a custom start value, but maybe somebody has done that already.
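
A possible sketch of this "resume" idea, written against the iterate protocol that replaced start/next/done in Julia 0.7 and later. It relies on Iterators.rest and Iterators.flatten from Base; the helper name nonempty_or_nothing and the call counter are hypothetical, introduced only for illustration.

```julia
using Statistics  # provides mean on Julia 1.x

# Hypothetical helper: returns nothing if `itr` is empty. Otherwise returns
# an iterator over all elements of `itr` that starts from the saved state,
# so the prefix that was already scanned is not visited again.
function nonempty_or_nothing(itr)
    y = iterate(itr)
    y === nothing && return nothing
    x, state = y
    # Re-attach the first accepted element in front of the remainder.
    return Iterators.flatten(((x,), Iterators.rest(itr, state)))
end

# Count predicate calls instead of printing each element.
calls = Ref(0)
pred = x -> (calls[] += 1; isfinite(x))

v = Iterators.filter(pred, [Inf, Inf, 1.5])
r = nonempty_or_nothing(v)
r === nothing || println("mean is $(mean(r))")
println("predicate calls: ", calls[])
```

With this approach the predicate runs once per array element, whereas isempty(v) followed by mean(v) runs it twice on the rejected prefix and on the first accepted element. The flattened iterator is not type-stable, so this is a sketch of the idea rather than a tuned implementation.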

  1. I would argue one should always use skipmissing if possible.
  2. You can inspect the implementation here (it uses EachSkipMissing(itr)).
  3. I would say that, for most applications, filtering out missing values should still return the same container type, since it makes sense for the data to be able to hold missing values. There are some exceptions, such as a statistical model where the data is taken to be only the subset with full coverage.
  4. One cannot guarantee that no value is missing in a struct that supports it without doing a full sweep.
  5. You could have a function that takes an iterator and yields to the task in some cases, or does something else in others. That might be what you are looking for: docs.

The following confuses me a bit:

julia> collect(Iterators.filter(!ismissing, (i for i in [missing, 1, 2])))
2-element Array{Int64,1}:
 1
 2

julia> collect(Iterators.filter(!ismissing, [missing, 1, 2]))
2-element Array{Union{Missing, Int64},1}:
 1
 2

I thought that collecting an iterator should give a return type based only on the elements that are actually iterated (which in this case are all integers). Instead, different things happen depending on the container that is passed to Iterators.filter.

I used to think this as well! But iterators can specify what type they have:

Base.IteratorEltype(Iterators.filter(!ismissing, A))
eltype(Iterators.filter(!ismissing, A))
#Union{Missing, Int64}
Base.IteratorEltype(x for x in A)
Base.IteratorEltype(x for x in A if x isa Int)
Base.IteratorEltype(x::Int for x in A if x isa Int)

If an iterator claims to have an element type, then collect honors it. That is why collect(A) preserves the container's element type (eltype(A) == eltype(collect(A))), whereas collect(x for x in A) looks at the actual element types. The result can be more or less specific than the container type: in general, neither eltype(A) <: eltype(collect(x for x in A)) nor eltype(collect(x for x in A)) <: eltype(A) is guaranteed to hold.

Not sure whether the last one can be reasonably lowered to have an eltype.
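
The difference between the two kinds of iterator can be checked directly. The sketch below assumes A = [missing, 1, 2], as in the earlier example, and adds a second array B (an illustrative name) whose declared element type is wider than its contents.

```julia
A = [missing, 1, 2]                  # Vector{Union{Missing, Int}}

# A Filter forwards the eltype trait of the container it wraps.
f = Iterators.filter(!ismissing, A)
println(Base.IteratorEltype(f))      # Base.HasEltype()
println(eltype(collect(f)))          # same as eltype(A)

# A plain generator does not declare an eltype, so collect
# determines one from the values it actually encounters.
g = (x for x in A)
println(Base.IteratorEltype(g))      # Base.EltypeUnknown()

# With a loosely typed container the generator result is narrower.
B = Any[1, 2]
println(eltype(collect(B)))              # Any
println(eltype(collect(x for x in B)))   # Int
```

This is why the generator wrapped in Iterators.filter earlier in the thread collects to an array of integers, while the filter applied to the array itself keeps the union element type.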
