Improve type handling in skipmissing

Would be nice to improve skipmissing, maybe always apply collect? If the reason not to apply it is the performance, I would say - predictability and principle of least surprise is more important than performance.

serie = [1.0, missing, 1.0, 4.0]
function process(serie)
  [value for value in serie]
end

println(process(serie) == [1.0, missing, 1.0, 4.0])
# => missing - Good!

println(skipmissing(process(serie)) == skipmissing([1.0, missing, 1.0, 4.0]))
# => false - Surprise, not good.

println(collect(skipmissing(process(serie))) == collect(skipmissing([1.0, missing, 1.0, 4.0])))
# => true

Once we know that skipmissing is a lazy evaluation, I think it should not come as a surprise that we can’t know the result until it is evaluated.

1 Like

The reason for skipmissing to return an iterator instead of a collected array is not only performance but also space. Think about it, when you’re handling a large dataset (maybe a 10GB csv) and you just always collect after skipmissing, you’re going to maybe double the memory requirements (at least in the short term) because a new array for the non-missing entries has to be created, instead of using the existing memory and just… skipping missings when iterating.

1 Like
julia> (x for x in serie) == (x for x in serie)
false

It seems that all iterators are compared strictly for identity (===), which I find a bit surprising, but I suppose there was a good reason for it.

Note that you’re comparing generators in your example, not iterators. Iterators only exist as part of an implemented interface of functions, not as concrete subtypes of some abstract type. See here. Collecting a generator or an iterator to check whether they produce the same output goes against their core idea of lazy evaluation, though I suppose their fields could be checked for identity instead.

In any case, checking for identity is defined as a fallback for any comparison in case no specialized version exists. You can check for yourself via julia> @edit (x for x in serie) == (x for x in serie) and you’ll see that you will end up in the generic definition of ==(x,y) in operators.jl.

Does it? I would believe that === and IdDicts are perfectly valid ways of comparing iterators by their identity. Using identity for == might be correct, but it seems like a lost opportunity for a meaningful and distinct operation. How else should you check that the two iterators contain the same data, without a wasteful collect? all(==, iter1, iter2) doesn’t work, sadly. all(Iterators.map(==, iter1, iter2)) ?

FWIW Python3 works the same way, so I must be missing something.

Iterating over iterators to check whether they produce the same result is fine, collecting (i.e., allocating space to hold the resulting collection) to check for equality by default is not since you don’t necessarily know how large the collection is when comparing.

Also:

though I suppose their fields [of the struct used to dispatch on the iteration interface] could be checked for identity instead.

So in that sense, I think we agree :slight_smile: I do wonder if you really want to iterate over them, since IteratorSize, length and size are only optional for iterators in general and thus comparing iterators by iterating can potentially lead to very long unintended runtimes.

2 Likes

If that is the behaviour you desire and you don’t find yourself caring about performance / memory allocations, you can simply define

myskipmissing(args...) = collect(skipmissing(args...))

The reason I think the current default is better is that you can always collect an iterator, but once one has already been materialized for you, there’s no undoing those allocations.

The reason Julia doesn’t provide both is that we want to avoid cluttering the namespace of functions exported from base, so we mostly avoid providing functions that can be trivially written as the composition of two functions (I.e. collect and skipmissing).

1 Like

It’s not just performance. Iterators may be stateful (e.g., eachline) so == cannot iterate over elements when the input is (x for x in stateful_iter) or skipmissing(stateful_iter).

2 Likes