Why are missing values not ignored by default?

That’s an interesting thought. I’m not sure whether using ? as an identifier is viable, but from a design perspective I’m much more a fan of adding a function than adding new syntax (leaving aside my appreciation of predicate? functions). I find that a strong motivation for exploring ? as an identifier rather than as new syntax, if possible.

We can allow ? as a general-purpose postfix operator; it doesn’t have to be only for missings. Then Missings.jl would just be one package making use of it.

It could look like this:

  1. Allow ? as a postfix operator like ' (but without any method definition in Base, initially at least)
  2. In Missings.jl define ? as a shorthand for skipmissing (this could also go in a new ShortMissings.jl package or whatever)
  3. Fix mean to work well with skipmissing (cf this comment)
  4. Add overloads to cor, etc. to properly support things like cor(skipmissing(a), skipmissing(b))
  5. Add support for skipmissing-boolean indexing in DataFrames.jl

At this point we can do things like this:

mean(x?)
cor(x?, v?)   # same behavior as polars or pandas

df[(df.x .> 0)?, :]
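
For comparison, here is roughly how the same things are spelled in current Julia without any new syntax. This is only a sketch using Base and the Statistics standard library. The plain vector x stands in for a DataFrame column, and the coalesce-based mask is my assumption about what skipmissing-boolean indexing would mean (treat missing as false).

```julia
using Statistics

x = [1, -2, missing, 3]

# Longhand for the proposed mean(x?)
m = mean(skipmissing(x))

# Longhand for the proposed x[(x .> 0)?]:
# x .> 0 contains a missing, so turn it into false before indexing
mask = coalesce.(x .> 0, false)
positives = x[mask]
```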

But let’s go further:

  1. In Missings.jl define ?(f::Function) to make wrappers that skip missing values:

    cor?(x,v) would mean MissingSkipper(cor)(x,v) which would eventually call cor(x?, v?)

It’s a nice shortcut but especially useful for cases like this:

combine(gdf, :value => mean?)

Here’s a working prototype with 'ˢ instead of ?:

struct MissingSkipper{T}
    f::T
end

var"'ˢ"(x::AbstractArray) = skipmissing(x)
var"'ˢ"(f::Base.Callable) = MissingSkipper(f)
(s::MissingSkipper)(args...; kwargs...) =
    (s.f)(skipmissing.(args)...; kwargs...)

x = [1, 2, missing, 3]

julia> mean(x'ˢ)
2.0

julia> mean'ˢ(x)
2.0

# would work if cor(x::SkipMissing, y::SkipMissing) was defined:
julia> cor'ˢ(x, x)

I see.

Is it possible that Statistics implements either (both) of the naive algorithms, but not something more numerically stable, such as

function avg(x)
    a = zero(eltype(x))
    for i in eachindex(x)
        a += (x[i] - a) / i
    end
    return a
end 

?

julia> x = rand(Float16, 10^6);

julia> mean(x)
NaN

julia> avg(x)
4.92920e-01


I think the f?(x) looks pretty clean and is certainly easier to type, but I do have some reservations about adding a third meaning to ?. Imagine trying to teach someone Julia and saying, if you type a question mark like this

julia> ?f

it means get help for f. If you type it like this

d = a ? b : c

it means return b if a is true. But if you type it like this

julia> f?(x)

it means skip missing values when applying f.

Perhaps this isn’t too hard to reason about, but I could imagine someone constructing a pathological case where they are trying to skip missing values in a ternary operator and have ? all over the place.


Yes - you can see the implementation in the source code of Statistics.jl.

Thanks, the links from the documentation are broken!

One does wonder why mean(itr) does not use the numerically stable streaming algorithm. (Your implementation can be generalized to work for non-indexable iterators.) It seems to work quite nicely:

julia> x = rand(Float16, 10^6);

julia> mean(x)
NaN16

julia> avg(x)
Float16(0.498)
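
A small illustration of why the running update helps with Float16, whose largest finite value is 65504. The helper name running_mean is made up for this sketch; it applies the same update rule as avg above. The intermediate sum overflows even though every element, and therefore the mean, is representable.

```julia
# Same update rule as avg above, under a hypothetical name
function running_mean(x)
    a = zero(eltype(x))
    for (i, v) in enumerate(x)
        a += (v - a) / i
    end
    return a
end

x = fill(Float16(60000), 4)

# Adding just two elements already exceeds floatmax(Float16)
overflowed = x[1] + x[2]

# The running mean never forms the large sum
stable = running_mean(x)
```

One caveat, as far as I can tell: with 10^6 elements the divisor i is itself converted to Float16 and becomes Inf16 once it passes the Float16 range, so later elements contribute nothing. The Float16 result of avg is finite, but I would not read too much into its exact value.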

With a more sane definition of the mean of an iterator,

function avg(itr)
    a = zero(eltype(itr))
    for (i, x) in enumerate(itr)
        a += (x - a) / i
    end
    return a
end

we get

julia> let x = rand(Float32, 100_000_000), sx = skipmissing(x)
           mean(x), mean(sx), avg(sx)
       end
(0.4999912f0, 0.16777216f0, 0.4999802f0)
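
One caveat with the definition above: zero(eltype(itr)) only works when the iterator reports a usable element type, and a generator reports Any. A possible variant, sketched here under the made-up name streaming_mean, seeds the accumulator from the first element instead. Throwing on an empty iterator is my choice for the sketch, not something taken from Statistics.

```julia
# Hypothetical variant of avg that does not depend on eltype(itr)
function streaming_mean(itr)
    next = iterate(itr)
    next === nothing && throw(ArgumentError("iterator must be non-empty"))
    x, state = next
    a = x / 1          # promote the first element to a type that supports division
    n = 1
    while true
        next = iterate(itr, state)
        next === nothing && break
        x, state = next
        n += 1
        a += (x - a) / n
    end
    return a
end

from_generator = streaming_mean(v for v in 1:4)
from_skipmissing = streaming_mean(skipmissing([1, 2, missing, 3]))
```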

With the MissingSkipper wrapper proposed above, the behavior of adding skipmissing to every argument would just be a fallback. The wrapper function can be overloaded to have the proper behavior for each function:

(s::MissingSkipper{typeof(quantile)})(x, p; kwargs...) =
    quantile(skipmissing(x), p; kwargs...)

Then quantile?(x, p) would work as expected.
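
To make that concrete without the ? syntax, here is a self-contained sketch that calls the wrapper by name. It repeats the MissingSkipper definition from the prototype above and needs only the Statistics standard library.

```julia
using Statistics

struct MissingSkipper{T}
    f::T
end

# Fallback: wrap every positional argument in skipmissing
(s::MissingSkipper)(args...; kwargs...) =
    (s.f)(skipmissing.(args)...; kwargs...)

# Specialization: only the data argument is wrapped, p is passed through
(s::MissingSkipper{typeof(quantile)})(x, p; kwargs...) =
    quantile(skipmissing(x), p; kwargs...)

x = [1, 2, missing, 3]

med = MissingSkipper(quantile)(x, 0.5)   # uses the specialization
avg_x = MissingSkipper(mean)(x)          # uses the fallback
```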

But if ? supports both vectors and functions as proposed then we could also just write

quantile(xs?, [0.1, 0.2])
reduce(+, xs?)

We should add a mean(itr::SkipMissing) method which calls sum(x -> coalesce(x, zero(x)), itr.x) so that pairwise summation is used. The actual definition will be a little trickier as mean accumulates values in the return type to avoid overflows (e.g. Float64 when input values are Int8).

EDIT: The problem is that, unlike with arrays, we don’t know the number of elements in advance, so we can’t just call sum. As with other iterators, we need to count elements and accumulate them at the same time.
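
A minimal sketch of such a single pass, under a made-up name. It accumulates in Float64 as a simplifying assumption and sums left to right, so it is not pairwise summation and not what Statistics actually does; it only shows counting and accumulating together.

```julia
# Hypothetical helper: count and accumulate in one pass, skipping missing
function mean_skipping(itr)
    total = 0.0    # wide accumulator, so Int8 inputs cannot overflow
    n = 0
    for v in itr
        v === missing && continue
        total += v
        n += 1
    end
    return total / n   # NaN when nothing was counted
end

small_ints = mean_skipping(Int8[100, 100, 100])
with_missing = mean_skipping([1, missing, 3])
```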


But the underlying issue is not SkipMissing, it’s the implementation of mean for arbitrary iterators. It affects other iterators besides just SkipMissing:

julia> A = rand(Float32, 100_000_000);

julia> itr = (rand(Float32) for _ in 1:100_000_000);

julia> mean(A), mean(itr)
(0.5000529f0, 0.16777216f0)
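
The 0.16777216f0 does not look like a coincidence, as far as I can tell: 16777216 is 2^24, the point where the spacing between adjacent Float32 values becomes 2, so a left-to-right sum of values below 1 stops growing there. Dividing the stalled sum by the 10^8 elements gives the number above. A quick check:

```julia
stuck = Float32(2)^24

# Adding any value below 1 no longer changes the accumulator
still_stuck = stuck + 0.99f0

# The stalled sum divided by the element count
reported = stuck / 100_000_000
```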

Why can’t we have a streaming mean implementation for arbitrary iterators that has better numerical accuracy? I’ve opened a GitHub issue about it:

You don’t want cor(x?, y?) to expand to cor(skipmissing(x), skipmissing(y)), but to cor(skipmissings(x, y)...) where skipmissings is the function defined in Missings.jl. More generally, f(x?, y, z?) needs to expand to something like

_x, _z = skipmissings(x, z)
f(_x, y, _z)

The point is that the set of elements to skip is a joint property of all the iterators participating in the skipping, so you can’t simply apply skipmissing to each iterator individually. I’m not sure you can accomplish this with a postfix operator alone, but probably with a macro given parser support.
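
A small illustration of why the skipping has to be joint, using only Base. The mask below is a hand-written stand-in for what skipmissings does, not its implementation. Skipping each vector on its own happens to give equal lengths here, but the pairs no longer line up.

```julia
x = [1.0, missing, 3.0, 4.0, 5.0]
y = [2.0, 5.0, missing, 8.0, 10.0]

# Individually skipped: same length, but 3.0 is now paired with 5.0
xi = collect(skipmissing(x))
yi = collect(skipmissing(y))

# Jointly skipped: keep only positions where both are present
keep = .!ismissing.(x) .& .!ismissing.(y)
xj = x[keep]
yj = y[keep]
```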


What you propose is one way to do it, but it’s hard-coding the skip behavior before the call to cor which is less flexible than my version.

If we expand cor(x?, y?) to cor(skipmissing(x), skipmissing(y)), then we can overload cor(x::SkipMissing, y::SkipMissing) to do the right thing: the equivalent of skipmissings as you say, or maybe something else if so instructed by a keyword argument.

Note that skipmissing(x) is not doing anything except putting a wrapper around x. So you can do it to individual arguments, send that to the function and leave the function to deal with it properly. I think in some cases skipmissings is not the right thing to do, so it’s good to have the flexibility of dealing with each argument as appropriate in each particular function.

Why not? It even works for functions like quantile(x, p) where only one argument should get skipmissing, as described here.

Edit: mmm maybe what I say works technically but violates the semantics of skipmissing? Then it could be solved by using a different wrapper type?

There’s a certain sort of analogy here with broadcasting. We could’ve similarly chosen to have each argument opt into a broadcast with something like f(.x, y, .z) instead of f.(x, Ref(y), z). Or we could’ve gone even more explicit and had a way of tagging which dimensions should be “extruded” (or repeated) out. While there are things to be desired with the status quo, it does cover most of the typical use-cases.

In other words, it could make sense to do the “all arguments” thing and maybe have a simple way of “excepting” some arguments from consideration. The goal here — like broadcasting — would be a simple shorthand that gets most folks most of the way there for most usages.
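
For readers less familiar with the broadcasting side of the analogy: today every argument is broadcast by default, and Ref is the way to opt one argument out. A tiny example with Base functions only (the variable names are just for illustration):

```julia
haystack = [1, 2, 3]
needles = [1, 5]

# needles is broadcast elementwise; Ref(haystack) is treated as a single value
found = in.(needles, Ref(haystack))
```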


I suppose what you could do without macros is make the MissingSkipper function wrapper a flag to change the semantics of SkipMissing to act jointly across arguments rather than individually. In that case, you would probably not be interested in cor(x?, y?), but rather cor?(x?, y?). That’s a lot of question marks, but it could work.

But @mbauman makes a good point with the comparison to broadcasting and the argument for keeping it simple. If f?(args...) means f(skipmissings(args...)...), you could always write

myquantiles(x) = quantile(x, [0.1, 0.5, 0.9])
myquantiles?(y)

I’d rather change the documentation of skipmissing(itr) from

Return an iterator over the elements in itr skipping missing values. The returned object can be indexed using indices of itr if the latter is indexable. […]

to

Return a SkipMissing wrapper for itr. The result is an iterator that skips missing values. Functions can dispatch on the SkipMissing type to implement appropriate behaviors and algorithms for handling missing values. The returned object can be indexed using indices of itr if the latter is indexable. […]

Then we can use SkipMissing (the already existing wrapper used by skipmissing) as a flag, and both cor?(x, y) and cor(x?, y?) would make perfect sense.
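
A rough sketch of what dispatching on the wrapper could look like. jointcor is a hypothetical function, not a method that Statistics defines, and reading the wrapped collection through the .x field relies on an internal detail of Base.SkipMissing, so treat this as an illustration only.

```julia
using Statistics

# Hypothetical function; Statistics does not define this
function jointcor(sx::Base.SkipMissing, sy::Base.SkipMissing)
    # .x is the wrapped collection (internal field, used here for illustration)
    x, y = sx.x, sy.x
    # Drop positions where either side is missing
    keep = .!ismissing.(x) .& .!ismissing.(y)
    return cor(Float64.(x[keep]), Float64.(y[keep]))
end

x = [1.0, missing, 3.0, 4.0, 6.0]
y = [2.0, 5.0, missing, 7.0, 13.0]

r = jointcor(skipmissing(x), skipmissing(y))
```

Here the wrappers only mark which arguments want skipping; the function itself decides that the skipping should be joint.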

A 3rd useful option is skipping an iteration if missing is found in only some of the iterables, not all.

No, I don’t think this would be useful, because we want fun?(x, y) to work even if the person who wrote fun didn’t ever think about missing. If ? is included, then Julia performs some reasonable pre-processing to make it work.


Isn’t that exactly what skipmissings does?