RE: Weighted Statistics with Missings

There was a post on this topic a few years ago.

I ran into this recently and thought it would be a good occasion to bring it up again. I have not found an idiomatic way to solve the problem, and the discussions of how to extend skipmissing to more than one variable do not seem to have gone anywhere.

I am wondering if someone more knowledgeable might be able to help out … or maybe that is the world we all have to live in.

using Statistics, StatsBase

x = [1, 2, 3, missing]
y = [3, 4, 5, missing]
mean(x, weights(y)) # does not work when there are missing values
mean(collect(skipmissing(x)), weights(collect(skipmissing(y)))) # there has to be a better way!

x = [1, 2, missing, 3]
y = [4, 5, 6, missing]
mean(collect(skipmissing(x)), weights(collect(skipmissing(y)))) # cumbersome syntax is not robust and does not give the intended results in this case
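To make the failure in the second example concrete, here is a minimal sketch that compares the two pairings. The helper `wmean` is a hypothetical stand-in written out by hand so the snippet needs no packages; it is assumed to compute the same quantity as `mean(v, weights(w))`.

```julia
x = [1, 2, missing, 3]
y = [4, 5, 6, missing]

# Hypothetical helper: plain weighted mean, written out by hand.
wmean(v, w) = sum(v .* w) / sum(w)

# Dropping missings from each vector separately shifts the pairing.
xs = collect(skipmissing(x))   # [1, 2, 3]
ys = collect(skipmissing(y))   # [4, 5, 6]
separate = wmean(xs, ys)       # pairs 3 with 6, which never occurred together

# Keeping only positions where both entries are present preserves the pairing.
keep = .!ismissing.(x) .& .!ismissing.(y)
pairwise = wmean(x[keep], y[keep])   # uses (1, 4) and (2, 5) only
```

The two results differ (32/15 versus 14/9), and no error is raised because both filtered vectors happen to have the same length.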

The answer is no (we just had a 300+ post mega-thread about this).

I would probably do:

using DataFrames

d = dropmissing!(DataFrame(; x, y))

mean(d.x, weights(d.y))

which covers both your use cases. DataFrames is of course a bit of a heavy dependency but I generally don’t work with missing values unless I’m doing something that requires DataFrames anyway.

I did some quick search before posting and could not find anything current.

Note that dropmissing is not ideal because it … drops the missings?
I just want to compute a mean!!

Good summary. My takeaway from that thread is that, unfortunately, not much is likely to change.

It doesn’t drop the missings from x or y (you’d need DataFrame(; x, y, copycols = false) for that).

Thank you.
This is a solution. For context this all happens within a data frame combine operation.
Invoking a dataframe “view” is a better (or worse) solution than writing a function (see post at top of the linked thread in my opening).

I am actually not here to complain about missing which is fine.
I am wondering if there have been any developments on operating on arrays. There were discussions of skipmissing(x, y) etc.

To be clear, this is not an indictment of the decisions about missing propagation etc. This is more about convenience functions that I think could be integrated into the language, or at least into Statistics.
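For illustration, the kind of convenience function I have in mind could look like the sketch below. The name `wmean_skipmissing` is hypothetical; nothing like it exists in Statistics or StatsBase as far as I know. It skips a pair whenever either the value or the weight is missing, in a single pass and without allocating intermediate vectors.

```julia
# Hypothetical helper, not part of Statistics or StatsBase.
function wmean_skipmissing(x, w)
    length(x) == length(w) || throw(DimensionMismatch("x and w must have the same length"))
    num = 0.0
    den = 0.0
    for (xi, wi) in zip(x, w)
        # Skip the whole pair if either entry is missing.
        (ismissing(xi) || ismissing(wi)) && continue
        num += xi * wi
        den += wi
    end
    return num / den
end

a = wmean_skipmissing([1, 2, 3, missing], [3, 4, 5, missing])
b = wmean_skipmissing([1, 2, missing, 3], [4, 5, 6, missing])
```

Note that this sketch returns NaN when no complete pair exists. Since it takes two plain vectors, it should also be usable inside a combine call on two columns.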

At the moment, skipmissings(x, y) (note the “s”) is your best bet:

sx, sy = collect.(skipmissings(x, y))
mean(sx, weights(sy))

Thanks, this is great and probably the most efficient.

I guess the GitHub discussion closed the door to having something of the sort:
mean(x, y; iterate_on=skipmissings.(x,y))

But I will take the win!

For the sake of closing this I just want to note that there is a small typo in the earlier answer.
It should be skipmissings(x,y) and not skipmissings.(x,y) such that this runs fine:

x = [1, 2, 3, missing]
y = [3, 4, 5, missing]
sx, sy = collect.(skipmissings(x, y))
mean(sx, weights(sy))

For the sake of clarity for future readers, the skipmissings function is from the Missings.jl package. So the complete example is

using Missings, Statistics, StatsBase

x = [1, 2, 3, missing]
y = [4, 5, missing, 6]
sx, sy = collect.(skipmissings(x, y))
mean(sx, weights(sy))

Although it should also be mentioned that Missings.skipmissings has been deprecated for some reason.

$ julia --depwarn=yes
julia> using Missings

julia> sx, sy = skipmissings([1, 2, missing], [4, missing, 6]);
┌ Warning: Current design of skipmissings is deprecated. In future
│ releases this function may be redesigned or removed
│   caller = top-level scope at REPL[2]:1
└ @ Core REPL[2]:1

I’m not sure why it was deprecated. Perhaps @pdeffebach can comment.

The motivation for deprecating skipmissings is mostly Base devs not wanting to have cor(sx, sy) work. Base devs didn’t want to define a method for cor(iter1, iter2) since it isn’t guaranteed that the indices match if iter1 and iter2 don’t have indices (meaning they aren’t arrays).

So skipmissings wasn’t able to really solve the problem it set out to solve. At least, not without collect.

But we wanted to release a 1.0 version of Missings.jl. So we deprecated it hoping to make a better feature in the future.

Ideally, the upcoming spreadmissings will supersede skipmissings for most use-cases. Once that is merged, we can decide if we should remove skipmissings or not.


Do you mean that it isn’t guaranteed that iter1 and iter2 have the same length? Because cor could just branch on the Base.IteratorSize trait, and throw an error for SizeUnknown and IsInfinite.
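A rough sketch of the branching I mean is below. The wrapper `sized_cor` is hypothetical and is not how Statistics.cor is implemented; it only shows that the trait can be queried before any elements are paired.

```julia
using Statistics

# Hypothetical wrapper, not how Statistics.cor is implemented.
function sized_cor(a, b)
    for itr in (a, b)
        sz = Base.IteratorSize(itr)
        if sz isa Base.SizeUnknown || sz isa Base.IsInfinite
            throw(ArgumentError("cannot pair elements of an iterator with $(typeof(sz))"))
        end
    end
    va, vb = collect(a), collect(b)
    length(va) == length(vb) || throw(DimensionMismatch("inputs differ in length"))
    return cor(va, vb)
end

r = sized_cor(1:4, (2i for i in 1:4))   # accepted: both iterators have a known size
# sized_cor(skipmissing([1, missing]), 1:2) would throw an ArgumentError
```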

Yet another solution, this time with TableTransforms.jl:

# named tuple is a table
(; x, w) |> DropMissing()

DropMissing works like the DataFrames.jl one, but it accepts any Tables.jl table and has powerful column selection syntax in case you don’t know the names of the variables in advance.


No, it’s not just the length. It was about whether the nth element of iter1 really can be matched with the nth element of iter2.

Additionally, the implementations of skipmissing and skipmissings don’t allow for length and give SizeUnknown, since all they do is iterate; they don’t actually keep track of where the missing values are.
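The SizeUnknown point can be checked directly for Base's skipmissing. This is a small sketch using only Base; I am assuming the iterators returned by Missings.skipmissings behave the same way, as described above.

```julia
x = [1, 2, missing, 3]
itr = skipmissing(x)

sz = Base.IteratorSize(itr)   # the size trait reported for this iterator
n = count(_ -> true, itr)     # the number of elements is only known after iterating
# length(itr) is not defined for this iterator and throws a MethodError
```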

I don’t understand this argument. Is the argument that vectors are ordered and iterators are not ordered? Of course there are some types for which the order of iteration is undefined, like sets and dictionaries, but most types have a well-defined order of iteration (including vectors). It seems natural to identify the n-th element of iter1 with the n-th element of iter2. I don’t see how that’s any different from identifying the n-th element of a vector x with the n-th element of a vector y. What am I missing?

If I understand Peter correctly it’s about the problematic example in the OP:

x = [1, 2, missing, 3]
y = [4, 5, 6, missing]
mean(collect(skipmissing(x)), weights(collect(skipmissing(y)))) # cumbersome syntax is not robust and does not give the intended results in this case

Where we’d want to compute cor([1,2], [4,5]) not cor([1,2,3], [4,5,6]).

See discussion here

cor(skipmissing(x), skipmissing(y)) is obviously wrong. But Statistics.jl can’t distinguish between cor(skipmissings(x, y)...) and cor(skipmissing(x), skipmissing(y)). So without being able to enforce this, Base devs prefer that cor work on an iterator of pairs. I pointed out that skipmissings is, in general, more useful as a tuple of iterators than as an iterator of tuples, so I would prefer to allow cor(sx, sy).
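To illustrate the two shapes being compared, here is a minimal sketch using only Base and Statistics. The variable names are made up for the example, and the generator is just one way to build an iterator of pairs; it is not the skipmissings implementation.

```julia
using Statistics

x = [1, 2, missing, 3]
y = [4, 5, 6, missing]

# Iterator of tuples: each element carries its own pairing.
pairs_itr = ((a, b) for (a, b) in zip(x, y) if !ismissing(a) && !ismissing(b))
complete = collect(pairs_itr)   # the complete pairs (1, 4) and (2, 5)

# Tuple of vectors: two aligned collections, the shape cor(sx, sy) expects.
sx, sy = first.(complete), last.(complete)
r = cor(sx, sy)
```

With the iterator of pairs the alignment cannot be broken by construction, whereas two separate iterators only stay aligned if they were filtered jointly.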

There’s also the issue that the dispatching behavior of cor is pretty complicated, and without an “iterator” type, deciding cor(::Any) or cor(::Any, ::Any) is a big deal.

I’ve read that comment before, but it’s not very enlightening. (In fact, the assertion “iterators don’t in general have a strong guarantee over their ordering” is worrying. Does that mean I cannot rely on 1:2 to iterate 1 first and then 2? The docstring for UnitRange makes no mention of iteration order…)

It’s not hard to use the existing cor in an invalid way. For example,

using Random
cor(x, shuffle(y))

I’m on board with that change. Unfortunately it would require releasing version 2.0 of the Statistics standard library, which probably won’t happen anytime soon.

One note about an iterator of pairs, though:

cor(X)

returns a 2x2 matrix when X is an Nx2 matrix. You get ones along the diagonal and the x-y correlations on the off-diagonal entries.
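A quick check of the matrix behaviour, with made-up data, assuming only the Statistics standard library:

```julia
using Statistics

X = [1.0 2.0; 2.0 1.0; 3.0 4.0; 4.0 3.0]   # N×2 matrix, columns play the role of x and y
C = cor(X)                                  # 2×2 correlation matrix
r = cor(X[:, 1], X[:, 2])                   # scalar correlation of the two columns
```

Here the off-diagonal entries of C equal the scalar r, and the diagonal entries are one.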

Returning something similar for an iterator of length-2 tuples would be very annoying, especially as a replacement for cor(x, y), which currently returns a single float value. I’m worried people would want matrix behavior when the obvious use-case (for me) is to return a scalar.