# RE: Weighted Statistics with Missings

There was a post on this topic a few years ago.

I ran into this recently and thought it would be a good occasion to mention it again. I have not found an idiomatic way to solve the problem, and the discussions about extending `skipmissing` to more than one variable do not seem to have gone anywhere.

I am wondering if someone more knowledgeable might be able to help out … or maybe that is the world we all have to live in.

``````
using Statistics, StatsBase

x = [1, 2, 3, missing]
y = [3, 4, 5, missing]
mean(x, weights(y))
mean(collect(skipmissing(x)), weights(collect(skipmissing(y)))) # there has to be a better way!

x = [1, 2, missing, 3]
y = [4, 5, 6, missing]
mean(collect(skipmissing(x)), weights(collect(skipmissing(y)))) # cumbersome syntax is not robust and does not give the intended results in this case
``````
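To make the second case concrete, here is a minimal sketch using only Base. It assumes the intended behavior is to keep an observation only when both `x` and `y` are present. The names `independent`, `keep`, `xs`, `ws` and `wmean` are illustrative, and the weighted mean is written out by hand rather than going through `weights`.

```julia
x = [1, 2, missing, 3]
y = [4, 5, 6, missing]

# Filtering each vector on its own keeps three values per vector,
# but pairs up entries that were never observed together (3 with 6).
independent = collect(zip(skipmissing(x), skipmissing(y)))

# A joint mask keeps only the positions where both values are present.
keep = .!ismissing.(x) .& .!ismissing.(y)
xs, ws = x[keep], y[keep]

# Weighted mean written out by hand: sum(x .* w) / sum(w).
wmean = sum(xs .* ws) / sum(ws)
```

Here only the first two observations survive the joint mask, which is the pairing the `skipmissing`-per-vector version silently gets wrong.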

I would probably do:

``````
using DataFrames, Statistics, StatsBase

d = dropmissing!(DataFrame(; x, y))

mean(d.x, weights(d.y))
``````

which covers both your use cases. `DataFrames` is of course a bit of a heavy dependency but I generally don’t work with `missing` values unless I’m doing something that requires `DataFrames` anyway.

I did a quick search before posting and could not find anything current.

Note that dropmissing is not ideal because it … drops the missings?
I just want to compute a mean!!

good summary. my takeaway from that thread is that not much is likely to change unfortunately

It doesn’t drop the missings from `x` or `y` (you’d need `DataFrame(; x, y, copycols = false)` for that)
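A small sketch of that point, assuming DataFrames.jl is installed and using the second example from the opening post as data: because the constructor copies its columns by default, the original vectors are left alone.

```julia
using DataFrames

x = [1, 2, missing, 3]
y = [4, 5, 6, missing]

# The constructor copies its columns by default, so dropmissing!
# only mutates the data frame's own copies.
d = dropmissing!(DataFrame(; x, y))

length(x), length(y)  # the original vectors still hold their missings
nrow(d)               # only the complete rows remain in d
```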

Thank you.
This is a solution. For context this all happens within a data frame combine operation.
Invoking a dataframe “view” is a better (or worse) solution than writing a function (see post at top of the linked thread in my opening).

I am actually not here to complain about missing which is fine.
I am wondering if there have been any developments on operating on arrays. There were discussions of `skipmissing(x, y)` etc.

To be clear, this is not an indictment of the decision on missing propagation etc. This is more about convenience functions that I think could be integrated into the language or at least into Statistics.

At the moment, `skipmissings(x, y)` (note the “s”) is your best bet

``````
sx, sy = collect.(skipmissings(x, y))
mean(sx, weights(sy))
``````

Thanks, this is great and probably the most efficient.

I guess the GitHub discussion closed the door to having something of the sort:
`mean(x, y; iterate_on=skipmissings.(x,y))`

But I will take the win!

For the sake of closing this I just want to note that there is a small typo in the earlier answer.
It should be `skipmissings(x,y)` and not `skipmissings.(x,y)` such that this runs fine:

``````
x = [1, 2, 3, missing]
y = [4, 5, missing, 6]
sx, sy = collect.(skipmissings(x, y))
mean(sx, weights(sy))
``````

For the sake of clarity for future readers, the `skipmissings` function is from the Missings.jl package. So the complete example is

``````
using Missings, Statistics, StatsBase

x = [1, 2, 3, missing]
y = [4, 5, missing, 6]
sx, sy = collect.(skipmissings(x, y))
mean(sx, weights(sy))
``````

Although it should also be mentioned that `Missings.skipmissings` has been deprecated for some reason.

``````
$ julia --depwarn=yes
``````
``````
julia> using Missings

julia> sx, sy = skipmissings([1, 2, missing], [4, missing, 6]);
┌ Warning: Current design of skipmissings is deprecated. In future
│ releases this function may be redesigned or removed
│   caller = top-level scope at REPL[2]:1
└ @ Core REPL[2]:1
``````

I’m not sure why it was deprecated. Perhaps @pdeffebach can comment.

The motivation for deprecating `skipmissings` is mostly Base devs not wanting to have `cor(sx, sy)` work. Base devs didn’t want to define a method for `cor(iter1, iter2)` since it isn’t guaranteed that the indices match if `iter1` and `iter2` don’t have indices (meaning they aren’t arrays).

So `skipmissings` wasn’t able to really solve the problem it set out to solve. At least, not without `collect`.

But we wanted to release a 1.0 version of Missings.jl. So we deprecated it hoping to make a better feature in the future.

Ideally, the upcoming `spreadmissings` will supersede `skipmissings` for most use-cases. Once that is merged, we can decide if we should remove `skipmissings` or not.


Do you mean that it isn’t guaranteed that `iter1` and `iter2` have the same length? Because `cor` could just branch on the `Base.IteratorSize` trait, and throw an error for `SizeUnknown` and `IsInfinite`.
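A rough sketch of the kind of branching suggested here. `require_finite_length` is a hypothetical helper made up for illustration; it is not something Statistics.jl defines.

```julia
# Hypothetical helper: accept only iterators whose length is known up front.
function require_finite_length(itr)
    sz = Base.IteratorSize(itr)
    if sz isa Base.SizeUnknown || sz isa Base.IsInfinite
        throw(ArgumentError("iterator must have a known, finite length"))
    end
    return length(itr)
end

n_array = require_finite_length([1, 2, 3])        # arrays have a known shape
n_gen = require_finite_length(v for v in 1:4)     # so do generators over ranges

# Infinite iterators are rejected with an ArgumentError.
rejected = try
    require_finite_length(Iterators.repeated(1))
    false
catch err
    err isa ArgumentError
end
```

Note that this only checks that a length exists; it says nothing about whether the elements of two iterators line up.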

Yet another solution, this time with TableTransforms.jl:

``````
using TableTransforms

# a named tuple of columns is a Tables.jl table
(; x, w) |> DropMissing()
``````

`DropMissing` works like the DataFrames.jl one, but it accepts any Tables.jl table and has powerful column selection syntax in case you don’t know the names of the variables in advance.


No, it’s not just the length. It was about whether the nth element of `iter1` really can be matched with the nth element of `iter2`.

And additionally, the implementations of `skipmissing` and `skipmissings` don’t support `length` and report `SizeUnknown`, since all they do is `iterate`; they don’t actually keep track of where the `missing` values are.
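This is easy to check for Base’s `skipmissing` (the sketch below does not cover `Missings.skipmissings`); the variable names are illustrative.

```julia
x = [1, 2, missing, 3]
sx = skipmissing(x)

# The wrapper does not know its own length ahead of time.
size_trait = Base.IteratorSize(sx)   # a Base.SizeUnknown

# Counting the non-missing values requires a full pass over the data.
n = count(_ -> true, sx)
```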

I don’t understand this argument. Is the argument that vectors are ordered and iterators are not ordered? Of course there are some types for which the order of iteration is undefined, like sets and dictionaries, but most types have a well-defined order of iteration (including vectors). It seems natural to identify the n-th element of `iter1` with the n-th element of `iter2`. I don’t see how that’s any different from identifying the n-th element of a vector `x` with the n-th element of a vector `y`. What am I missing?

If I understand Peter correctly it’s about the problematic example in the OP:

``````
x = [1, 2, missing, 3]
y = [4, 5, 6, missing]
mean(collect(skipmissing(x)), weights(collect(skipmissing(y)))) # cumbersome syntax is not robust and does not give the intended results in this case
``````

Where we’d want to compute `cor([1,2], [4,5])` not `cor([1,2,3], [4,5,6])`.

See discussion here

`cor(skipmissing(x), skipmissing(y))` is obviously wrong. But Statistics.jl can’t distinguish between `cor(skipmissings(x, y)...)` and `cor(skipmissing(x), skipmissing(y))`. So without being able to enforce this, Base devs prefer that `cor` work on an iterator of pairs. I pointed out that `skipmissings` is, in general, more useful as a tuple of iterators rather than an iterator of tuples, so I would prefer to allow `cor(sx, sy)`.
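To illustrate the two shapes being contrasted, here is a minimal sketch with made-up data and only Base plus Statistics. It is not how `skipmissings` is implemented; `complete`, `sx` and `sy` are illustrative names.

```julia
using Statistics

x = [1, 2, missing, 3, 5]
y = [4, 5, 6, missing, 9]

# Iterator-of-tuples view: each element is one complete (x, y) observation.
complete = [(a, b) for (a, b) in zip(x, y) if !ismissing(a) && !ismissing(b)]

# Tuple-of-vectors view: split the pairs back into two aligned vectors.
sx, sy = first.(complete), last.(complete)

r = cor(sx, sy)
```

In the first form the pairing is guaranteed by construction; in the second, `cor` has to trust that `sx` and `sy` were filtered together.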

There’s also the issue that the dispatching behavior of `cor` is pretty complicated, and without an “iterator” type, deciding `cor(::Any)` or `cor(::Any, ::Any)` is a big deal.

I’ve read that comment before, but it’s not very enlightening. (In fact, the assertion “iterators don’t in general have a strong guarantee over their ordering” is worrying. Does that mean I cannot rely on `1:2` to iterate `1` first and then `2`? The docstring for `UnitRange` makes no mention of iteration order…)

It’s not hard to use the existing `cor` in an invalid way. For example,

``````
using Random, Statistics
cor(x, shuffle(y))
``````

I’m on board with that change. Unfortunately it would require releasing version 2.0 of the Statistics standard library, which probably won’t happen anytime soon.

One note about iterator of pairs, though,

``````
cor(X)
``````

returns a 2x2 matrix when `X` is an `Nx2` matrix. You get ones along the diagonal and the `x-y` correlations on the off-diagonal entries.
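A quick illustration of the difference in return types, using an arbitrary 4×2 matrix:

```julia
using Statistics

X = [1.0 2.0; 2.0 1.0; 3.0 4.0; 4.0 3.0]   # 4×2 matrix, one column per variable

C = cor(X)                    # 2×2 correlation matrix
r = cor(X[:, 1], X[:, 2])     # a single Float64
```

The off-diagonal entries of `C` hold the same value as `r`.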

Returning something similar for an iterator of length-2 tuples would be very annoying, especially as a replacement for `cor(x, y)`, which currently returns a single float value. I’m worried people would want matrix behavior when the obvious use-case (for me) is to return a scalar.