RE: Weighted Statistics with Missings

gaspardelanuit · December 12, 2023, 2:45pm

There was a post on this topic a few years ago.

I ran into this recently and thought it would be a good occasion to bring it up again. I have not found an idiomatic way to solve the problem, and the discussions of how to extend skipmissing to more than one variable do not seem to have gone anywhere.

I am wondering if someone more knowledgeable might be able to help out … or maybe that is the world we all have to live in.

using Statistics, StatsBase

x = [1, 2, 3, missing]
y = [3, 4, 5, missing]
mean(x, weights(y))
mean(collect(skipmissing(x)), weights(collect(skipmissing(y)))) # there has to be a better way!

x = [1, 2, missing, 3]
y = [4, 5, 6, missing]
mean(collect(skipmissing(x)), weights(collect(skipmissing(y)))) # cumbersome syntax is not robust and does not give the intended results in this case
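
To make the failure in the second example concrete, here is a minimal Base-only sketch (no StatsBase). The helper name `wmean_complete` is made up for illustration and is not part of any package. It keeps only the positions where both the value and the weight are present, then computes the weighted mean by hand.

```julia
# Hypothetical helper, not part of any package: weighted mean over the
# positions where neither the value nor the weight is missing.
function wmean_complete(x::AbstractVector, w::AbstractVector)
    length(x) == length(w) || throw(DimensionMismatch("x and w must have the same length"))
    keep = .!ismissing.(x) .& .!ismissing.(w)   # joint mask, so positions stay aligned
    xs, ws = x[keep], w[keep]
    return sum(xs .* ws) / sum(ws)
end

x = [1, 2, missing, 3]
y = [4, 5, 6, missing]

# Skipping each vector on its own pairs the value 3 with the weight 6,
# although those two entries come from different positions.
collect(skipmissing(x)), collect(skipmissing(y))   # ([1, 2, 3], [4, 5, 6])

wmean_complete(x, y)   # (1*4 + 2*5) / (4 + 5)
```

The joint mask is the whole point: whatever is dropped from one vector is dropped from the other at the same index.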

nilshg · December 12, 2023, 2:52pm

The answer is no (we just had a 300+ post mega-thread about this).

I would probably do:

using DataFrames

d = dropmissing!(DataFrame(; x, y))

mean(d.x, weights(d.y))

which covers both your use cases. DataFrames is of course a bit of a heavy dependency but I generally don’t work with missing values unless I’m doing something that requires DataFrames anyway.

gaspardelanuit · December 12, 2023, 3:01pm

Link?
I did some quick search before posting and could not find anything current.

Note that dropmissing is not ideal because it … drops the missings?
I just want to compute a mean!!

adienes · December 12, 2023, 3:01pm

good summary. my takeaway from that thread is that not much is likely to change unfortunately

nilshg · December 12, 2023, 3:05pm

It doesn’t drop the missings from x or y (you’d need DataFrame(; x, y, copycols = false) for that)

gaspardelanuit · December 12, 2023, 3:36pm

Thank you.
This is a solution. For context this all happens within a data frame combine operation.
Invoking a dataframe “view” is a better (or worse) solution than writing a function (see post at top of the linked thread in my opening).

I am actually not here to complain about missing, which is fine.
I am wondering whether there have been any developments on operating on arrays. There were discussions of skipmissing(x, y) etc.

github.com/JuliaStats/Statistics.jl

Missing values and weighting

opened 10:52AM - 27 Sep 21 UTC

nalimilan

We currently have an efficient and consistent solution to skip missing values for unweighted single-argument functions via `f(skipmissing(x))`. For multiple-argument functions like `cor` we don't have a great solution yet (https://github.com/JuliaLang/Statistics.jl/pull/34). Another case where we don't have a good solution is weighted functions, which are not currently in Statistics but should be imported from StatsBase (https://github.com/JuliaLang/Statistics.jl/issues/87).

A reasonable solution would be to use `f(skipmissing(x), weights=w)`, with a typical definition being:

```julia
function f(s::SkipMissing{<:AbstractVector}; weights::AbstractVector)
    size(s.x) == size(weights) || throw(DimensionMismatch())
    inds = find(!ismissing, s.x)
    f(view(s.x, inds), weights=view(weights, inds))
end
```

That is, we would assume that weights refer to the original vector so that we skip those corresponding to missing entries. This is admittedly a bit weird in terms of implementation as weights are not wrapped in `skipmissing`. A wrapper like `skipmissing(weighted(x, w))` (inspired by what was proposed at https://github.com/JuliaLang/julia/pull/33310) would be cleaner in that regard. But that would still be quite ad-hoc, as `skipmissing` currently only accepts collections (and `weighted` cannot be one since it's not just about multiplying weights and values), and the resulting object would basically be only used for dispatch without implementing any common methods.

The generalization to multiple-argument functions poses the same challenges as `cor`. For these, the simplest solution would be to use a `skipmissing` keyword argument, a bit like [`pairwise`](https://juliastats.org/StatsBase.jl/latest/misc/#StatsAPI.pairwise). Again, the alternative would be to use wrappers like `skipmissing(weighted(w, x, y))`.

Overall, the problem is that we have conflicting goals:

- be able to skip missing values with functions that don't have any special support for them using `f(skipmissing(x))`
- use a similar syntax for unweighted and weighted functions, e.g. `f(skipmissing(x))` vs `f(skipmissing(x), weights=w)`, or `f(skipmissing(x))` vs `f(skipmissing(weighted(x, w)))`, or `f(x, skipmissing=true)` vs `f(x, skipmissing=true, weights=w)`
- use a similar syntax for single- and multiple-argument functions, e.g. `f(skipmissing(x))` vs `f(skipmissing(x, y))`, or `f(x, skipmissing=true)` vs `f(x, y, skipmissing=true)`
- use a similar syntax for simple functions operating on vectors (like `mean`) and complex functions operating on whole tables (like `fit(MODEL, ..., data=df, weights=w)` and which skip missing values by default)
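
For readers who want to try the `f(skipmissing(x), weights=w)` pattern from the issue today, a rough sketch for a weighted mean could look like the following. The function name `wmean` is hypothetical, `findall` stands in for `find` (which no longer exists in Julia 1.x), and the code reaches into the `x` field of `Base.SkipMissing` just as the issue's definition does, which is an implementation detail rather than public API. Missing weights are not handled here.

```julia
# Hypothetical function name; the structure follows the definition quoted in the issue.
function wmean(s::Base.SkipMissing{<:AbstractVector}; weights::AbstractVector)
    size(s.x) == size(weights) || throw(DimensionMismatch("values and weights differ in size"))
    inds = findall(!ismissing, s.x)            # positions of the non-missing values
    xs, ws = view(s.x, inds), view(weights, inds)
    return sum(xs .* ws) / sum(ws)
end

x = [1, 2, missing, 3]
w = [4, 5, 6, 7]                               # weights refer to the original positions
wmean(skipmissing(x); weights = w)             # (1*4 + 2*5 + 3*7) / (4 + 5 + 7)
```

Because the weights are indexed against the original vector, the weight 6 that sits next to the missing value is dropped together with it.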

To be clear, this is not an indictment of the decision on missing propagation etc. This is more about convenience functions that I think could be integrated into the language, or at least into Statistics.

pdeffebach · December 12, 2023, 3:38pm

At the moment, skipmissings(x, y) (note the “s”) is your best bet

sx, sy = collect.(skipmissings(x, y))
mean(sx, weights(sy))

gaspardelanuit · December 12, 2023, 3:40pm

Thanks, this is great and probably the most efficient.

I guess the GitHub discussion closed the door to having something of the sort:
mean(x, y; iterate_on=skipmissings.(x,y))

But I will take the win!

gaspardelanuit · December 12, 2023, 7:59pm

For the sake of closing this I just want to note that there is a small typo in the earlier answer.
It should be skipmissings(x,y) and not skipmissings.(x,y) such that this runs fine:

x = [1, 2, 3, missing]
y = [4, 5, missing, 6]
sx, sy = collect.(skipmissings(x, y))
mean(sx, weights(sy))

CameronBieganek · December 12, 2023, 10:59pm

For the sake of clarity for future readers, the skipmissings function is from the Missings.jl package. So the complete example is

using Missings, Statistics, StatsBase

x = [1, 2, 3, missing]
y = [4, 5, missing, 6]
sx, sy = collect.(skipmissings(x, y))
mean(sx, weights(sy))

CameronBieganek · December 12, 2023, 11:04pm

Although it should also be mentioned that Missings.skipmissings has been deprecated for some reason.

$ julia --depwarn=yes

julia> using Missings

julia> sx, sy = skipmissings([1, 2, missing], [4, missing, 6]);
┌ Warning: Current design of skipmissings is deprecated. In future
│ releases this function may be redesigned or removed
│   caller = top-level scope at REPL[2]:1
└ @ Core REPL[2]:1

I’m not sure why it was deprecated. Perhaps @pdeffebach can comment.

pdeffebach · December 12, 2023, 11:42pm

The motivation for deprecating skipmissings is mostly Base devs not wanting to have cor(sx, sy) work. Base devs didn’t want to define a method for cor(iter1, iter2) since it isn’t guaranteed that the indices match if iter1 and iter2 don’t have indices (meaning they aren’t arrays).

So skipmissings wasn’t able to really solve the problem it set out to solve. At least, not without collect.

But we wanted to release a 1.0 version of Missings.jl. So we deprecated it hoping to make a better feature in the future.

Ideally, the upcoming spreadmissings will supersede skipmissings for most use-cases. Once that is merged, we can decide if we should remove skipmissings or not.

CameronBieganek · December 13, 2023, 12:44am

Do you mean that it isn’t guaranteed that iter1 and iter2 have the same length? Because cor could just branch on the Base.IteratorSize trait, and throw an error for SizeUnknown and IsInfinite.

juliohm · December 13, 2023, 2:19am

Yet another solution, this time with TableTransforms.jl:

using TableTransforms
# a named tuple of vectors is a table; x and y are the vectors from the examples above
(; x, y) |> DropMissing()

The DropMissing works like the DataFrames.jl one, but works with any Tables.jl table, and has powerful column selection syntax in case you don’t know the name of the variables in advance.

pdeffebach · December 13, 2023, 2:47pm

No, it’s not just the length. It was about whether the nth element of iter1 really can be matched with the nth element of iter2.

Additionally, the implementations of skipmissing and skipmissings don’t support length and report SizeUnknown, since all they do is iterate; they don’t actually keep track of where the missing values are.
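
The size trait can be checked directly in Base. The snippet below is only a sketch; `has_known_size` is a hypothetical guard of the kind suggested earlier in the thread (branching on `Base.IteratorSize`), not something that Statistics actually does.

```julia
x = [1, 2, missing, 3]
s = skipmissing(x)

Base.IteratorSize(s)        # Base.SizeUnknown()
Base.IteratorSize(x)        # Base.HasShape{1}()

# Hypothetical guard: accept only iterators that can report a size up front.
has_known_size(itr) = Base.IteratorSize(itr) isa Union{Base.HasLength, Base.HasShape}

has_known_size(x)           # true
has_known_size(s)           # false
```

A guard like this would reject `skipmissing` wrappers outright, which shows why the trait alone does not settle the question for this use case.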

CameronBieganek · December 13, 2023, 3:12pm

I don’t understand this argument. Is the argument that vectors are ordered and iterators are not ordered? Of course there are some types for which the order of iteration is undefined, like sets and dictionaries, but most types have a well-defined order of iteration (including vectors). It seems natural to identify the n-th element of iter1 with the n-th element of iter2. I don’t see how that’s any different from identifying the n-th element of a vector x with the n-th element of a vector y. What am I missing?

nilshg · December 13, 2023, 3:20pm

If I understand Peter correctly it’s about the problematic example in the OP:

x = [1, 2, missing, 3]
y = [4, 5, 6, missing]
mean(collect(skipmissing(x)), weights(collect(skipmissing(y)))) # cumbersome syntax is not robust and does not give the intended results in this case

Where we’d want to compute cor([1,2], [4,5]) not cor([1,2,3], [4,5,6]).

pdeffebach · December 13, 2023, 3:27pm

See discussion here

cor(skipmissing(x), skipmissing(y)) is obviously wrong. But Statistics.jl can’t distinguish between cor(skipmissings(x, y)...) and cor(skipmissing(x), skipmissing(y)). So without being able to enforce this, base devs prefer cor work on an iterator of pairs. I pointed out that skipmissings is, in general, more useful as a tuple of iterators rather than an iterator of tuples, so I would prefer to allow cor(sx, sy).

There’s also the issue that the dispatching behavior of cor is pretty complicated, and without an “iterator” type, deciding cor(::Any) or cor(::Any, ::Any) is a big deal.
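
As an illustration of the two shapes being discussed, here is a Base-only sketch (it does not use Missings.jl, and the variable names are made up). An iterator of pairs can be built by walking both vectors in lockstep and keeping only complete pairs; the "tuple of vectors" form is then obtained by unzipping.

```julia
x = [1, 2, missing, 3]
y = [4, 5, 6, missing]

# Iterator of pairs: walk both vectors in lockstep and keep complete pairs only.
complete = Iterators.filter(p -> !any(ismissing, p), zip(x, y))
pairs_kept = collect(complete)                     # [(1, 4), (2, 5)]

# "Tuple of vectors" form, obtained by unzipping the kept pairs.
sx, sy = first.(pairs_kept), last.(pairs_kept)     # ([1, 2], [4, 5])
```

With the pair form the alignment is guaranteed by construction, whereas two independently created iterators carry no such guarantee.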

CameronBieganek · December 13, 2023, 4:00pm

I’ve read that comment before, but it’s not very enlightening. (In fact, the assertion “iterators don’t in general have a strong guarantee over their ordering” is worrying. Does that mean I cannot rely on 1:2 to iterate 1 first and then 2? The docstring for UnitRange makes no mention of iteration order…)

It’s not hard to use the existing cor in an invalid way. For example,

using Random
cor(x, shuffle(y))

I’m on board with that change. Unfortunately it would require releasing version 2.0 of the Statistics standard library, which probably won’t happen anytime soon.

pdeffebach · December 13, 2023, 4:51pm

One note about iterator of pairs, though,

cor(X)

returns a 2x2 matrix when X is an Nx2 matrix. You get ones along the diagonal and the x-y correlations on the off-diagonal entries.

Returning something similar for an iterator of length-2 tuples would be very annoying, especially as a replacement for cor(x, y), which currently returns a single float value. I’m worried people would want matrix behavior when the obvious use-case (for me) is to return a scalar.
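
A small sketch of the difference, using only the Statistics standard library and made-up data:

```julia
using Statistics

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]

r = cor(x, y)        # a single Float64
C = cor([x y])       # 2×2 matrix for the 4×2 input [x y]

C[1, 2] ≈ r          # the off-diagonal entry is the x-y correlation
C[1, 1] ≈ 1          # ones on the diagonal
```

So a caller who only wants the x-y correlation would have to index into the matrix in the second form.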
