# How to calculate a weighted mean with missing observations

In my work not in Julia I often want to find weighted means when i have lots of missing data: both the vector of values contains missing values and the weights have missing values.

``````using StatsBase, DataFrames
df = DataFrame(
id = 1:100,
x = randn(100)
x_missing = [missing; randn(99)]
w = rand(1:4, 100)
w_missing = [rand(99); missing]
)
``````

I want to have `mean(df.x_missing, weights(df.w_missing))` correctly drop the observations that are missing either the value or the weight.Stata does this automatically, and R’s weighted mean function has a useful `na.rm` option.

I can’t use `skipmissing` or `collect(skipmissing)` because each values and weights are paired by observation.

How would the future `lift` function make this easier? Is there a way to do this now without extra data cleaning?

2 Likes

Easy enough to implement, but I agree that it seems like this functionality should maybe be in StatsBase.

``````function weightedmean(x, w)
wxsum = wsum = 0
for (x,w) in zip(x,w)
wx = w*x
if !ismissing(wx)
wxsum += wx
wsum += w
end
end
return wxsum / wsum
end
``````
3 Likes

This is a good solution.

However this is close to `na.rm = TRUE` in R, which `skipmissing` has taken pains to avoid.

I wonder if there is some room for a `skipmissing` function like
`miss_x, miss_y = skipmissing(x, y)` that has the missing elements for both dropped?

1 Like

That’s a good idea. Implementation could probably be as simple as:

``````skipmissing(itrs...) = skipmissing(any(ismissing, tuple) ? missing : tuple for tuple in zip(itrs...))
``````
3 Likes

Or at the DataFrames level, couldn’t you create a dataframe with just x and y. Then you could use completecases to process just those rows with no missings in either column.

In DataFrame 0.16, instead of skipmissing you can use df2 = dropmissing(df, [:x, :y],disallowmissing=true) to remove all rows where x or y is missing (and convert Union{Type,Missing} to Type).
You can then apply any operation on df2 (you can also do this inplace with dropmissing! if you don’t care about missing rows).

True. But it seems like a general enough problem that a solution could exist outside of DataFrames.

I’m just spitballing here, but I could also imagine some macro that does this automatically

``````@skipmissing foo(@skip a, @skip b, c)
``````

Where `a` and `b` get all the indices skipped where `a[i]` is missing or `b[i]` is missing.

``````x, y, w = [Vector{Union{Float64, Missing}}(randn(100)) for i in 1:3]
[z[rand(1:100,3)].=missing for z in (x,w)] # Missing values in x and w

``````

We definitely need something to handle this. Passing both vectors to `skipmissing` is appealing (and could be useful in other situations), but unfortunately it wouldn’t work if we made weights a keyword argument of `mean`, which was our plan I think.

Another solution would be to have a custom `mean` method for a `SkipMissing` first argument which would automatically skip weights corresponding to missing entries (this would only apply if the length of the weights vector is equal to the total number of values, including missing values).

1 Like

`skipmissing` is so general, and we’ve done so well to avoid keyword arguments so far.

`WeightedOnlineStats.jl` can deal with missing values in both values and weights:

``````julia> fit!(WeightedMean(), [1, 2, missing], [1, missing, 3])
WeightedMean: ∑wᵢ=1.0 | value=1.0
``````
1 Like

From NA-ignoring aggregations, two other considerations regarding NA-like values.

First, does it make sense to make `missing` special? There’s already a PR open that adds `skipnothing` (or `skipoftype`). There may still be cases where it’s preferred to skip `NaN` instead. Should this just be generalized to `skip(somevalue, iterator)`?

Second, consider the case of a matrix with missing values where you’d like to get the means of the rows. Without `missing` values, you’d use `mean(A, dims=1)`. But `mean(skipmissing(A), dims=1)` doesn’t do what you want. Now you can’t really have `SkipMissing` be a shapeful iterator; really `SizeUnknown()` is the only sensible return type for `Base.IteratorSize(::SkipMissing)` even if the underlying iterator has a shape. So I see two options to fix this:

1. Functions like `mean` and `sum` could specifically be overloaded for `Base.SkipMissing{<:AbstractArray}` so that `mean(skipmissing(A), dims=1)` just works.
2. Instead of making `mean(skipmissing(A), dims=1)` work, you could replace it with `[mean(skipmissing(x)) for x in eachslice(A, dims=1)]` (with Julia 1.1).

I think the latter is preferable. I’m not sure how many functions like `mean` and `sum` have a `dims` keyword, but it seems like a pain to make them all be `SkipMissing`-aware. Taking this further, I’m of the opinion that the `dims` arguments for these functions represent a poor separation of concerns (and I know @andyferris shares this opinion, https://github.com/JuliaArrays/StaticArrays.jl/issues/498#issuecomment-450794143). So should they just be removed in favor of the `eachslice` version that can be easily made to work with `skipmissing`?

1 Like

Thanks for that link. But to be clear I think that’s not the way to go.

That PR fixes `mapreduce` and the functions that use it like `sum` and `prod`, but you’d have to do the same for `reduce`. And `mean` doesn’t actually use `mapreduce` as far as I can tell, so you’d have to fix that as well. How many others are there? On top of that, StaticArrays has to reimplement these functions with `dims` arguments as well, so there’s even more implementation cost and opportunity for bugs. All of this could be avoided by just not having the `dims` keyword arguments and using the `eachslice` API to separate these concerns.

Which then generalizes as `skip(predicate, itr)`, which is then `filter(!predicate, itr)`, which we already have (also `Iterators.filter`). It is kind of hard to find the sweet spot between marginally more verbose but general, and various specific functions.

2 Likes

`reduce` calls `mapreduce(identity, ...)`, and `mean` calls `sum`, which calls `reduce`. I’m not sure why StaticArrays would have to reimplement all of this, since that code should work for any `AbstractArray`.

The `eachslice` approach is also useful, but I don’t see it as a good alternative since it cannot be efficient when reducing over a non-contiguous slice (e.g. rows for a column-major matrix), contrary to the `mapreduce` implementation which processes entries in the most efficient order, accumulating results in parallel for all dimensions if needed. Also `eachslice` is significantly more verbose and the syntax is quite different from the `sum(x, dims=i)` one that people are encouraged to use. I think both could coexist: `eachslice` is more flexible, but sometimes slower.

For performance: to unroll loops and avoid dynamic memory allocation.

FWIW, `mean` calls `mean!`, which calls `sum!`. `mean(x; dims)`, `mean!`, and `sum!` would have to be widened to `SkipMissing{<:AbstractArray}`. It’s not the end of the world to have to do so, but there’s a little more to be done in that branch to get all the `dims` kwarg functions to work properly.