How to calculate a weighted mean with missing observations

In my work not in Julia I often want to find weighted means when i have lots of missing data: both the vector of values contains missing values and the weights have missing values.

using StatsBase, DataFrames
df = DataFrame(
    id = 1:100,
    x = randn(100)
    x_missing = [missing; randn(99)]
    w = rand(1:4, 100)
    w_missing = [rand(99); missing]
)

I want to have mean(df.x_missing, weights(df.w_missing)) correctly drop the observations that are missing either the value or the weight.Stata does this automatically, and R’s weighted mean function has a useful na.rm option.

I can’t use skipmissing or collect(skipmissing) because each values and weights are paired by observation.

How would the future lift function make this easier? Is there a way to do this now without extra data cleaning?

2 Likes

Easy enough to implement, but I agree that it seems like this functionality should maybe be in StatsBase.

function weightedmean(x, w)
    wxsum = wsum = 0
    for (x,w) in zip(x,w)
        wx = w*x
        if !ismissing(wx)
            wxsum += wx
            wsum += w
        end
    end
    return wxsum / wsum
end
3 Likes

This is a good solution.

However this is close to na.rm = TRUE in R, which skipmissing has taken pains to avoid.

I wonder if there is some room for a skipmissing function like
miss_x, miss_y = skipmissing(x, y) that has the missing elements for both dropped?

1 Like

That’s a good idea. Implementation could probably be as simple as:

skipmissing(itrs...) = skipmissing(any(ismissing, tuple) ? missing : tuple for tuple in zip(itrs...))

(zip-like version of skipmissing for multiple iterators? · Issue #30596 · JuliaLang/julia · GitHub)

3 Likes

Or at the DataFrames level, couldn’t you create a dataframe with just x and y. Then you could use completecases to process just those rows with no missings in either column.

In DataFrame 0.16, instead of skipmissing you can use df2 = dropmissing(df, [:x, :y],disallowmissing=true) to remove all rows where x or y is missing (and convert Union{Type,Missing} to Type).
You can then apply any operation on df2 (you can also do this inplace with dropmissing! if you don’t care about missing rows).

True. But it seems like a general enough problem that a solution could exist outside of DataFrames.

I’m just spitballing here, but I could also imagine some macro that does this automatically

@skipmissing foo(@skip a, @skip b, c)

Where a and b get all the indices skipped where a[i] is missing or b[i] is missing.

Gadfly.jl has such a function:

x, y, w = [Vector{Union{Float64, Missing}}(randn(100)) for i in 1:3]
[z[rand(1:100,3)].=missing for z in (x,w)] # Missing values in x and w
x1, y1, w1 = Gadfly.concretize(x,y,w)

We definitely need something to handle this. Passing both vectors to skipmissing is appealing (and could be useful in other situations), but unfortunately it wouldn’t work if we made weights a keyword argument of mean, which was our plan I think.

Another solution would be to have a custom mean method for a SkipMissing first argument which would automatically skip weights corresponding to missing entries (this would only apply if the length of the weights vector is equal to the total number of values, including missing values).

1 Like

skipmissing is so general, and we’ve done so well to avoid keyword arguments so far.

WeightedOnlineStats.jl can deal with missing values in both values and weights:

julia> fit!(WeightedMean(), [1, 2, missing], [1, missing, 3])
WeightedMean: ∑wᵢ=1.0 | value=1.0
1 Like

From NA-ignoring aggregations, two other considerations regarding NA-like values.

First, does it make sense to make missing special? There’s already a PR open that adds skipnothing (or skipoftype). There may still be cases where it’s preferred to skip NaN instead. Should this just be generalized to skip(somevalue, iterator)?

Second, consider the case of a matrix with missing values where you’d like to get the means of the rows. Without missing values, you’d use mean(A, dims=1). But mean(skipmissing(A), dims=1) doesn’t do what you want. Now you can’t really have SkipMissing be a shapeful iterator; really SizeUnknown() is the only sensible return type for Base.IteratorSize(::SkipMissing) even if the underlying iterator has a shape. So I see two options to fix this:

  1. Functions like mean and sum could specifically be overloaded for Base.SkipMissing{<:AbstractArray} so that mean(skipmissing(A), dims=1) just works.
  2. Instead of making mean(skipmissing(A), dims=1) work, you could replace it with [mean(skipmissing(x)) for x in eachslice(A, dims=1)] (with Julia 1.1).

I think the latter is preferable. I’m not sure how many functions like mean and sum have a dims keyword, but it seems like a pain to make them all be SkipMissing-aware. Taking this further, I’m of the opinion that the dims arguments for these functions represent a poor separation of concerns (and I know @andyferris shares this opinion, https://github.com/JuliaArrays/StaticArrays.jl/issues/498#issuecomment-450794143). So should they just be removed in favor of the eachslice version that can be easily made to work with skipmissing?

1 Like

See Support mapreduce over dimensions with SkipMissing by nalimilan · Pull Request #28027 · JuliaLang/julia · GitHub.

Thanks for that link. But to be clear I think that’s not the way to go.

That PR fixes mapreduce and the functions that use it like sum and prod, but you’d have to do the same for reduce. And mean doesn’t actually use mapreduce as far as I can tell, so you’d have to fix that as well. How many others are there? On top of that, StaticArrays has to reimplement these functions with dims arguments as well, so there’s even more implementation cost and opportunity for bugs. All of this could be avoided by just not having the dims keyword arguments and using the eachslice API to separate these concerns.

Which then generalizes as skip(predicate, itr), which is then filter(!predicate, itr), which we already have (also Iterators.filter). It is kind of hard to find the sweet spot between marginally more verbose but general, and various specific functions.

2 Likes

reduce calls mapreduce(identity, ...), and mean calls sum, which calls reduce. I’m not sure why StaticArrays would have to reimplement all of this, since that code should work for any AbstractArray.

The eachslice approach is also useful, but I don’t see it as a good alternative since it cannot be efficient when reducing over a non-contiguous slice (e.g. rows for a column-major matrix), contrary to the mapreduce implementation which processes entries in the most efficient order, accumulating results in parallel for all dimensions if needed. Also eachslice is significantly more verbose and the syntax is quite different from the sum(x, dims=i) one that people are encouraged to use. I think both could coexist: eachslice is more flexible, but sometimes slower.

For performance: to unroll loops and avoid dynamic memory allocation.

FWIW, mean calls mean!, which calls sum!. mean(x; dims), mean!, and sum! would have to be widened to SkipMissing{<:AbstractArray}. It’s not the end of the world to have to do so, but there’s a little more to be done in that branch to get all the dims kwarg functions to work properly.