How to calculate a weighted mean with missing observations

pdeffebach · January 4, 2019, 8:01pm

In my work not in Julia I often want to find weighted means when i have lots of missing data: both the vector of values contains missing values and the weights have missing values.

using StatsBase, DataFrames
df = DataFrame(
    id = 1:100,
    x = randn(100)
    x_missing = [missing; randn(99)]
    w = rand(1:4, 100)
    w_missing = [rand(99); missing]
)

I want to have mean(df.x_missing, weights(df.w_missing)) correctly drop the observations that are missing either the value or the weight.Stata does this automatically, and R’s weighted mean function has a useful na.rm option.

I can’t use skipmissing or collect(skipmissing) because each values and weights are paired by observation.

How would the future lift function make this easier? Is there a way to do this now without extra data cleaning?

stevengj · January 4, 2019, 8:48pm

Easy enough to implement, but I agree that it seems like this functionality should maybe be in StatsBase.

function weightedmean(x, w)
    wxsum = wsum = 0
    for (x,w) in zip(x,w)
        wx = w*x
        if !ismissing(wx)
            wxsum += wx
            wsum += w
        end
    end
    return wxsum / wsum
end

pdeffebach · January 4, 2019, 8:52pm

This is a good solution.

However this is close to na.rm = TRUE in R, which skipmissing has taken pains to avoid.

I wonder if there is some room for a skipmissing function like
miss_x, miss_y = skipmissing(x, y) that has the missing elements for both dropped?

stevengj · January 4, 2019, 8:58pm

That’s a good idea. Implementation could probably be as simple as:

skipmissing(itrs...) = skipmissing(any(ismissing, tuple) ? missing : tuple for tuple in zip(itrs...))

(zip-like version of skipmissing for multiple iterators? · Issue #30596 · JuliaLang/julia · GitHub)

js135005 · January 4, 2019, 10:04pm

Or at the DataFrames level, couldn’t you create a dataframe with just x and y. Then you could use completecases to process just those rows with no missings in either column.

raphbacher · January 4, 2019, 10:10pm

In DataFrame 0.16, instead of skipmissing you can use df2 = dropmissing(df, [:x, :y],disallowmissing=true) to remove all rows where x or y is missing (and convert Union{Type,Missing} to Type).
You can then apply any operation on df2 (you can also do this inplace with dropmissing! if you don’t care about missing rows).

pdeffebach · January 4, 2019, 11:49pm

True. But it seems like a general enough problem that a solution could exist outside of DataFrames.

I’m just spitballing here, but I could also imagine some macro that does this automatically

@skipmissing foo(@skip a, @skip b, c)

Where a and b get all the indices skipped where a[i] is missing or b[i] is missing.

Mattriks · January 5, 2019, 7:03am

Gadfly.jl has such a function:

x, y, w = [Vector{Union{Float64, Missing}}(randn(100)) for i in 1:3]
[z[rand(1:100,3)].=missing for z in (x,w)] # Missing values in x and w
x1, y1, w1 = Gadfly.concretize(x,y,w)

nalimilan · January 5, 2019, 9:47am

We definitely need something to handle this. Passing both vectors to skipmissing is appealing (and could be useful in other situations), but unfortunately it wouldn’t work if we made weights a keyword argument of mean, which was our plan I think.

Another solution would be to have a custom mean method for a SkipMissing first argument which would automatically skip weights corresponding to missing entries (this would only apply if the length of the weights vector is equal to the total number of values, including missing values).

pdeffebach · January 5, 2019, 3:07pm

skipmissing is so general, and we’ve done so well to avoid keyword arguments so far.

gdkrmr · January 5, 2019, 3:15pm

WeightedOnlineStats.jl can deal with missing values in both values and weights:

julia> fit!(WeightedMean(), [1, 2, missing], [1, missing, 3])
WeightedMean: ∑wᵢ=1.0 | value=1.0

tkoolen · January 5, 2019, 4:16pm

From NA-ignoring aggregations, two other considerations regarding NA-like values.

First, does it make sense to make missing special? There’s already a PR open that adds skipnothing (or skipoftype). There may still be cases where it’s preferred to skip NaN instead. Should this just be generalized to skip(somevalue, iterator)?

Second, consider the case of a matrix with missing values where you’d like to get the means of the rows. Without missing values, you’d use mean(A, dims=1). But mean(skipmissing(A), dims=1) doesn’t do what you want. Now you can’t really have SkipMissing be a shapeful iterator; really SizeUnknown() is the only sensible return type for Base.IteratorSize(::SkipMissing) even if the underlying iterator has a shape. So I see two options to fix this:

Functions like mean and sum could specifically be overloaded for Base.SkipMissing{<:AbstractArray} so that mean(skipmissing(A), dims=1) just works.
Instead of making mean(skipmissing(A), dims=1) work, you could replace it with [mean(skipmissing(x)) for x in eachslice(A, dims=1)] (with Julia 1.1).

I think the latter is preferable. I’m not sure how many functions like mean and sum have a dims keyword, but it seems like a pain to make them all be SkipMissing-aware. Taking this further, I’m of the opinion that the dims arguments for these functions represent a poor separation of concerns (and I know @andyferris shares this opinion, https://github.com/JuliaArrays/StaticArrays.jl/issues/498#issuecomment-450794143). So should they just be removed in favor of the eachslice version that can be easily made to work with skipmissing?

nalimilan · January 5, 2019, 4:26pm

See Support mapreduce over dimensions with SkipMissing by nalimilan · Pull Request #28027 · JuliaLang/julia · GitHub.

tkoolen · January 5, 2019, 4:40pm

Thanks for that link. But to be clear I think that’s not the way to go.

That PR fixes mapreduce and the functions that use it like sum and prod, but you’d have to do the same for reduce. And mean doesn’t actually use mapreduce as far as I can tell, so you’d have to fix that as well. How many others are there? On top of that, StaticArrays has to reimplement these functions with dims arguments as well, so there’s even more implementation cost and opportunity for bugs. All of this could be avoided by just not having the dims keyword arguments and using the eachslice API to separate these concerns.

Tamas_Papp · January 5, 2019, 4:46pm

Which then generalizes as skip(predicate, itr), which is then filter(!predicate, itr), which we already have (also Iterators.filter). It is kind of hard to find the sweet spot between marginally more verbose but general, and various specific functions.

nalimilan · January 5, 2019, 4:59pm

reduce calls mapreduce(identity, ...), and mean calls sum, which calls reduce. I’m not sure why StaticArrays would have to reimplement all of this, since that code should work for any AbstractArray.

The eachslice approach is also useful, but I don’t see it as a good alternative since it cannot be efficient when reducing over a non-contiguous slice (e.g. rows for a column-major matrix), contrary to the mapreduce implementation which processes entries in the most efficient order, accumulating results in parallel for all dimensions if needed. Also eachslice is significantly more verbose and the syntax is quite different from the sum(x, dims=i) one that people are encouraged to use. I think both could coexist: eachslice is more flexible, but sometimes slower.

tkoolen · January 5, 2019, 5:05pm

For performance: to unroll loops and avoid dynamic memory allocation.

tkoolen · January 5, 2019, 6:18pm

FWIW, mean calls mean!, which calls sum!. mean(x; dims), mean!, and sum! would have to be widened to SkipMissing{<:AbstractArray}. It’s not the end of the world to have to do so, but there’s a little more to be done in that branch to get all the dims kwarg functions to work properly.

Topic		Replies	Views
RE: Weighted Statistics with Missings Statistics dataframes	19	742	December 13, 2023
Statistics.mean() function with a Matrix containing missing values New to Julia	8	1170	February 6, 2023
DataFrames, aggregate with missings Data dataframes	2	560	May 4, 2020
Compute mean of array where all values could be missing New to Julia	5	393	April 21, 2021
Arithmetic operations on multi-dimensional arrays with Missings New to Julia	4	835	June 27, 2018

How to calculate a weighted mean with missing observations

Related topics