In my work not in Julia I often want to find weighted means when i have lots of missing data: both the vector of values contains missing values and the weights have missing values.
using StatsBase, DataFrames
df = DataFrame(
id = 1:100,
x = randn(100)
x_missing = [missing; randn(99)]
w = rand(1:4, 100)
w_missing = [rand(99); missing]
)
I want to have mean(df.x_missing, weights(df.w_missing)) correctly drop the observations that are missing either the value or the weight.Stata does this automatically, and R’s weighted mean function has a useful na.rm option.
I can’t use skipmissing or collect(skipmissing) because each values and weights are paired by observation.
How would the future lift function make this easier? Is there a way to do this now without extra data cleaning?
Or at the DataFrames level, couldn’t you create a dataframe with just x and y. Then you could use completecases to process just those rows with no missings in either column.
In DataFrame 0.16, instead of skipmissing you can use df2 = dropmissing(df, [:x, :y],disallowmissing=true) to remove all rows where x or y is missing (and convert Union{Type,Missing} to Type).
You can then apply any operation on df2 (you can also do this inplace with dropmissing! if you don’t care about missing rows).
x, y, w = [Vector{Union{Float64, Missing}}(randn(100)) for i in 1:3]
[z[rand(1:100,3)].=missing for z in (x,w)] # Missing values in x and w
x1, y1, w1 = Gadfly.concretize(x,y,w)
We definitely need something to handle this. Passing both vectors to skipmissing is appealing (and could be useful in other situations), but unfortunately it wouldn’t work if we made weights a keyword argument of mean, which was our plan I think.
Another solution would be to have a custom mean method for a SkipMissing first argument which would automatically skip weights corresponding to missing entries (this would only apply if the length of the weights vector is equal to the total number of values, including missing values).
First, does it make sense to make missing special? There’s already a PR open that adds skipnothing (or skipoftype). There may still be cases where it’s preferred to skip NaN instead. Should this just be generalized to skip(somevalue, iterator)?
Second, consider the case of a matrix with missing values where you’d like to get the means of the rows. Without missing values, you’d use mean(A, dims=1). But mean(skipmissing(A), dims=1) doesn’t do what you want. Now you can’t really have SkipMissing be a shapeful iterator; really SizeUnknown() is the only sensible return type for Base.IteratorSize(::SkipMissing) even if the underlying iterator has a shape. So I see two options to fix this:
Functions like mean and sum could specifically be overloaded for Base.SkipMissing{<:AbstractArray} so that mean(skipmissing(A), dims=1) just works.
Instead of making mean(skipmissing(A), dims=1) work, you could replace it with [mean(skipmissing(x)) for x in eachslice(A, dims=1)] (with Julia 1.1).
I think the latter is preferable. I’m not sure how many functions like mean and sum have a dims keyword, but it seems like a pain to make them all be SkipMissing-aware. Taking this further, I’m of the opinion that the dims arguments for these functions represent a poor separation of concerns (and I know @andyferris shares this opinion, https://github.com/JuliaArrays/StaticArrays.jl/issues/498#issuecomment-450794143). So should they just be removed in favor of the eachslice version that can be easily made to work with skipmissing?
Thanks for that link. But to be clear I think that’s not the way to go.
That PR fixes mapreduce and the functions that use it like sum and prod, but you’d have to do the same for reduce. And mean doesn’t actually use mapreduce as far as I can tell, so you’d have to fix that as well. How many others are there? On top of that, StaticArrays has to reimplement these functions with dims arguments as well, so there’s even more implementation cost and opportunity for bugs. All of this could be avoided by just not having the dims keyword arguments and using the eachslice API to separate these concerns.
Which then generalizes as skip(predicate, itr), which is then filter(!predicate, itr), which we already have (also Iterators.filter). It is kind of hard to find the sweet spot between marginally more verbose but general, and various specific functions.
reduce calls mapreduce(identity, ...), and mean calls sum, which calls reduce. I’m not sure why StaticArrays would have to reimplement all of this, since that code should work for any AbstractArray.
The eachslice approach is also useful, but I don’t see it as a good alternative since it cannot be efficient when reducing over a non-contiguous slice (e.g. rows for a column-major matrix), contrary to the mapreduce implementation which processes entries in the most efficient order, accumulating results in parallel for all dimensions if needed. Also eachslice is significantly more verbose and the syntax is quite different from the sum(x, dims=i) one that people are encouraged to use. I think both could coexist: eachslice is more flexible, but sometimes slower.
FWIW, mean calls mean!, which calls sum!. mean(x; dims), mean!, and sum! would have to be widened to SkipMissing{<:AbstractArray}. It’s not the end of the world to have to do so, but there’s a little more to be done in that branch to get all the dims kwarg functions to work properly.