DataFrames, aggregate with missings

Hi!

I’m facing the following problem:

using DataFrames, Statistics

df = DataFrame(A = [1, 2, missing, missing, missing, 3, 4, 5],
               B = [1, 1, 2, 2, 3, 3, 4, 4])

df_mean = aggregate(df, :B, x -> mean(skipmissing(x)))

With skipmissing, the mean function returns NaN for any group in which all the observations are missing. Is there a way to change this behaviour so that it returns missing instead?

Thank you!

The reasoning for this behavior is that skipmissing applied to a vector like Union{Int, Missing}[missing, missing] should behave much like an empty Int vector, Int[]. The mean is essentially mean(x) = sum(x) / length(x), so for an empty vector it works out to 0 / 0, and mean(Int[]) returns NaN.
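
To make that concrete, here is a small sketch you can paste into the REPL. The variable name x is just for illustration; the vector is typed as Union{Int, Missing} because that is what a column like A holds, so skipmissing still knows the remaining element type is Int.

```julia
using Statistics

# A column with missings has element type Union{Int, Missing}.
x = Union{Int, Missing}[missing, missing]

# skipmissing drops every element here, leaving an empty iterator of Int.
isempty(skipmissing(x))      # true

# Both means reduce to 0 / 0 in floating point, which is NaN.
mean(Int[])                  # NaN
mean(skipmissing(x))         # NaN
```

So from mean's point of view there is nothing to tell the two cases apart: it only ever sees an empty collection of Int.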

Are you coming from Stata, by chance? Julia’s behavior mimics R’s, but Stata propagates missing the way you expect it to.

The best approach would be to make a little helper function:

meanmissing(x) = all(ismissing, x) ? missing : mean(skipmissing(x))
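
A quick check of the helper on plain vectors, as a sketch; the inputs are made up for illustration and only Statistics is needed:

```julia
using Statistics

# Helper from this reply: missing if every value is missing,
# otherwise the mean of the non-missing values.
meanmissing(x) = all(ismissing, x) ? missing : mean(skipmissing(x))

meanmissing([1, 2])                 # 1.5
meanmissing([missing, missing])     # missing
meanmissing([missing, 3])           # 3.0
```

You can pass meanmissing to aggregate in place of the anonymous function. If your DataFrames version no longer provides aggregate (I believe it was deprecated in later releases), something like combine(groupby(df, :B), :A => meanmissing) should be the equivalent, but check the documentation for the version you have installed.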

This is exactly what I was looking for! Thanks a lot. I’m coming from Stata and Pandas, which in this case seem to behave alike.