Hi!
I’m facing the following problem:
using DataFrames, Statistics
df = DataFrame(A = [1, 2, missing, missing, missing, 3, 4, 5],
B = [1, 1, 2, 2, 3, 3, 4, 4])
df_mean = aggregate(df, :B, x -> mean(skipmissing(x)))
The mean function returns NaN when using skipmissing and all the observations in that group are missing. Is there a way to change this behaviour so that it returns missing instead?
Thank you!
The reasoning for this behavior is that skipmissing applied to a vector that contains only missing values should behave much like an empty Int vector, Int[]. The mean is just mean(x) = sum(x) / length(x), so mean(Int[]) evaluates to 0 / 0, which is NaN.
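To make that concrete, here is a minimal sketch comparing the two cases. The variable names are only for illustration, and the all-missing vector is typed as Union{Int, Missing} on the assumption that it matches what a column like A holds within a group.

```julia
using Statistics

# An empty integer vector: the sum is 0 and the length is 0, so the mean is 0 / 0.
empty_mean = mean(Int[])

# A vector that allows Int or missing but holds only missing values.
# After skipmissing there is nothing left to average, just like the empty case.
all_missing = Union{Int, Missing}[missing, missing]
skipped_mean = mean(skipmissing(all_missing))

println(isnan(empty_mean))    # true
println(isnan(skipped_mean))  # true
```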
Are you coming from Stata, by chance? Julia’s behavior mimics R’s, but Stata propagates missing the way you expect it to.
The best approach would be to make a little helper function
meanmissing(x) = all(ismissing, x) ? missing : mean(skipmissing(x))
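As a usage sketch, the helper can be applied per group like this. It assumes a recent DataFrames.jl release and uses groupby with combine, since as far as I know aggregate has been removed from newer versions; on older versions, passing meanmissing to aggregate in place of the anonymous function should work the same way. The output column name A_mean is just an illustrative choice.

```julia
using DataFrames, Statistics

# Helper from the post above: return missing when every value is missing,
# otherwise the mean of the non-missing values.
meanmissing(x) = all(ismissing, x) ? missing : mean(skipmissing(x))

df = DataFrame(A = [1, 2, missing, missing, missing, 3, 4, 5],
               B = [1, 1, 2, 2, 3, 3, 4, 4])

# Group by B and apply the helper to column A in each group.
# A_mean is a hypothetical name for the result column.
df_mean = combine(groupby(df, :B), :A => meanmissing => :A_mean)

println(df_mean)
```

The group with B == 2 contains only missing values, so its mean comes back as missing rather than NaN, while the group with B == 3 still averages its single non-missing value.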
This is exactly what I was looking for! Thanks a lot. I’m coming from Stata and Pandas, which in this case seem to behave alike.