DataFrames, aggregate with missings

donquicote · May 4, 2020, 1:13pm

Hi!

I’m facing the following problem:

using DataFrames, Statistics

df = DataFrame(A = [1, 2, missing, missing, missing, 3, 4, 5],
                B = [1, 1, 2, 2, 3, 3, 4, 4])

df_mean = aggregate(df, :B, x -> mean(skipmissing(x)))

The mean function returns NaN when using skipmissing and all the observations in that group are missing. Is there a way to change this behaviour so that it returns missing as well?

Thank you!

pdeffebach · May 4, 2020, 3:01pm

The reasoning for this behavior is that skipmissing applied to a vector like Union{Int, Missing}[missing, missing] should behave much like an empty Int vector, Int[]. mean is essentially mean(x) = sum(x) / length(x), so for an empty vector it computes 0 / 0, which is why mean(Int[]) returns NaN.
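To make the empty-vector analogy concrete, here is a small sketch using only the Statistics standard library. The variable names are just for illustration, and the element type Union{Int, Missing} is chosen to match column A in the question.

```julia
using Statistics

# A group in which every observation is missing, with the same
# element type as column A in the question.
allmissing = Union{Int, Missing}[missing, missing]

# After skipping missings there is nothing left to average,
# so this behaves like taking the mean of an empty Int vector.
m_skip  = mean(skipmissing(allmissing))
m_empty = mean(Int[])

println(isnan(m_skip))   # both results are NaN (0 / 0)
println(isnan(m_empty))
```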

Are you coming from Stata, by chance? Julia’s behavior mimics R’s, but Stata propagates missing the way you expect it to.

The best approach would be to make a little helper function

meanmissing(x) = all(ismissing, x) ? missing : mean(skipmissing(x))
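As a rough usage sketch, the helper can be applied per group to the data from the question. This assumes a more recent DataFrames.jl release, where groupby plus combine is used in place of aggregate; the output column name A_mean is only an illustrative choice.

```julia
using DataFrames, Statistics

# Helper from the post: return missing when every value is missing,
# otherwise the mean of the non-missing values.
meanmissing(x) = all(ismissing, x) ? missing : mean(skipmissing(x))

df = DataFrame(A = [1, 2, missing, missing, missing, 3, 4, 5],
               B = [1, 1, 2, 2, 3, 3, 4, 4])

# Group by B and apply the helper to column A within each group.
# (Assumes a DataFrames.jl version that provides groupby/combine.)
df_mean = combine(groupby(df, :B), :A => meanmissing => :A_mean)

# The group B == 2 contains only missings, so its A_mean is missing
# instead of NaN; the group B == 3 averages its single non-missing value.
println(df_mean)
```

In this sketch the A_mean column can hold both Float64 values and missing, so downstream code still needs to handle missing explicitly.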

donquicote · May 4, 2020, 3:25pm

This is exactly what I was looking for! Thanks a lot. I’m coming from Stata and Pandas, which in this case seem to behave alike.
