Replace missing values by a group mean

Brinkhuis · June 10, 2022, 8:31pm

In an iris dataframe I replaced some values by missing.
I would like to replace the missing values in the column petal_length by the petal_length mean per species. The code below does work (means before and after replacing values are equal), however I suspect there must be a more efficient way to do this which does not loop through every row while only in some rows values are missing. Also, creating a dictionary is probably not necessary in a more optimised solution. Any suggestions for optimising?

using CSV
using DataFrames
using Random
using Statistics
using StatsBase

download("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv", "iris.csv")
iris = DataFrame(CSV.File("iris.csv", delim = ","))
allowmissing!(iris)

Random.seed!(20_000)
for i in 1:100
    iris[rand(1:nrow(iris)), rand(1:4)] = missing
end

Random.seed!(20_000)
iris[sample(1:nrow(iris), 10), :species] .= missing

mean_per_species = combine(groupby(iris, :species), :petal_length .=> mean∘skipmissing .=> :mean)
mean_per_species_dict = Dict(mean_per_species.species .=> mean_per_species.mean)

for row in eachrow(iris)
    if ismissing(row.petal_length)
        row.petal_length = mean_per_species_dict[row.species]
    end
end

sylvaticus · June 10, 2022, 8:58pm

Are you intentionally using the mean per specie?
Your problem is known as missing imputation, and there are several machine learning algorithms for it.

pdeffebach · June 10, 2022, 10:45pm

How about the following?

using DataFramesMeta
julia> iris_missings = @chain iris begin 
           @transform :petal_length_per_species = begin 
               m = mean(skipmissing(:petal_length))
               [ismissing(l) ? m : l for l in :petal_length]
           end
       end

Brinkhuis · June 11, 2022, 5:16am

I think I found a more optimised way by using groupby.

for group in groupby(iris, :species)
    group[ismissing.(group.petal_length), :petal_length] .= mean(skipmissing(group.petal_length))
end

Brinkhuis · June 11, 2022, 6:07am

Thanks for your reply. However, this does not seem to produce the expected result.
The means before and after replacing missing values should be (approximately) equal.
Based on the compassion below, this does not seem to be the case, isn’t it?

isapprox(
    combine(groupby(iris, :species), :petal_length .=> mean∘skipmissing .=> :mean).mean, 
    combine(groupby(iris_missings, :species), :petal_length .=> mean .=> :mean).mean
)

rocco_sprmnt21 · June 11, 2022, 9:21am

I am not an expert on statistical topics, but I would like to understand better the logic behind the approach you gave to the problem.
What I guess is that by inserting missing in “random” positions within the series of values of one of the quantities, you expect that the average of the non-missing is “approximately equal” to the average of the complete series.
I believe the result of the comparison strongly depends on the type of series you have and on the “choice” criterion of the samples to be set to missing.
However, being in a statistical context what is the “correct” way to compare the two results?
Maybe it’s not that of simple equality or near equality !?

rocco_sprmnt21 · June 11, 2022, 9:26am

making use of the coalesce function to replicate one of the schema used

for group in groupby(iris, :species)
    m=mean(skipmissing(group.sepal_length))
    group.sepal_length.=coalesce.(group.sepal_length,m)
end

pdeffebach · June 11, 2022, 4:45pm

I can’t replicate:

julia> mean_vec_1 = @chain iris begin 
           @by :species :petal_mean = (mean ∘ skipmissing)(:petal_length)
           _.:petal_mean
       end
4-element Vector{Float64}:
 1.4648648648648643
 4.244999999999999
 5.522857142857142
 3.6333333333333337

julia> mean_vec_2 = @chain iris begin 
           groupby(:species)
           @transform :petal_length = begin 
               m = (mean ∘ skipmissing)(:petal_length)
               [ismissing(l) ? m : l for l in :petal_length]
           end
           @by :species :petal_mean = mean(:petal_length)
           _.:petal_mean
       end
4-element Vector{Float64}:
 1.4648648648648646
 4.245
 5.522857142857142
 3.6333333333333337

Topic		Replies	Views
Replacing missings based on values in two DataFrame columns General Usage dataframes	1	309	July 16, 2021
Replacing missing, really Data	8	10769	February 26, 2020
Delete missing values after the last non missing value in each id New to Julia dataframes	7	554	September 1, 2022
Replacing missing and NaN values in dataframe New to Julia question , dataframes , missing-values	6	4021	March 29, 2022
Replacing Nan values with the average column values New to Julia dataframes	4	1098	September 28, 2022

Replace missing values by a group mean

Related topics