Replace missing values by a group mean

In an iris dataframe I replaced some values by missing.
I would like to replace the missing values in the column petal_length by the petal_length mean per species. The code below does work (means before and after replacing values are equal), however I suspect there must be a more efficient way to do this which does not loop through every row while only in some rows values are missing. Also, creating a dictionary is probably not necessary in a more optimised solution. Any suggestions for optimising?

using CSV
using DataFrames
using Random
using Statistics
using StatsBase

download("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv", "iris.csv")
iris = DataFrame(CSV.File("iris.csv", delim = ","))
allowmissing!(iris)

Random.seed!(20_000)
for i in 1:100
    iris[rand(1:nrow(iris)), rand(1:4)] = missing
end

Random.seed!(20_000)
iris[sample(1:nrow(iris), 10), :species] .= missing

mean_per_species = combine(groupby(iris, :species), :petal_length .=> mean∘skipmissing .=> :mean)
mean_per_species_dict = Dict(mean_per_species.species .=> mean_per_species.mean)

for row in eachrow(iris)
    if ismissing(row.petal_length)
        row.petal_length = mean_per_species_dict[row.species]
    end
end

Are you intentionally using the mean per specie?
Your problem is known as missing imputation, and there are several machine learning algorithms for it.

How about the following?

using DataFramesMeta
julia> iris_missings = @chain iris begin 
           @transform :petal_length_per_species = begin 
               m = mean(skipmissing(:petal_length))
               [ismissing(l) ? m : l for l in :petal_length]
           end
       end

I think I found a more optimised way by using groupby.

for group in groupby(iris, :species)
    group[ismissing.(group.petal_length), :petal_length] .= mean(skipmissing(group.petal_length))
end
1 Like

Thanks for your reply. However, this does not seem to produce the expected result.
The means before and after replacing missing values should be (approximately) equal.
Based on the compassion below, this does not seem to be the case, isn’t it?

isapprox(
    combine(groupby(iris, :species), :petal_length .=> mean∘skipmissing .=> :mean).mean, 
    combine(groupby(iris_missings, :species), :petal_length .=> mean .=> :mean).mean
)

I am not an expert on statistical topics, but I would like to understand better the logic behind the approach you gave to the problem.
What I guess is that by inserting missing in “random” positions within the series of values of one of the quantities, you expect that the average of the non-missing is “approximately equal” to the average of the complete series.
I believe the result of the comparison strongly depends on the type of series you have and on the “choice” criterion of the samples to be set to missing.
However, being in a statistical context what is the “correct” way to compare the two results?
Maybe it’s not that of simple equality or near equality !?

making use of the coalesce function to replicate one of the schema used

for group in groupby(iris, :species)
    m=mean(skipmissing(group.sepal_length))
    group.sepal_length.=coalesce.(group.sepal_length,m)
end

I can’t replicate:

julia> mean_vec_1 = @chain iris begin 
           @by :species :petal_mean = (mean ∘ skipmissing)(:petal_length)
           _.:petal_mean
       end
4-element Vector{Float64}:
 1.4648648648648643
 4.244999999999999
 5.522857142857142
 3.6333333333333337

julia> mean_vec_2 = @chain iris begin 
           groupby(:species)
           @transform :petal_length = begin 
               m = (mean ∘ skipmissing)(:petal_length)
               [ismissing(l) ? m : l for l in :petal_length]
           end
           @by :species :petal_mean = mean(:petal_length)
           _.:petal_mean
       end
4-element Vector{Float64}:
 1.4648648648648646
 4.245
 5.522857142857142
 3.6333333333333337