In an iris
dataframe I replaced some values by missing
.
I would like to replace the missing
values in the column petal_length
by the petal_length
mean per species
. The code below does work (means before and after replacing values are equal), however I suspect there must be a more efficient way to do this which does not loop through every row while only in some rows values are missing. Also, creating a dictionary is probably not necessary in a more optimised solution. Any suggestions for optimising?
using CSV
using DataFrames
using Random
using Statistics
using StatsBase
download("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv", "iris.csv")
iris = DataFrame(CSV.File("iris.csv", delim = ","))
allowmissing!(iris)
Random.seed!(20_000)
for i in 1:100
iris[rand(1:nrow(iris)), rand(1:4)] = missing
end
Random.seed!(20_000)
iris[sample(1:nrow(iris), 10), :species] .= missing
mean_per_species = combine(groupby(iris, :species), :petal_length .=> meanâskipmissing .=> :mean)
mean_per_species_dict = Dict(mean_per_species.species .=> mean_per_species.mean)
for row in eachrow(iris)
if ismissing(row.petal_length)
row.petal_length = mean_per_species_dict[row.species]
end
end
Are you intentionally using the mean per specie?
Your problem is known as missing imputation, and there are several machine learning algorithms for it.
How about the following?
using DataFramesMeta
julia> iris_missings = @chain iris begin
@transform :petal_length_per_species = begin
m = mean(skipmissing(:petal_length))
[ismissing(l) ? m : l for l in :petal_length]
end
end
I think I found a more optimised way by using groupby
.
for group in groupby(iris, :species)
group[ismissing.(group.petal_length), :petal_length] .= mean(skipmissing(group.petal_length))
end
1 Like
Thanks for your reply. However, this does not seem to produce the expected result.
The means before and after replacing missing values should be (approximately) equal.
Based on the compassion below, this does not seem to be the case, isnât it?
isapprox(
combine(groupby(iris, :species), :petal_length .=> meanâskipmissing .=> :mean).mean,
combine(groupby(iris_missings, :species), :petal_length .=> mean .=> :mean).mean
)
I am not an expert on statistical topics, but I would like to understand better the logic behind the approach you gave to the problem.
What I guess is that by inserting missing in ârandomâ positions within the series of values of one of the quantities, you expect that the average of the non-missing is âapproximately equalâ to the average of the complete series.
I believe the result of the comparison strongly depends on the type of series you have and on the âchoiceâ criterion of the samples to be set to missing.
However, being in a statistical context what is the âcorrectâ way to compare the two results?
Maybe itâs not that of simple equality or near equality !?
making use of the coalesce function to replicate one of the schema used
for group in groupby(iris, :species)
m=mean(skipmissing(group.sepal_length))
group.sepal_length.=coalesce.(group.sepal_length,m)
end
I canât replicate:
julia> mean_vec_1 = @chain iris begin
@by :species :petal_mean = (mean â skipmissing)(:petal_length)
_.:petal_mean
end
4-element Vector{Float64}:
1.4648648648648643
4.244999999999999
5.522857142857142
3.6333333333333337
julia> mean_vec_2 = @chain iris begin
groupby(:species)
@transform :petal_length = begin
m = (mean â skipmissing)(:petal_length)
[ismissing(l) ? m : l for l in :petal_length]
end
@by :species :petal_mean = mean(:petal_length)
_.:petal_mean
end
4-element Vector{Float64}:
1.4648648648648646
4.245
5.522857142857142
3.6333333333333337