Given a DataFrame with fields :a
and :b
such that :a
is a group variable and :b
is a value variable which has many missing values. What is the optimal method to fill the missing values in :b
to satisfy the following condition? Replace missing :b
value for each group in :a
with the closest non-missing row value in :b
within that group.
Here is one method. Are there better alternatives? The method employs a function defined within a loop over groups. The function references and updates a default value within the loop. A transform!
within the loop uses the function to update the group/parent data frame.
# first, the df
let
df = DataFrame(a=repeat([1, 2, 3, 4], inner=[10]), b = repeat([[missing, missing, 50, missing, missing]; 10; repeat([missing], 4)], outer=[4]))
end
# now the update method
let
df = DataFrame(a=repeat([1, 2, 3, 4], inner=[10]), b = repeat([[missing, missing, 50, missing, missing]; 10; repeat([missing], 4)], outer=[4]))
gdf = groupby(df, :a)
for g in gdf
dvalue = coalesce(g.b...)
function cvalue(v)
cvalue = coalesce(v, dvalue)
dvalue = cvalue
end
transform!(g, :b => ByRow(cvalue) => :c)
end
return df
end
Thanks.
filldown(v)=accumulate((x,y)->coalesce(y,x), v,init=v[1])
fillup(v)=reverse(filldown(reverse(v)))
combine(combine(gdf, :b=>filldown =>:b),:b=>fillup)
Your example does not do āthe closest non-missing row valueā but rather carries forward values and falls back to backward if the first observations are missing. If this is what you want apart from what @rocco_sprmnt21 proposed you can use a combination of locf
and nocb
functions from Impute.jl.
Both solutions work. And in working through them, I learned more base Julia from @rocco_sprmnt21 and about a neat package from @bkamins who is also correct in restating my problem.
Thanks so much. What a great community!
Slightly modified for clarity and summarized for future seekers:
using DataFrames, Impute
filldown(v) = accumulate( (x, y) -> coalesce(y, x), v, init = v[1] )
fillup(v) = reverse(filldown(reverse(v)))
let
df = DataFrame(a=repeat([1, 2, 3, 4], inner=[10]), b = repeat([[missing, missing, 50, missing, missing]; 10; repeat([missing], 4)], outer=[4]))
gdf = groupby(df, :a)
for g in gdf
transform!(g, :b => filldown => :c)
transform!(g, :c => fillup => :c)
end
return df
end
let
df = DataFrame(a=repeat([1, 2, 3, 4], inner=[10]), b = repeat([[missing, missing, 50, missing, missing]; 10; repeat([missing], 4)], outer=[4]))
gdf = groupby(df, :a)
for g in gdf
transform!(g, :b => Impute.locf => :c)
transform!(g, :c => Impute.nocb => :c)
end
return df
end
1 Like
Note that you can also do:
df = DataFrame(a=repeat([1, 2, 3, 4], inner=[10]), b = repeat([[missing, missing, 50, missing, missing]; 10; repeat([missing], 4)], outer=[4]))
transform!(groupby(df, :a), :b => Impute.nocbāImpute.locf => :c)
1 Like