Hello there!
How do I proceed from here (describe(df, :nmissing)) to get the percentage of missing values per column in my DataFrame df?
Hi.
You could try something like this:
import DataFrames as Dfs
import Statistics as Stats
df = Dfs.DataFrame(
    :col1 => rand([missing, 1:6...], 10),
    :col2 => rand([missing, 1:6...], 10)
)
map(ismissing, df[!, "col1"]) |> xs -> Stats.mean(xs) * 100
Short explanation:
map applies ismissing to every element of col1 and returns a vector of Bools. That vector is piped (|>) into an anonymous function that names it xs and computes its mean with Stats.mean(xs), where true counts as 1 and false as 0. The mean is the fraction of missing values, so multiplying it by 100 gives the percentage.
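To make the true-as-1 point concrete, here is a minimal sketch that needs only the Statistics standard library. The vector v is made up for illustration, and the last line uses the two-argument form Stats.mean(f, itr), which should give the same number without building the intermediate Bool vector.

```julia
import Statistics as Stats

v = [1, missing, 3, missing, 5]               # made-up data: 2 of 5 values are missing
flags = map(ismissing, v)                     # Bool vector marking the missing entries
pct = Stats.mean(flags) * 100                 # true counts as 1, false as 0
pct_direct = Stats.mean(ismissing, v) * 100   # same idea, no intermediate vector
println(pct, " ", pct_direct)
```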
For all columns you could go with:
for c in Dfs.names(df)
    println("%missing in $c = ",
        map(ismissing, df[!, c]) |> xs -> Stats.mean(xs) * 100)
end
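If you would rather collect the percentages in a table than print them, one possible sketch is below. The helper name pct_missing and the small example df are hypothetical, chosen only for illustration; the code relies on Dfs.eachcol to iterate over columns and on Stats.mean(ismissing, col) to get the fraction of missing values in each one.

```julia
import DataFrames as Dfs
import Statistics as Stats

# Hypothetical helper: one row per column, with the percentage of missing values
function pct_missing(df)
    return Dfs.DataFrame(
        variable = Dfs.names(df),
        perc_missing = [100 * Stats.mean(ismissing, col) for col in Dfs.eachcol(df)],
    )
end

# Small made-up example: col1 has 2 of 4 missing, col2 has 1 of 4 missing
df = Dfs.DataFrame(:col1 => [1, missing, 3, missing], :col2 => [missing, 2, 3, 4])
res = pct_missing(df)
println(res)
```

Returning a DataFrame rather than printing keeps the result easy to sort or filter afterwards, e.g. to list only the columns above some missingness threshold.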
describe(df) produces another DataFrame, so you can just add a column to that:
julia> df = DataFrame(rand([missing; 1:6], 10_000, 3), :auto);
julia> x = describe(df, :nmissing);
julia> x.perc_missing = 100 .* x.nmissing ./ nrow(df); x
3×3 DataFrame
 Row │ variable  nmissing  perc_missing
     │ Symbol    Int64     Float64
─────┼──────────────────────────────────
   1 │ x1            1408         14.08
   2 │ x2            1432         14.32
   3 │ x3            1455         14.55
Thank you!