Hello there!
How do I proceed from here (describe(df, :nmissing)) to get the percentage of missing values per column in my df DataFrame?
Hi.
You could try something like:
import DataFrames as Dfs
import Statistics as Stats
df = Dfs.DataFrame(
    :col1 => rand([missing, 1:6...], 10),
    :col2 => rand([missing, 1:6...], 10)
)
map(ismissing, df[!, "col1"]) |> xs -> Stats.mean(xs) * 100
Short explanation: map applies the ismissing function to each element of col1 and returns a vector of Bools. That vector is piped into an anonymous function that names it xs and calculates its mean (Stats.mean(xs); true is treated as 1, false as 0). The mean is then multiplied by 100 to express the result as a percentage.
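To make the arithmetic concrete, here is a small sketch on a hand-written vector (the name col and its values are made up for illustration). It also shows that Stats.mean accepts a function as its first argument, so the intermediate Bool vector is optional:

```julia
import Statistics as Stats

# Illustrative data: 2 of the 5 entries are missing.
col = [1, missing, 3, missing, 5]

# Step by step: Bool vector first, then its mean scaled to a percentage.
flags = map(ismissing, col)      # [false, true, false, true, false]
pct = Stats.mean(flags) * 100    # true counts as 1, false as 0

# Same result without materializing the Bool vector.
pct_direct = Stats.mean(ismissing, col) * 100
```

Both pct and pct_direct should come out as 40 percent, since two of the five entries are missing.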
For all columns you could go with:
for c in Dfs.names(df)
    println("%missing in $c = ",
        map(ismissing, df[!, c]) |> xs -> Stats.mean(xs) * 100)
end
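If you would rather collect the percentages than print them, one possible variation is to build a small summary table. This is only a sketch: the toy data and the column names variable and perc_missing are my own choices, not anything DataFrames requires.

```julia
import DataFrames as Dfs
import Statistics as Stats

# Toy data with known missingness: 1 of 4 in col1, 2 of 4 in col2.
df = Dfs.DataFrame(
    :col1 => [1, missing, 3, 4],
    :col2 => [missing, missing, 3, 4],
)

# One row per column of df: its name and the percentage of missing values.
pct = Dfs.DataFrame(
    :variable => Dfs.names(df),
    :perc_missing => [Stats.mean(ismissing, col) * 100 for col in Dfs.eachcol(df)],
)
```

With this toy data, pct.perc_missing should hold 25 and 50 for col1 and col2 respectively.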
describe(df) produces another DataFrame, so you can just add a column to that:
julia> df = DataFrame(rand([missing; 1:6], 10_000, 3), :auto);
julia> x = describe(df, :nmissing);
julia> x.perc_missing = 100 .* x.nmissing ./ nrow(df); x
3×3 DataFrame
 Row │ variable  nmissing  perc_missing
     │ Symbol    Int64     Float64
─────┼──────────────────────────────────
   1 │ x1            1408         14.08
   2 │ x2            1432         14.32
   3 │ x3            1455         14.55
Thank you!