% of missingness per column

Hello there!
How do I proceed from here (describe(df, :nmissing)) to get the percentage of missing values per column in my df DataFrame?

Hi.

You could try something like this:

import DataFrames as Dfs
import Statistics as Stats

df = Dfs.DataFrame(
    :col1 => rand([missing, 1:6...], 10), 
    :col2 => rand([missing, 1:6...], 10)
)

map(ismissing, df[!, "col1"]) |> xs -> Stats.mean(xs) * 100

Short explanation:
map applies the ismissing function to every element of col1 and returns a vector of Bools. That vector is piped into an anonymous function, which names it xs and computes its mean with Stats.mean(xs) (true is treated as 1, false as 0). Multiplying by 100 expresses the result as a percentage.
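To see why the mean of the Bools gives the share of missing values, here is a tiny sketch on a hand-made vector (v, flags, pct and pct_short are just illustrative names, not part of the original example). It also shows that mean accepts a function as its first argument, which should give the same result without building the intermediate vector.

```julia
import Statistics as Stats

# Hypothetical data: 2 of the 4 entries are missing
v = [1, missing, 3, missing]

# Vector of Bools: true where the entry is missing
flags = map(ismissing, v)

# true counts as 1 and false as 0, so the mean is the fraction of missing entries
pct = Stats.mean(flags) * 100

# Same calculation; mean applies ismissing to each element itself
pct_short = Stats.mean(ismissing, v) * 100

println(pct, " ", pct_short)
```

With two missing entries out of four, both values come out as 50.0.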

For all columns you could go with:

for c in Dfs.names(df)
    println("%missing in $c = ", 
            map(ismissing, df[!, c]) |> xs -> Stats.mean(xs) * 100)
end
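
If you would rather collect the results in a table than print them, a minimal sketch along the same lines could look like this. The small df and the names pct_missing, variable and perc_missing are made up for illustration.

```julia
import DataFrames as Dfs
import Statistics as Stats

# Hypothetical data with a known number of missing values per column
df = Dfs.DataFrame(
    :col1 => [1, missing, 3, missing],
    :col2 => [missing, 2, 3, 4]
)

# One row per column of df: its name and the percentage of missing values
pct_missing = Dfs.DataFrame(
    variable = Dfs.names(df),
    perc_missing = [100 * Stats.mean(ismissing, col) for col in Dfs.eachcol(df)]
)

println(pct_missing)
```

Here col1 has 2 of 4 values missing and col2 has 1 of 4, so perc_missing should be 50.0 and 25.0.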

describe(df) produces another DataFrame, so you can just add a column to that:

julia> df = DataFrame(rand([missing; 1:6], 10_000, 3), :auto);

julia> x = describe(df, :nmissing);

julia> x.perc_missing = 100 .* x.nmissing ./ nrow(df); x
3×3 DataFrame
 Row │ variable  nmissing  perc_missing
     │ Symbol    Int64     Float64
─────┼──────────────────────────────────
   1 │ x1            1408         14.08
   2 │ x2            1432         14.32
   3 │ x3            1455         14.55

Thank you!