I’m looking to print out a summary of missing values in each row and column for a dataframe, but ideally I want to print it alongside the actual dataframe itself – in my head it looks something like this (pardon the crude illustration, but I couldn’t find something that showed what I wanted offhand).
The specific problem I have here is twofold – one, I need a way for the added information to stand out, but that I think I can remedy using PrettyTables.jl. The real problem is that I can’t figure out a way to stack the values together when the types in the dataframe don’t match that of the missing value percentages. Is there any way in which I can do this?
To get started, consider this approach, which can be further decorated using pretty_table
julia> # Create DataFrame from random data with random missing values (not shown)
df = DataFrame(data, :auto)
# Calculate missing percentages for each column
10×10 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9 ⋯
│ Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float6 ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.808288 1.21928 2.00811 1.90263 missing -1.38289 -2.52561 0.18364 ⋯
2 │ missing 0.292914 0.76503 missing -1.53561 -0.692244 0.850062 -0.967371
3 │ -1.10464 -0.0311481 0.180254 0.364655 -0.396988 -0.956001 0.627024 0.29823 -
4 │ missing 0.315833 2.02891 0.785914 1.20248 -1.78927 -0.061257 -0.164931
5 │ 0.287588 -2.16238 -0.127637 0.170638 -0.720676 0.161147 0.267154 -1.41293 - ⋯
6 │ 0.229819 -0.890077 0.469477 -1.28902 -1.10708 0.758747 -1.46561 1.14809 -
7 │ -0.421769 0.854242 -1.52688 0.937057 -1.26622 0.525484 0.696236 -0.27827 -
8 │ -1.35559 0.341782 0.600132 0.508086 -0.722404 -0.904278 -0.224961 0.336889 missin
9 │ missing -0.31887 -2.05386 -2.24974 0.523905 0.52406 0.0893458 missing missin ⋯
10 │ -0.117323 -0.337454 -0.828057 0.125302 1.25942 0.747131 0.921248 0.624411
2 columns omitted
julia> col_missing_pct = [sum(ismissing.(df[:, col])) / nrow(df) * 100 for col in names(df)]
# Calculate missing percentages for each row
10-element Vector{Float64}:
30.0
0.0
0.0
10.0
10.0
0.0
0.0
10.0
20.0
20.0
julia> row_missing_pct = [sum(ismissing.(Vector(df[row, :]))) / ncol(df) * 100 for row in 1:nrow(df)]
# Add column with row percentages
10-element Vector{Float64}:
20.0
30.0
0.0
10.0
0.0
0.0
0.0
10.0
30.0
0.0
julia> df.row_missing_pct = row_missing_pct
# Create a summary row with column percentages
10-element Vector{Float64}:
20.0
30.0
0.0
10.0
0.0
0.0
0.0
10.0
30.0
0.0
julia> summary_row = DataFrame()
0×0 DataFrame
julia> for col in names(df)[1:end-1] # Exclude the row_missing_pct column
summary_row[!, col] = [col_missing_pct[findfirst(==(col), names(df)[1:end-1])]]
end
julia> summary_row.row_missing_pct = [missing] # No percentage for the percentage column itself
# Combine original DataFrame with summary row
1-element Vector{Missing}:
missing
julia> df_with_summary = vcat(df, summary_row)
# Rename the last row for clarity
11×11 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9 ⋯
│ Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float6 ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.808288 1.21928 2.00811 1.90263 missing -1.38289 -2.52561 0.18364 ⋯
2 │ missing 0.292914 0.76503 missing -1.53561 -0.692244 0.850062 -0.967371
3 │ -1.10464 -0.0311481 0.180254 0.364655 -0.396988 -0.956001 0.627024 0.29823 -
4 │ missing 0.315833 2.02891 0.785914 1.20248 -1.78927 -0.061257 -0.164931
5 │ 0.287588 -2.16238 -0.127637 0.170638 -0.720676 0.161147 0.267154 -1.41293 - ⋯
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
8 │ -1.35559 0.341782 0.600132 0.508086 -0.722404 -0.904278 -0.224961 0.336889 missin
9 │ missing -0.31887 -2.05386 -2.24974 0.523905 0.52406 0.0893458 missing missin
10 │ -0.117323 -0.337454 -0.828057 0.125302 1.25942 0.747131 0.921248 0.624411
11 │ 30.0 0.0 0.0 10.0 10.0 0.0 0.0 10.0 2 ⋯
3 columns and 2 rows omitted
julia> rename!(df_with_summary, :row_missing_pct => :row_missing_pct)
# Display the result
11×11 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9 ⋯
│ Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float6 ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.808288 1.21928 2.00811 1.90263 missing -1.38289 -2.52561 0.18364 ⋯
2 │ missing 0.292914 0.76503 missing -1.53561 -0.692244 0.850062 -0.967371
3 │ -1.10464 -0.0311481 0.180254 0.364655 -0.396988 -0.956001 0.627024 0.29823 -
4 │ missing 0.315833 2.02891 0.785914 1.20248 -1.78927 -0.061257 -0.164931
5 │ 0.287588 -2.16238 -0.127637 0.170638 -0.720676 0.161147 0.267154 -1.41293 - ⋯
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
8 │ -1.35559 0.341782 0.600132 0.508086 -0.722404 -0.904278 -0.224961 0.336889 missin
9 │ missing -0.31887 -2.05386 -2.24974 0.523905 0.52406 0.0893458 missing missin
10 │ -0.117323 -0.337454 -0.828057 0.125302 1.25942 0.747131 0.921248 0.624411
11 │ 30.0 0.0 0.0 10.0 10.0 0.0 0.0 10.0 2 ⋯
3 columns and 2 rows omitted
julia> println("DataFrame with 10% missing values:")
DataFrame with 10% missing values:
julia> println("(Last column shows % missing per row, last row shows % missing per column)")
(Last column shows % missing per row, last row shows % missing per column)
julia> println(df_with_summary)
# Check overall percentage of missing values
11×11 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 row_missing_pct
│ Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float64?
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.808288 1.21928 2.00811 1.90263 missing -1.38289 -2.52561 0.18364 0.566026 missing 20.0
2 │ missing 0.292914 0.76503 missing -1.53561 -0.692244 0.850062 -0.967371 0.615016 missing 30.0
3 │ -1.10464 -0.0311481 0.180254 0.364655 -0.396988 -0.956001 0.627024 0.29823 -0.690379 0.840295 0.0
4 │ missing 0.315833 2.02891 0.785914 1.20248 -1.78927 -0.061257 -0.164931 0.518368 0.379051 10.0
5 │ 0.287588 -2.16238 -0.127637 0.170638 -0.720676 0.161147 0.267154 -1.41293 -1.19626 -0.108625 0.0
6 │ 0.229819 -0.890077 0.469477 -1.28902 -1.10708 0.758747 -1.46561 1.14809 -0.424905 -0.678775 0.0
7 │ -0.421769 0.854242 -1.52688 0.937057 -1.26622 0.525484 0.696236 -0.27827 -0.220797 -1.08822 0.0
8 │ -1.35559 0.341782 0.600132 0.508086 -0.722404 -0.904278 -0.224961 0.336889 missing 0.703758 10.0
9 │ missing -0.31887 -2.05386 -2.24974 0.523905 0.52406 0.0893458 missing missing 0.143326 30.0
10 │ -0.117323 -0.337454 -0.828057 0.125302 1.25942 0.747131 0.921248 0.624411 0.130942 0.148375 0.0
11 │ 30.0 0.0 0.0 10.0 10.0 0.0 0.0 10.0 20.0 20.0 missing
Since this is a presentation table, rather than a data table, you can deal with the type-mismatch potential for the last row by saving it separately, then convert the rest to strings, convert the last row saved out to strings and vcat.
Converting to strings fixed it, thank you! Is there a way that I can add a delimiting line after the dataframe and before the percentages while displaying? I looked in PrettyTables.jl but couldn’t find it immediately 
Try this
using PrettyTables
hlines = [0, 1, 11] # Top border, header separator, and before last row
pretty_table(df_with_summary,
show_row_number = false,
hlines = hlines,
header_crayon = crayon"bold",
)
Thank you so much, that works perfectly!