I’m looking to print out a summary of missing values in each row and column for a dataframe, but ideally I want to print it alongside the actual dataframe itself – in my head it looks something like this (pardon the crude illustration, but I couldn’t find something that showed what I wanted offhand).
The specific problem I have here is twofold – one, I need a way for the added information to stand out, but that I think I can remedy using PrettyTables.jl. The real problem is that I can’t figure out a way to stack the values together when the types in the dataframe don’t match that of the missing value percentages. Is there any way in which I can do this?
To get started, consider this approach, which can be further decorated using pretty_table
julia> # Create DataFrame from random data with random missing values (not shown)
df = DataFrame(data, :auto)
# Calculate missing percentages for each column
10×10 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9 ⋯
│ Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float6 ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.808288 1.21928 2.00811 1.90263 missing -1.38289 -2.52561 0.18364 ⋯
2 │ missing 0.292914 0.76503 missing -1.53561 -0.692244 0.850062 -0.967371
3 │ -1.10464 -0.0311481 0.180254 0.364655 -0.396988 -0.956001 0.627024 0.29823 -
4 │ missing 0.315833 2.02891 0.785914 1.20248 -1.78927 -0.061257 -0.164931
5 │ 0.287588 -2.16238 -0.127637 0.170638 -0.720676 0.161147 0.267154 -1.41293 - ⋯
6 │ 0.229819 -0.890077 0.469477 -1.28902 -1.10708 0.758747 -1.46561 1.14809 -
7 │ -0.421769 0.854242 -1.52688 0.937057 -1.26622 0.525484 0.696236 -0.27827 -
8 │ -1.35559 0.341782 0.600132 0.508086 -0.722404 -0.904278 -0.224961 0.336889 missin
9 │ missing -0.31887 -2.05386 -2.24974 0.523905 0.52406 0.0893458 missing missin ⋯
10 │ -0.117323 -0.337454 -0.828057 0.125302 1.25942 0.747131 0.921248 0.624411
2 columns omitted
julia> col_missing_pct = [sum(ismissing.(df[:, col])) / nrow(df) * 100 for col in names(df)]
# Calculate missing percentages for each row
10-element Vector{Float64}:
30.0
0.0
0.0
10.0
10.0
0.0
0.0
10.0
20.0
20.0
julia> row_missing_pct = [sum(ismissing.(Vector(df[row, :]))) / ncol(df) * 100 for row in 1:nrow(df)]
# Add column with row percentages
10-element Vector{Float64}:
20.0
30.0
0.0
10.0
0.0
0.0
0.0
10.0
30.0
0.0
julia> df.row_missing_pct = row_missing_pct
# Create a summary row with column percentages
10-element Vector{Float64}:
20.0
30.0
0.0
10.0
0.0
0.0
0.0
10.0
30.0
0.0
julia> summary_row = DataFrame()
0×0 DataFrame
julia> for col in names(df)[1:end-1] # Exclude the row_missing_pct column
summary_row[!, col] = [col_missing_pct[findfirst(==(col), names(df)[1:end-1])]]
end
julia> summary_row.row_missing_pct = [missing] # No percentage for the percentage column itself
# Combine original DataFrame with summary row
1-element Vector{Missing}:
missing
julia> df_with_summary = vcat(df, summary_row)
# Rename the last row for clarity
11×11 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9 ⋯
│ Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float6 ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.808288 1.21928 2.00811 1.90263 missing -1.38289 -2.52561 0.18364 ⋯
2 │ missing 0.292914 0.76503 missing -1.53561 -0.692244 0.850062 -0.967371
3 │ -1.10464 -0.0311481 0.180254 0.364655 -0.396988 -0.956001 0.627024 0.29823 -
4 │ missing 0.315833 2.02891 0.785914 1.20248 -1.78927 -0.061257 -0.164931
5 │ 0.287588 -2.16238 -0.127637 0.170638 -0.720676 0.161147 0.267154 -1.41293 - ⋯
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
8 │ -1.35559 0.341782 0.600132 0.508086 -0.722404 -0.904278 -0.224961 0.336889 missin
9 │ missing -0.31887 -2.05386 -2.24974 0.523905 0.52406 0.0893458 missing missin
10 │ -0.117323 -0.337454 -0.828057 0.125302 1.25942 0.747131 0.921248 0.624411
11 │ 30.0 0.0 0.0 10.0 10.0 0.0 0.0 10.0 2 ⋯
3 columns and 2 rows omitted
julia> rename!(df_with_summary, :row_missing_pct => :row_missing_pct)
# Display the result
11×11 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9 ⋯
│ Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float6 ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.808288 1.21928 2.00811 1.90263 missing -1.38289 -2.52561 0.18364 ⋯
2 │ missing 0.292914 0.76503 missing -1.53561 -0.692244 0.850062 -0.967371
3 │ -1.10464 -0.0311481 0.180254 0.364655 -0.396988 -0.956001 0.627024 0.29823 -
4 │ missing 0.315833 2.02891 0.785914 1.20248 -1.78927 -0.061257 -0.164931
5 │ 0.287588 -2.16238 -0.127637 0.170638 -0.720676 0.161147 0.267154 -1.41293 - ⋯
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
8 │ -1.35559 0.341782 0.600132 0.508086 -0.722404 -0.904278 -0.224961 0.336889 missin
9 │ missing -0.31887 -2.05386 -2.24974 0.523905 0.52406 0.0893458 missing missin
10 │ -0.117323 -0.337454 -0.828057 0.125302 1.25942 0.747131 0.921248 0.624411
11 │ 30.0 0.0 0.0 10.0 10.0 0.0 0.0 10.0 2 ⋯
3 columns and 2 rows omitted
julia> println("DataFrame with 10% missing values:")
DataFrame with 10% missing values:
julia> println("(Last column shows % missing per row, last row shows % missing per column)")
(Last column shows % missing per row, last row shows % missing per column)
julia> println(df_with_summary)
# Check overall percentage of missing values
11×11 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 row_missing_pct
│ Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float64?
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.808288 1.21928 2.00811 1.90263 missing -1.38289 -2.52561 0.18364 0.566026 missing 20.0
2 │ missing 0.292914 0.76503 missing -1.53561 -0.692244 0.850062 -0.967371 0.615016 missing 30.0
3 │ -1.10464 -0.0311481 0.180254 0.364655 -0.396988 -0.956001 0.627024 0.29823 -0.690379 0.840295 0.0
4 │ missing 0.315833 2.02891 0.785914 1.20248 -1.78927 -0.061257 -0.164931 0.518368 0.379051 10.0
5 │ 0.287588 -2.16238 -0.127637 0.170638 -0.720676 0.161147 0.267154 -1.41293 -1.19626 -0.108625 0.0
6 │ 0.229819 -0.890077 0.469477 -1.28902 -1.10708 0.758747 -1.46561 1.14809 -0.424905 -0.678775 0.0
7 │ -0.421769 0.854242 -1.52688 0.937057 -1.26622 0.525484 0.696236 -0.27827 -0.220797 -1.08822 0.0
8 │ -1.35559 0.341782 0.600132 0.508086 -0.722404 -0.904278 -0.224961 0.336889 missing 0.703758 10.0
9 │ missing -0.31887 -2.05386 -2.24974 0.523905 0.52406 0.0893458 missing missing 0.143326 30.0
10 │ -0.117323 -0.337454 -0.828057 0.125302 1.25942 0.747131 0.921248 0.624411 0.130942 0.148375 0.0
11 │ 30.0 0.0 0.0 10.0 10.0 0.0 0.0 10.0 20.0 20.0 missing
Since this is a presentation table, rather than a data table, you can deal with the type-mismatch potential for the last row by saving it separately, then convert the rest to strings, convert the last row saved out to strings and vcat.
1 Like
Converting to strings fixed it, thank you! Is there a way that I can add a delimiting line after the dataframe and before the percentages while displaying? I looked in PrettyTables.jl but couldn’t find it immediately 
Try this
using PrettyTables
hlines = [0, 1, 11] # Top border, header separator, and before last row
pretty_table(df_with_summary,
show_row_number = false,
hlines = hlines,
header_crayon = crayon"bold",
)