Pretty print summary statistics for a dataframe

I’m looking to print out a summary of missing values in each row and column for a dataframe, but ideally I want to print it alongside the actual dataframe itself – in my head it looks something like this (pardon the crude illustration, but I couldn’t find something that showed what I wanted offhand).

The specific problem I have here is twofold – one, I need a way for the added information to stand out, but that I think I can remedy using PrettyTables.jl. The real problem is that I can’t figure out a way to stack the values together when the types in the dataframe don’t match that of the missing value percentages. Is there any way in which I can do this?

To get started, consider this approach, which can be further decorated using pretty_table

julia> # Create DataFrame from random data with random missing values (not shown)
       df = DataFrame(data, :auto)

       # Calculate missing percentages for each column
10×10 DataFrame
 Row │ x1              x2          x3         x4              x5              x6         x7          x8              x9     ⋯
     │ Float64?        Float64?    Float64?   Float64?        Float64?        Float64?   Float64?    Float64?        Float6 ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       0.808288   1.21928     2.00811         1.90263   missing         -1.38289   -2.52561          0.18364          ⋯
   2 │ missing          0.292914    0.76503   missing              -1.53561   -0.692244   0.850062        -0.967371
   3 │      -1.10464   -0.0311481   0.180254        0.364655       -0.396988  -0.956001   0.627024         0.29823        -
   4 │ missing          0.315833    2.02891         0.785914        1.20248   -1.78927   -0.061257        -0.164931
   5 │       0.287588  -2.16238    -0.127637        0.170638       -0.720676   0.161147   0.267154        -1.41293        - ⋯
   6 │       0.229819  -0.890077    0.469477       -1.28902        -1.10708    0.758747  -1.46561          1.14809        -
   7 │      -0.421769   0.854242   -1.52688         0.937057       -1.26622    0.525484   0.696236        -0.27827        -
   8 │      -1.35559    0.341782    0.600132        0.508086       -0.722404  -0.904278  -0.224961         0.336889  missin
   9 │ missing         -0.31887    -2.05386        -2.24974         0.523905   0.52406    0.0893458  missing         missin ⋯
  10 │      -0.117323  -0.337454   -0.828057        0.125302        1.25942    0.747131   0.921248         0.624411
                                                                                                            2 columns omitted

julia> col_missing_pct = [sum(ismissing.(df[:, col])) / nrow(df) * 100 for col in names(df)]

       # Calculate missing percentages for each row
10-element Vector{Float64}:
 30.0
  0.0
  0.0
 10.0
 10.0
  0.0
  0.0
 10.0
 20.0
 20.0

julia> row_missing_pct = [sum(ismissing.(Vector(df[row, :]))) / ncol(df) * 100 for row in 1:nrow(df)]

       # Add column with row percentages
10-element Vector{Float64}:
 20.0
 30.0
  0.0
 10.0
  0.0
  0.0
  0.0
 10.0
 30.0
  0.0

julia> df.row_missing_pct = row_missing_pct

       # Create a summary row with column percentages
10-element Vector{Float64}:
 20.0
 30.0
  0.0
 10.0
  0.0
  0.0
  0.0
 10.0
 30.0
  0.0

julia> summary_row = DataFrame()
0×0 DataFrame

julia> for col in names(df)[1:end-1]  # Exclude the row_missing_pct column
           summary_row[!, col] = [col_missing_pct[findfirst(==(col), names(df)[1:end-1])]]
       end

julia> summary_row.row_missing_pct = [missing]  # No percentage for the percentage column itself

       # Combine original DataFrame with summary row
1-element Vector{Missing}:
 missing

julia> df_with_summary = vcat(df, summary_row)

       # Rename the last row for clarity
11×11 DataFrame
 Row │ x1              x2          x3         x4              x5              x6         x7          x8              x9     ⋯
     │ Float64?        Float64?    Float64?   Float64?        Float64?        Float64?   Float64?    Float64?        Float6 ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       0.808288   1.21928     2.00811         1.90263   missing         -1.38289   -2.52561          0.18364          ⋯
   2 │ missing          0.292914    0.76503   missing              -1.53561   -0.692244   0.850062        -0.967371
   3 │      -1.10464   -0.0311481   0.180254        0.364655       -0.396988  -0.956001   0.627024         0.29823        -
   4 │ missing          0.315833    2.02891         0.785914        1.20248   -1.78927   -0.061257        -0.164931
   5 │       0.287588  -2.16238    -0.127637        0.170638       -0.720676   0.161147   0.267154        -1.41293        - ⋯
  ⋮  │       ⋮             ⋮           ⋮            ⋮               ⋮             ⋮          ⋮             ⋮                ⋱
   8 │      -1.35559    0.341782    0.600132        0.508086       -0.722404  -0.904278  -0.224961         0.336889  missin
   9 │ missing         -0.31887    -2.05386        -2.24974         0.523905   0.52406    0.0893458  missing         missin
  10 │      -0.117323  -0.337454   -0.828057        0.125302        1.25942    0.747131   0.921248         0.624411
  11 │      30.0        0.0         0.0            10.0            10.0        0.0        0.0             10.0            2 ⋯
                                                                                                 3 columns and 2 rows omitted

julia> rename!(df_with_summary, :row_missing_pct => :row_missing_pct)

       # Display the result
11×11 DataFrame
 Row │ x1              x2          x3         x4              x5              x6         x7          x8              x9     ⋯
     │ Float64?        Float64?    Float64?   Float64?        Float64?        Float64?   Float64?    Float64?        Float6 ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       0.808288   1.21928     2.00811         1.90263   missing         -1.38289   -2.52561          0.18364          ⋯
   2 │ missing          0.292914    0.76503   missing              -1.53561   -0.692244   0.850062        -0.967371
   3 │      -1.10464   -0.0311481   0.180254        0.364655       -0.396988  -0.956001   0.627024         0.29823        -
   4 │ missing          0.315833    2.02891         0.785914        1.20248   -1.78927   -0.061257        -0.164931
   5 │       0.287588  -2.16238    -0.127637        0.170638       -0.720676   0.161147   0.267154        -1.41293        - ⋯
  ⋮  │       ⋮             ⋮           ⋮            ⋮               ⋮             ⋮          ⋮             ⋮                ⋱
   8 │      -1.35559    0.341782    0.600132        0.508086       -0.722404  -0.904278  -0.224961         0.336889  missin
   9 │ missing         -0.31887    -2.05386        -2.24974         0.523905   0.52406    0.0893458  missing         missin
  10 │      -0.117323  -0.337454   -0.828057        0.125302        1.25942    0.747131   0.921248         0.624411
  11 │      30.0        0.0         0.0            10.0            10.0        0.0        0.0             10.0            2 ⋯
                                                                                                 3 columns and 2 rows omitted

julia> println("DataFrame with 10% missing values:")
DataFrame with 10% missing values:

julia> println("(Last column shows % missing per row, last row shows % missing per column)")
(Last column shows % missing per row, last row shows % missing per column)

julia> println(df_with_summary)

       # Check overall percentage of missing values
11×11 DataFrame
 Row │ x1              x2          x3         x4              x5              x6         x7          x8              x9              x10             row_missing_pct 
     │ Float64?        Float64?    Float64?   Float64?        Float64?        Float64?   Float64?    Float64?        Float64?        Float64?        Float64?        
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       0.808288   1.21928     2.00811         1.90263   missing         -1.38289   -2.52561          0.18364         0.566026  missing                    20.0
   2 │ missing          0.292914    0.76503   missing              -1.53561   -0.692244   0.850062        -0.967371        0.615016  missing                    30.0
   3 │      -1.10464   -0.0311481   0.180254        0.364655       -0.396988  -0.956001   0.627024         0.29823        -0.690379        0.840295              0.0
   4 │ missing          0.315833    2.02891         0.785914        1.20248   -1.78927   -0.061257        -0.164931        0.518368        0.379051             10.0
   5 │       0.287588  -2.16238    -0.127637        0.170638       -0.720676   0.161147   0.267154        -1.41293        -1.19626        -0.108625              0.0
   6 │       0.229819  -0.890077    0.469477       -1.28902        -1.10708    0.758747  -1.46561          1.14809        -0.424905       -0.678775              0.0
   7 │      -0.421769   0.854242   -1.52688         0.937057       -1.26622    0.525484   0.696236        -0.27827        -0.220797       -1.08822               0.0
   8 │      -1.35559    0.341782    0.600132        0.508086       -0.722404  -0.904278  -0.224961         0.336889  missing               0.703758             10.0
   9 │ missing         -0.31887    -2.05386        -2.24974         0.523905   0.52406    0.0893458  missing         missing               0.143326             30.0
  10 │      -0.117323  -0.337454   -0.828057        0.125302        1.25942    0.747131   0.921248         0.624411        0.130942        0.148375              0.0
  11 │      30.0        0.0         0.0            10.0            10.0        0.0        0.0             10.0            20.0            20.0             missing   

Since this is a presentation table, rather than a data table, you can deal with the type-mismatch potential for the last row by saving it separately, then convert the rest to strings, convert the last row saved out to strings and vcat.

Converting to strings fixed it, thank you! Is there a way that I can add a delimiting line after the dataframe and before the percentages while displaying? I looked in PrettyTables.jl but couldn’t find it immediately :sweat_smile:

Try this

using PrettyTables

hlines = [0, 1, 11]  # Top border, header separator, and before last row

pretty_table(df_with_summary, 
    show_row_number = false,
    hlines = hlines,
    header_crayon = crayon"bold",
)

Thank you so much, that works perfectly!