Pretty print summary statistics for a dataframe

I’m looking to print out a summary of missing values in each row and column for a dataframe, but ideally I want to print it alongside the actual dataframe itself – in my head it looks something like this (pardon the crude illustration, but I couldn’t find something that showed what I wanted offhand).

The specific problem I have here is twofold – one, I need a way for the added information to stand out, but that I think I can remedy using PrettyTables.jl. The real problem is that I can’t figure out a way to stack the values together when the types in the dataframe don’t match that of the missing value percentages. Is there any way in which I can do this?

To get started, consider this approach, which can be further decorated using pretty_table

julia> # Create DataFrame from random data with random missing values (not shown)
       df = DataFrame(data, :auto)

       # Calculate missing percentages for each column
10×10 DataFrame
 Row │ x1              x2          x3         x4              x5              x6         x7          x8              x9     ⋯
     │ Float64?        Float64?    Float64?   Float64?        Float64?        Float64?   Float64?    Float64?        Float6 ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       0.808288   1.21928     2.00811         1.90263   missing         -1.38289   -2.52561          0.18364          ⋯
   2 │ missing          0.292914    0.76503   missing              -1.53561   -0.692244   0.850062        -0.967371
   3 │      -1.10464   -0.0311481   0.180254        0.364655       -0.396988  -0.956001   0.627024         0.29823        -
   4 │ missing          0.315833    2.02891         0.785914        1.20248   -1.78927   -0.061257        -0.164931
   5 │       0.287588  -2.16238    -0.127637        0.170638       -0.720676   0.161147   0.267154        -1.41293        - ⋯
   6 │       0.229819  -0.890077    0.469477       -1.28902        -1.10708    0.758747  -1.46561          1.14809        -
   7 │      -0.421769   0.854242   -1.52688         0.937057       -1.26622    0.525484   0.696236        -0.27827        -
   8 │      -1.35559    0.341782    0.600132        0.508086       -0.722404  -0.904278  -0.224961         0.336889  missin
   9 │ missing         -0.31887    -2.05386        -2.24974         0.523905   0.52406    0.0893458  missing         missin ⋯
  10 │      -0.117323  -0.337454   -0.828057        0.125302        1.25942    0.747131   0.921248         0.624411
                                                                                                            2 columns omitted

julia> col_missing_pct = [sum(ismissing.(df[:, col])) / nrow(df) * 100 for col in names(df)]

       # Calculate missing percentages for each row
10-element Vector{Float64}:
 30.0
  0.0
  0.0
 10.0
 10.0
  0.0
  0.0
 10.0
 20.0
 20.0

julia> row_missing_pct = [sum(ismissing.(Vector(df[row, :]))) / ncol(df) * 100 for row in 1:nrow(df)]

       # Add column with row percentages
10-element Vector{Float64}:
 20.0
 30.0
  0.0
 10.0
  0.0
  0.0
  0.0
 10.0
 30.0
  0.0

julia> df.row_missing_pct = row_missing_pct

       # Create a summary row with column percentages
10-element Vector{Float64}:
 20.0
 30.0
  0.0
 10.0
  0.0
  0.0
  0.0
 10.0
 30.0
  0.0

julia> summary_row = DataFrame()
0×0 DataFrame

julia> for col in names(df)[1:end-1]  # Exclude the row_missing_pct column
           summary_row[!, col] = [col_missing_pct[findfirst(==(col), names(df)[1:end-1])]]
       end

julia> summary_row.row_missing_pct = [missing]  # No percentage for the percentage column itself

       # Combine original DataFrame with summary row
1-element Vector{Missing}:
 missing

julia> df_with_summary = vcat(df, summary_row)

       # Rename the last row for clarity
11×11 DataFrame
 Row │ x1              x2          x3         x4              x5              x6         x7          x8              x9     ⋯
     │ Float64?        Float64?    Float64?   Float64?        Float64?        Float64?   Float64?    Float64?        Float6 ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       0.808288   1.21928     2.00811         1.90263   missing         -1.38289   -2.52561          0.18364          ⋯
   2 │ missing          0.292914    0.76503   missing              -1.53561   -0.692244   0.850062        -0.967371
   3 │      -1.10464   -0.0311481   0.180254        0.364655       -0.396988  -0.956001   0.627024         0.29823        -
   4 │ missing          0.315833    2.02891         0.785914        1.20248   -1.78927   -0.061257        -0.164931
   5 │       0.287588  -2.16238    -0.127637        0.170638       -0.720676   0.161147   0.267154        -1.41293        - ⋯
  ⋮  │       ⋮             ⋮           ⋮            ⋮               ⋮             ⋮          ⋮             ⋮                ⋱
   8 │      -1.35559    0.341782    0.600132        0.508086       -0.722404  -0.904278  -0.224961         0.336889  missin
   9 │ missing         -0.31887    -2.05386        -2.24974         0.523905   0.52406    0.0893458  missing         missin
  10 │      -0.117323  -0.337454   -0.828057        0.125302        1.25942    0.747131   0.921248         0.624411
  11 │      30.0        0.0         0.0            10.0            10.0        0.0        0.0             10.0            2 ⋯
                                                                                                 3 columns and 2 rows omitted

julia> rename!(df_with_summary, :row_missing_pct => :row_missing_pct)

       # Display the result
11×11 DataFrame
 Row │ x1              x2          x3         x4              x5              x6         x7          x8              x9     ⋯
     │ Float64?        Float64?    Float64?   Float64?        Float64?        Float64?   Float64?    Float64?        Float6 ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       0.808288   1.21928     2.00811         1.90263   missing         -1.38289   -2.52561          0.18364          ⋯
   2 │ missing          0.292914    0.76503   missing              -1.53561   -0.692244   0.850062        -0.967371
   3 │      -1.10464   -0.0311481   0.180254        0.364655       -0.396988  -0.956001   0.627024         0.29823        -
   4 │ missing          0.315833    2.02891         0.785914        1.20248   -1.78927   -0.061257        -0.164931
   5 │       0.287588  -2.16238    -0.127637        0.170638       -0.720676   0.161147   0.267154        -1.41293        - ⋯
  ⋮  │       ⋮             ⋮           ⋮            ⋮               ⋮             ⋮          ⋮             ⋮                ⋱
   8 │      -1.35559    0.341782    0.600132        0.508086       -0.722404  -0.904278  -0.224961         0.336889  missin
   9 │ missing         -0.31887    -2.05386        -2.24974         0.523905   0.52406    0.0893458  missing         missin
  10 │      -0.117323  -0.337454   -0.828057        0.125302        1.25942    0.747131   0.921248         0.624411
  11 │      30.0        0.0         0.0            10.0            10.0        0.0        0.0             10.0            2 ⋯
                                                                                                 3 columns and 2 rows omitted

julia> println("DataFrame with 10% missing values:")
DataFrame with 10% missing values:

julia> println("(Last column shows % missing per row, last row shows % missing per column)")
(Last column shows % missing per row, last row shows % missing per column)

julia> println(df_with_summary)

       # Check overall percentage of missing values
11×11 DataFrame
 Row │ x1              x2          x3         x4              x5              x6         x7          x8              x9              x10             row_missing_pct 
     │ Float64?        Float64?    Float64?   Float64?        Float64?        Float64?   Float64?    Float64?        Float64?        Float64?        Float64?        
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       0.808288   1.21928     2.00811         1.90263   missing         -1.38289   -2.52561          0.18364         0.566026  missing                    20.0
   2 │ missing          0.292914    0.76503   missing              -1.53561   -0.692244   0.850062        -0.967371        0.615016  missing                    30.0
   3 │      -1.10464   -0.0311481   0.180254        0.364655       -0.396988  -0.956001   0.627024         0.29823        -0.690379        0.840295              0.0
   4 │ missing          0.315833    2.02891         0.785914        1.20248   -1.78927   -0.061257        -0.164931        0.518368        0.379051             10.0
   5 │       0.287588  -2.16238    -0.127637        0.170638       -0.720676   0.161147   0.267154        -1.41293        -1.19626        -0.108625              0.0
   6 │       0.229819  -0.890077    0.469477       -1.28902        -1.10708    0.758747  -1.46561          1.14809        -0.424905       -0.678775              0.0
   7 │      -0.421769   0.854242   -1.52688         0.937057       -1.26622    0.525484   0.696236        -0.27827        -0.220797       -1.08822               0.0
   8 │      -1.35559    0.341782    0.600132        0.508086       -0.722404  -0.904278  -0.224961         0.336889  missing               0.703758             10.0
   9 │ missing         -0.31887    -2.05386        -2.24974         0.523905   0.52406    0.0893458  missing         missing               0.143326             30.0
  10 │      -0.117323  -0.337454   -0.828057        0.125302        1.25942    0.747131   0.921248         0.624411        0.130942        0.148375              0.0
  11 │      30.0        0.0         0.0            10.0            10.0        0.0        0.0             10.0            20.0            20.0             missing   

Since this is a presentation table, rather than a data table, you can deal with the type-mismatch potential for the last row by saving it separately, then convert the rest to strings, convert the last row saved out to strings and vcat.

1 Like

Converting to strings fixed it, thank you! Is there a way that I can add a delimiting line after the dataframe and before the percentages while displaying? I looked in PrettyTables.jl but couldn’t find it immediately :sweat_smile:

Try this

using PrettyTables

hlines = [0, 1, 11]  # Top border, header separator, and before last row

pretty_table(df_with_summary, 
    show_row_number = false,
    hlines = hlines,
    header_crayon = crayon"bold",
)