[ANN] DataSkimmer.jl - Summarize tabular data in the REPL

DataSkimmer.jl exposes a function skim() which prints summary statistics in the REPL. It was inspired by the output of the skimr R package.

The goal is to be able to summarise any Tables.jl compatible table, so if you want to help, try running skim() on any tables you have and send me examples where it breaks. :sweat_smile:

Here’s an example using the iris dataset:

# Load some data
using RDatasets
iris = RDatasets.dataset("datasets", "iris")

# Skim the data
using DataSkimmer
skim(iris)
┌─────────────────────┬───────────┐
│                Type │ DataFrame │
│             N. rows │       150 │
│             N. cols │         5 │
│     N. numeric cols │         4 │
│ N. categorical cols │         1 │
│    N. datetime cols │         0 │
└─────────────────────┴───────────┘

4 numeric columns
┌─────────────┬─────────┬──────────┬──────────┬──────┬──────┬──────┬──────┬──────┬───────┐
│        Name │    Type │ Missings │ Complete │ Mean │ Std. │ Min. │ Med. │ Max. │ Hist. │
├─────────────┼─────────┼──────────┼──────────┼──────┼──────┼──────┼──────┼──────┼───────┤
│ SepalLength │ Float64 │        0 │   100.0% │ 5.84 │ 0.83 │  4.3 │  5.8 │  7.9 │ ▂▃▃▂▁ │
│  SepalWidth │ Float64 │        0 │   100.0% │ 3.06 │ 0.44 │  2.0 │  3.0 │  4.4 │ ▁▃▄▂▁ │
│ PetalLength │ Float64 │        0 │   100.0% │ 3.76 │ 1.77 │  1.0 │ 4.35 │  6.9 │ ▃▁▂▃▁ │
│  PetalWidth │ Float64 │        0 │   100.0% │  1.2 │ 0.76 │  0.1 │  1.3 │  2.5 │ ▃▁▃▂▂ │
└─────────────┴─────────┴──────────┴──────────┴──────┴──────┴──────┴──────┴──────┴───────┘

1 categorical column
┌─────────┬────────────────────────────────┬──────────┬──────────┐
│    Name │                           Type │ Missings │ Complete │
├─────────┼────────────────────────────────┼──────────┼──────────┤
│ Species │ CategoricalValue{String,UInt8} │        0 │   100.0% │
└─────────┴────────────────────────────────┴──────────┴──────────┘

No datetime columns

Looks cool. Any plans for customization?

  • Categorical summary could show the number of categories.
  • “No datetime columns” seems unnecessary. There are lots of column types that aren’t present.
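
To make the first suggestion concrete, here is a rough sketch of the extra figure a categorical summary could report. It uses only Base Julia; the `species` vector and the variable names are made up for illustration and are not part of DataSkimmer's API.

```julia
# Hypothetical categorical column with one missing entry (illustration only)
species = ["setosa", "versicolor", missing, "virginica", "setosa"]

# Number of missing entries, as the existing "Missings" column reports
n_missing = count(ismissing, species)

# Number of distinct categories, ignoring missing values
n_categories = length(unique(skipmissing(species)))

println("Missings: ", n_missing, ", categories: ", n_categories)
```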

Would it be possible to support StructArrays v0.5? DataSkimmer currently causes some package downgrades - but I like it a lot!

I’d considered it, but I’m not sure what options would be useful. Do you have any suggestions?

Thank you for your feedback, I agree with both of the points you made. The “No datetime columns” text was something I added when debugging and forgot to remove.

@Hasnep, thank you for such a nice and extremely useful package.

Just a couple of comments, if you will:

  • for the numerical entries, the number of digits printed could improve a bit further in terms of consistency? (see example below). Maybe a customizable parameter in a future version.
  • The column “Missings” might be called “Missing”
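
As a sketch of what “consistent digits” could mean in practice, here are two common options using only the Printf standard library and Base `round`. The sample numbers and variable names are invented for illustration; this is not how DataSkimmer formats its output.

```julia
using Printf

# Made-up sample of values with very different magnitudes
xs = [5.84333, 1.2, 4.35, 118726.35]

# Option 1: a fixed number of decimals, so every entry shows exactly two
fixed = [@sprintf("%.2f", x) for x in xs]

# Option 2: a fixed number of significant digits via Base.round
sig = [round(x; sigdigits = 3) for x in xs]

println(fixed)
println(sig)
```

Fixed decimals keep the column visually aligned, while significant digits keep the precision comparable across columns with very different scales.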

In case you want to reproduce the above, the Excel sheet was downloaded from the Microsoft site, as in the working example below.

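using XLSX, DataFrames

# Download Microsoft's "Financial Sample" workbook
url = "https://download.microsoft.com/download/1/4/E/14EDED28-6C58-4055-A65C-23B4DA81C4DE/Financial%20Sample.xlsx"
download(url, "Sample_data2.xlsx")

# Read "Sheet1" into a DataFrame, inferring column element types.
# Note: the splat matches XLSX.jl versions where readtable returns a tuple; newer versions may not need it.
df = DataFrame(XLSX.readtable("Sample_data2.xlsx", "Sheet1", infer_eltypes=true)...)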

using DataSkimmer
skim(df)

PS: I do not know why Discourse does not respect the REPL output formatting…


It looks like it doesn’t work with NamedTuples of Vectors or Vectors of NamedTuples, the most “bare-bones” table types. Would you mind a PR to help implement that?
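
For reference, a minimal sketch of those two table types, assuming only Tables.jl is installed. The data and variable names are invented for illustration; nothing here calls DataSkimmer itself.

```julia
using Tables

# A NamedTuple of Vectors: the "column table" form
coltable = (a = [1, 2, 3], b = ["x", "y", "z"])

# A Vector of NamedTuples: the "row table" form
rowtable = [(a = 1, b = "x"), (a = 2, b = "y"), (a = 3, b = "z")]

# Both satisfy the Tables.jl interface without any wrapper type
is_col_table = Tables.istable(coltable)
is_row_table = Tables.istable(rowtable)

# Each form converts to the other through Tables.jl
as_rows = Tables.rowtable(coltable)
as_cols = Tables.columntable(rowtable)

println(is_col_table && is_row_table)
```

Since both forms report `Tables.istable` as true, a function written purely against the Tables.jl interface should in principle be able to accept them directly.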


One can always convert those to a table/df first, and then skim them:

using DataSkimmer, DataFrames
nt = (a=1,b="hello")    # named tuple
df = DataFrame((nt,))
skim(df)


Sure, but the whole point of

is to make these conversions unnecessary.


@oschulz

Would it be possible to support StructArrays v0.5?

@rafael.guerra

for the numerical entries, the number of digits printed could improve a bit further in terms of consistency? (see example below). Maybe a customizable parameter in a future version.

I have released version 0.2.0, which should address both of these. Thanks for your suggestions.

@pdeffebach

Would you mind a PR to help implement that?

I’d love a contribution! I started a branch here where I added some unit tests for those cases. You can use that as a starting point for a PR or make your own.


@Hasnep, thank you.
In the new version, the numeric data table above looks prettier:


Cool package! TablesSkimmer may be a better name though.

I come from an R background, so I was thinking in terms of dataframes when I named it. That’s a better name, but my impression is that renaming packages in Julia is a bit inconvenient. :confused: