[ANN] DataSkimmer.jl - Summarize tabular data in the REPL

Hasnep · March 26, 2021, 4:57pm

DataSkimmer.jl exposes a function skim() which prints summary statistics in the REPL. It was inspired by the output of the skimr R package.

The goal is to be able to summarise any Tables.jl compatible table, so if you want to help, try running skim() on any tables you have and send me examples where it breaks.

Here’s an example using the iris dataset:

# Load some data
using RDatasets
iris = RDatasets.dataset("datasets", "iris")

# Skim the data
using DataSkimmer
skim(iris)

┌─────────────────────┬───────────┐
│                Type │ DataFrame │
│             N. rows │       150 │
│             N. cols │         5 │
│     N. numeric cols │         4 │
│ N. categorical cols │         1 │
│    N. datetime cols │         0 │
└─────────────────────┴───────────┘

4 numeric columns
┌─────────────┬─────────┬──────────┬──────────┬──────┬──────┬──────┬──────┬──────┬───────┐
│        Name │    Type │ Missings │ Complete │ Mean │ Std. │ Min. │ Med. │ Max. │ Hist. │
├─────────────┼─────────┼──────────┼──────────┼──────┼──────┼──────┼──────┼──────┼───────┤
│ SepalLength │ Float64 │        0 │   100.0% │ 5.84 │ 0.83 │  4.3 │  5.8 │  7.9 │ ▂▃▃▂▁ │
│  SepalWidth │ Float64 │        0 │   100.0% │ 3.06 │ 0.44 │  2.0 │  3.0 │  4.4 │ ▁▃▄▂▁ │
│ PetalLength │ Float64 │        0 │   100.0% │ 3.76 │ 1.77 │  1.0 │ 4.35 │  6.9 │ ▃▁▂▃▁ │
│  PetalWidth │ Float64 │        0 │   100.0% │  1.2 │ 0.76 │  0.1 │  1.3 │  2.5 │ ▃▁▃▂▂ │
└─────────────┴─────────┴──────────┴──────────┴──────┴──────┴──────┴──────┴──────┴───────┘

1 categorical column
┌─────────┬────────────────────────────────┬──────────┬──────────┐
│    Name │                           Type │ Missings │ Complete │
├─────────┼────────────────────────────────┼──────────┼──────────┤
│ Species │ CategoricalValue{String,UInt8} │        0 │   100.0% │
└─────────┴────────────────────────────────┴──────────┴──────────┘

No datetime columns

tbeason · March 26, 2021, 5:06pm

Looks cool. Any plans for customization?

jzr · March 27, 2021, 4:32am

Categorical summary could show the number of categories.
“No datetime columns” seems unnecessary. There are lots of column types that aren’t present.

oschulz · March 27, 2021, 8:05am

Would it be possible to support StructArrays v0.5? DataSkimmer currently causes some package downgrades - but I like it a lot!

Hasnep · March 27, 2021, 11:54am

I’d considered it, but I’m not sure what options would be useful. Do you have any suggestions?

Hasnep · March 27, 2021, 11:55am

Thank you for your feedback. I agree with both of the points you made. The “No datetime columns” text was something I added when debugging and forgot to remove.

rafael.guerra · March 27, 2021, 12:19pm

@Hasnep, thank you for such a nice and extremely useful package.

Just a couple of comments, if you will:

For the numerical entries, the number of digits printed could be made more consistent (see example below). Maybe this could be a customizable parameter in a future version.
The column “Missings” might be called “Missing”.

In case you want to reproduce the above, the Excel sheet was downloaded from the Microsoft site, as in the working example below.

using XLSX, DataFrames
url = "https://download.microsoft.com/download/1/4/E/14EDED28-6C58-4055-A65C-23B4DA81C4DE/Financial%20Sample.xlsx"
download(url, "Sample_data2.xlsx")
df = DataFrame(XLSX.readtable("Sample_data2.xlsx", "Sheet1", infer_eltypes=true)...)

using DataSkimmer
skim(df)

PS: I do not know why Discourse does not respect the REPL output formatting…

pdeffebach · March 27, 2021, 6:41pm

It looks like it doesn’t work with NamedTuples of Vectors or Vectors of NamedTuples, the most “bare-bones” table types. Would you mind a PR to help implement that?

rafael.guerra · March 27, 2021, 7:28pm

One can always convert those to a table/df first, and then skim them:

using DataSkimmer, DataFrames
nt = (a=1,b="hello")    # named tuple
df = DataFrame((nt,))
skim(df)
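
As a rough sketch of the same workaround for the two table types mentioned above, both a NamedTuple of Vectors and a Vector of NamedTuples can be passed to the DataFrame constructor. The sample data and variable names here are made up for illustration:

```julia
using DataFrames, Tables

# Made-up sample data, for illustration only
col_table = (a = [1, 2, 3], b = ["x", "y", "z"])    # NamedTuple of Vectors
row_table = [(a = 1, b = "x"), (a = 2, b = "y")]     # Vector of NamedTuples

# Tables.jl recognises both as tables
@assert Tables.istable(col_table)
@assert Tables.istable(row_table)

# Convert each one to a DataFrame
df_cols = DataFrame(col_table)
df_rows = DataFrame(row_table)

# skim(df_cols) and skim(df_rows) can then be called as in the example above
```

The conversion copies the data into a DataFrame, so it is a workaround and not a substitute for direct support.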

Tamas_Papp · March 28, 2021, 7:09am

Sure, but the whole point of

is to make these conversions unnecessary.
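
The announcement states that the goal is to summarise any Tables.jl compatible table. A minimal sketch of a summary written only against the Tables.jl interface might look like the following. `count_missings` is a hypothetical helper invented for this illustration; it is not part of DataSkimmer.jl and says nothing about how `skim()` is implemented:

```julia
using Tables

# Hypothetical helper (not part of DataSkimmer.jl): count missing values
# per column for any input that implements the Tables.jl interface.
function count_missings(tbl)
    Tables.istable(tbl) || throw(ArgumentError("input is not a Tables.jl table"))
    cols = Tables.columns(tbl)    # column-oriented view of the table
    return Dict(name => count(ismissing, Tables.getcolumn(cols, name))
                for name in Tables.columnnames(cols))
end

# Made-up sample data, for illustration only
col_table = (a = [1, missing, 3], b = ["x", "y", "z"])    # NamedTuple of Vectors
row_table = [(a = 1, b = "x"), (a = 2, b = "y")]           # Vector of NamedTuples

count_missings(col_table)    # one missing value in :a, none in :b
count_missings(row_table)    # no missing values
```

Because the function only uses `Tables.columns`, `Tables.columnnames` and `Tables.getcolumn`, the caller never has to build a DataFrame first.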

Hasnep · March 28, 2021, 7:01pm

@oschulz

Would it be possible to support StructArrays v0.5?

@rafael.guerra

For the numerical entries, the number of digits printed could be made more consistent (see example below). Maybe this could be a customizable parameter in a future version.

I have released version 0.2.0, which should address both of these. Thanks for your suggestions.

@pdeffebach

Would you mind a PR to help implement that?

I’d love a contribution! I started a branch here where I added some unit tests for those cases. You can use that as a starting point for a PR or make your own.

rafael.guerra · March 28, 2021, 7:26pm

@Hasnep, thank you.
In the new version, the numeric data table above looks prettier:

matthieu · April 12, 2021, 9:54pm

Cool package! TablesSkimmer may be a better name though.

Hasnep · April 18, 2021, 8:29pm

I come from an R background, so I was thinking in terms of dataframes when I named it. That’s a better name, but my impression is that renaming packages in Julia is a bit inconvenient.
