[ANN] DataFramesMeta.jl v0.15.0 Release

DataFramesMeta 0.15.0 Announcement

I’m happy to announce the version 0.15.0 release of DataFramesMeta.jl! We’ve added three new features that users
will like.

  1. Multi-column selection in @select
  2. The @groupby macro for easier grouping syntax
  3. Column label and note creation with @label! and @note!

Multi-column selection.

using DataFramesMeta, CSV, Statistics, Downloads

First, let’s download the starwars dataset

url = "https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv"
starwars = CSV.read(Downloads.download(url), DataFrame; missingstring = "NA")
10×14 DataFrame
 Row │ name                height  mass      hair_color     skin_color    ⋯
     │ String31            Int64?  Float64?  String15?      String31      ⋯
─────┼─────────────────────────────────────────────────────────────────────
   1 │ Luke Skywalker         172      77.0  blond          fair          ⋯
   2 │ C-3PO                  167      75.0  missing        gold
   3 │ R2-D2                   96      32.0  missing        white, blue
   4 │ Darth Vader            202     136.0  none           white
   5 │ Leia Organa            150      49.0  brown          light         ⋯
   6 │ Owen Lars              178     120.0  brown, grey    light
   7 │ Beru Whitesun Lars     165      75.0  brown          light
   8 │ R5-D4                   97      32.0  missing        white, red
   9 │ Biggs Darklighter      183      84.0  black          light         ⋯
  10 │ Obi-Wan Kenobi         182      77.0  auburn, white  fair

Select the :name column

@select starwars :name

Select all columns between :name and :mass (inclusive)

@select starwars Between(:name, :mass)

Select columns whose names start with the letter “h”
(Some knowledge of regular expressions is required)

@select starwars Cols(r"^h")

Select columns whose names start with “h” OR start with “n”

@select starwars Cols(r"^h", r"^n")

Select columns whose names start with “h” AND end with “t”

@select starwars Cols(r"^h", r"t$"; operator = intersect)

Select all numeric columns. This requires escaping the expression with $. (This was possible before, but I’m showing it for completeness.)

@select starwars $(names(starwars, Union{Real, Missing}))
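The selectors above are ordinary DataFrames.jl column selectors, so one way to check what they match is to pass them to names. Here is a minimal sketch on a small stand-in table; toy is a hypothetical name, and its two rows are copied from the first rows of the starwars output shown earlier.

```julia
using DataFrames

# Hypothetical stand-in for a few starwars columns (rows taken from the output above)
toy = DataFrame(
    name = ["Luke Skywalker", "C-3PO"],
    height = [172, 167],
    mass = [77.0, 75.0],
    hair_color = ["blond", missing],
    skin_color = ["fair", "gold"],
)

# Each call returns the column names the selector would pick
between_cols = names(toy, Between(:name, :mass))                 # name through mass
h_cols = names(toy, Cols(r"^h"))                                 # names starting with "h"
h_or_n = names(toy, Cols(r"^h", r"^n"))                          # union of both patterns
h_and_t = names(toy, Cols(r"^h", r"t$"; operator = intersect))   # must match both patterns
numeric = names(toy, Union{Real, Missing})                       # columns with numeric element type
```

Checking a selector with names before using it in @select can be handy on wide tables, since it shows the match without building a new data frame.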

@groupby macro

The @groupby macro is a thin wrapper around DataFrames.jl’s
groupby. It simply provides a way to avoid writing parentheses and
brackets.

(It’s also nicer to see the @ in a block of transformations.)

df = DataFrame(a = [1, 1, 2, 2], b = [100, 200, 50, 50]);

@chain df begin
    @groupby :a
    @transform :mean_b = mean(:b)
end

@chain df begin
    @groupby :a :b
    @transform :ngroup = length($1)
end
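For comparison, here is a rough sketch of the same two pipelines written with plain DataFrames.jl functions. The variable names (by_a, with_mean, by_ab, with_count) are only illustrative, and this is meant as an approximate equivalent, not a description of how the macros expand.

```julia
using DataFrames, Statistics

df = DataFrame(a = [1, 1, 2, 2], b = [100, 200, 50, 50])

# Counterpart of `@groupby :a` followed by `@transform :mean_b = mean(:b)`
by_a = groupby(df, :a)
with_mean = transform(by_a, :b => mean => :mean_b)

# Counterpart of `@groupby :a :b`: several grouping columns go in a vector
by_ab = groupby(df, [:a, :b])
# `$1` in the macro version refers to the first column, which is :a here
with_count = transform(by_ab, :a => length => :ngroup)
```

The parentheses and brackets in groupby(df, [:a, :b]) are what @groupby lets you leave out.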

Metadata

DataFrames.jl, in conjunction with DataAPI.jl and TableMetadataTools.jl,
supports metadata: descriptive information attached to a DataFrame and its columns.
DataFramesMeta.jl provides an opinionated format for adding labels
and notes to data frames.

The @label! and @note! macros are thin wrappers around TableMetadataTools.jl’s
label! and note! functions.

Add labels with @label!

df = DataFrame(wage = [16, 25, 14, 23]);
@label! df :wage = "Wage (2015 USD)"

Add notes with @note!

@note! df begin
    :wage = "Hourly wage from 2015 American Community Survey (ACS)"
    :wage = "Missing values have been dropped"
end

DataFramesMeta.jl also provides printlabels and printnotes
for pretty-printing of metadata.

Print a table of all columns and the labels attached to them

printlabels(df)
┌────────┬─────────────────┐
│ Column │           Label │
├────────┼─────────────────┤
│   wage │ Wage (2015 USD) │
└────────┴─────────────────┘

Print all notes, along with the labels

printnotes(df)
Column: wage
────────────
Label: Wage (2015 USD)
Hourly wage from 2015 American Community Survey (ACS)
Missing values have been dropped
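Labels and notes live in the column-level metadata that DataFrames.jl exposes through colmetadata! and colmetadata. The sketch below uses only those DataFrames.jl functions. The key names "label" and "note" are an assumption about the storage convention, so check the TableMetadataTools.jl documentation before relying on them.

```julia
using DataFrames

df = DataFrame(wage = [16, 25, 14, 23])

# Assumption: "label" and "note" are the metadata keys used by convention
colmetadata!(df, :wage, "label", "Wage (2015 USD)"; style = :note)
colmetadata!(df, :wage, "note",
             "Hourly wage from 2015 American Community Survey (ACS)"; style = :note)

# Read a single metadata value back, and list all keys attached to the column
wage_label = colmetadata(df, :wage, "label")
wage_keys = sort(collect(colmetadatakeys(df, :wage)))
```

In everyday use the macros are more convenient. The lower-level functions are mainly useful for inspecting what is stored or for writing your own tooling.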

Pro-tip: use TerminalPager.jl’s @stdout_to_pager macro to print the notes of a data frame. Then you can search all the column names, labels, and notes in your data set at once.
