[ANN] (Belatedly) Announcing Tidier.jl

Great question. There is no specific origin other than the fact that tildes aren’t commonly used in data transformation but are easy to type. The $ won’t work because we need to modify the expression after interpolating it. Because $ is evaluated eagerly, by the time we see the expression we can no longer tell which functions were marked with it, so we can’t treat them differently in the auto-vectorization process.

You can think of the tilde as a flag to denote which functions not to vectorize. Once we complete the auto-vectorization step, we remove the flag.
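
To make the flag idea concrete, here is a minimal sketch in plain Julia. It is not TidierData.jl's actual implementation: autovec is a hypothetical helper made up for this illustration, and the real macros handle many more cases. It walks an expression, rewrites every function call as a broadcast call, leaves calls marked with ~ alone, and drops the tilde once it has served its purpose.

```julia
# Hypothetical sketch only; not TidierData.jl's real auto-vectorization code.
function autovec(ex)
    ex isa Expr || return ex   # symbols and literals pass through unchanged
    if ex.head == :call && length(ex.args) == 2 && ex.args[1] == :~
        # `~f(x)` parses as a call to `~`: drop the flag, keep f(x) unvectorized
        return ex.args[2]
    elseif ex.head == :call
        # any other call f(a, b) is rewritten to the broadcast form f.(a, b)
        args = map(autovec, ex.args[2:end])
        return Expr(:., ex.args[1], Expr(:tuple, args...))
    else
        # other expression types: just recurse into the children
        return Expr(ex.head, map(autovec, ex.args)...)
    end
end

x = [1.0, 2.0, 5.0]
ex = autovec(:(x / ~sum(x)))   # division is broadcast; sum(x) stays a plain call
shares = eval(ex)              # each element of x divided by the total of x
```

The tilde survives until the rewrite step because it is ordinary syntax inside the quoted expression. A value spliced in with $ would already have been substituted by then, leaving nothing to detect.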

Thanks - so basically $ was taken. It’s easier to remember when understanding the rationale behind it.


Summer Happenings for Tidier.jl

  • Introducing TidierIteration.jl
  • What’s new in TidierData.jl, TidierDB.jl, TidierFiles.jl, and TidierCats.jl

Announcing TidierIteration.jl (v1.0.0)

TidierIteration.jl is a package aimed at making it easier to iterate over collections, modeled after the purrr R package. It also provides some functional programming tools: adverbs, composition, safe functions, and more.

Here’s a list of the supported functions. map_* functions apply a function to a collection and return a collection. walk_* functions work similarly to map_* but do not return anything; they are primarily intended for functions that produce side effects only (e.g., saving output to files). modify_* functions update collections in place, and flatten_* functions convert ragged collections (e.g., JSON-style data) into non-ragged collections.
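
For readers coming from Base Julia, the first three families roughly mirror built-ins. The sketch below uses only Base functions (map, foreach, map!), not TidierIteration.jl's own API, to show the three behaviors side by side: returning a new collection, running purely for side effects, and updating in place.

```julia
xs = [1, 2, 3]

# map-style: returns a new collection and leaves xs untouched
squares = map(x -> x^2, xs)

# walk-style: called only for its side effect; foreach itself returns nothing
seen = Int[]
ret = foreach(x -> push!(seen, x), xs)

# modify!-style: overwrites the elements of xs in place
map!(x -> 10x, xs, xs)
```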

Map

map_tidy, map_values (for iterating on Dict values), map_keys (for iterating on Dict keys), map_dfr, map_dfc, map2, imap, pmap

Walk

walk, walk2, iwalk, pwalk

Modify

modify, modify!, modify_values!, modify_if, modify_if!

Keep, Discard, and Compact

keep, keep!, keep_keys, discard, discard!, compact, compact!

Predicates

is_empty, is_non_empty, every, some, none, detect_index, detect, has_element, has_key, get_value

Adverbs

compose, compose_n, negate, possibly

Flatten

flatten, flatten_n, flatten_dfr, flatten_json, flatten_dfr_json, json_string, to_json

Why use TidierIteration.jl when Julia already has great iteration capabilities?

  • The collection is always the first argument of the map_* family of functions, which makes the functions easier to use inside of chains/pipes
  • We extend the map_* family to Julia objects which are not mapped by default, like dictionaries, for which we have map_values() and map_keys()
  • We also provide the map2, imap and pmap methods to map over 2 or n elements at the same time
  • We provide the flatten_* functions to tidy wild dictionaries (like JSON responses from APIs) and many adverbs.
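
As a rough illustration of the dictionary and flattening points, here is what the equivalent work looks like in plain Julia. The helpers dict_map_values and flatten_nested are hypothetical stand-ins written for this post, not the package's functions, and the "_" key separator is an arbitrary choice for the sketch.

```julia
# Hypothetical helpers for illustration; these are not TidierIteration.jl functions.

# Apply f to every value of a dictionary, keeping the keys
dict_map_values(d::AbstractDict, f) = Dict(k => f(v) for (k, v) in d)

# Flatten a nested dictionary into a single level, joining keys with "_"
function flatten_nested(d::AbstractDict, prefix = "")
    out = Dict{String,Any}()
    for (k, v) in d
        key = isempty(prefix) ? string(k) : string(prefix, "_", k)
        if v isa AbstractDict
            merge!(out, flatten_nested(v, key))   # recurse into nested levels
        else
            out[key] = v
        end
    end
    return out
end

cents = dict_map_values(Dict("a" => 1, "b" => 2), v -> v * 100)

response = Dict("user" => Dict("id" => 7, "name" => "ann"), "ok" => true)
flat = flatten_nested(response)   # keys become "user_id", "user_name", "ok"
```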

TidierData.jl v0.16.2 released today

The latest version brings in a bugfix and some minor improvements:

  • Bugfix: @slice_min() and @slice_max() respect the n argument
  • Adds @head as a convenience wrapper around @slice_head()
  • Adds the extra argument to @separate() and the remove argument to @unite()

We’ve also added our first round of syntax comparisons to DataFrames.jl for users who go back and forth between the two packages: Comparison to DF.jl - TidierData.jl (tidierorg.github.io)

There are a number of TidierData.jl “features” we don’t currently highlight on the comparisons, so stay tuned for further expansion of this page.

TidierDB.jl is now up to v0.3.3 and gained a number of improvements over the summer

  • The package is much lighter and relies on package extensions for:
    • Postgres, ClickHouse, MySQL, MsSQL, SQLite, Oracle, Athena, and Google BigQuery
    • Documentation (Getting Started - TidierDB.jl) updated for using these backends.
  • adds support for reading from multiple files at once as a vector of paths in db_table when using DuckDB
    • i.e., db_table(db, ["path1", "path2"])
  • adds streaming support when using DuckDB with @collect(stream = true)
  • allows user to customize file reading via db_table(db, "read_*(path, args)") when using DuckDB
  • adds @head for limiting number of collected rows
  • adds support for reading URLs in db_table with ClickHouse
  • adds support for reading from multiple files at once as a vector of urls in db_table when using ClickHouse
    • i.e., db_table(db, ["url1", "url2"])
  • Bugfix: @count updates metadata
  • adds connect() support for Microsoft SQL Server
  • adds show_tables for most backends to view existing tables
  • Docs comparing TidierDB to Python’s Ibis: TidierDB.jl vs Ibis - TidierDB.jl
  • Docs around working with larger than RAM data: Working With Larger than RAM Datasets - TidierDB.jl

TidierFiles.jl v0.1.4 introduces a general file reader/writer function

Inspired by FileIO.jl and the rio R package, TidierFiles now includes read_file() and write_file() functions that work across all tabular file types supported by the package. This gives you a single, consistent interface (same arguments) across the file types below, each of which previously required its own bespoke function:

  • read_csv and write_csv
  • read_tsv and write_tsv
  • read_xlsx and write_xlsx
  • read_delim and write_delim
  • read_table and write_table
  • read_fwf
  • read_sav and write_sav (.sav and .por)
  • read_sas and write_sas (.sas7bdat and .xpt)
  • read_dta and write_dta (.dta)
  • read_arrow and write_arrow
  • read_parquet and write_parquet
  • read_rdata (.rdata and .rds)
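
One way to picture a general reader is as a lookup from file extension to the matching bespoke function. The sketch below is an assumption about the idea, not TidierFiles.jl's actual internals: READERS and pick_reader are hypothetical names, and only the formats whose extensions are listed above (plus the obvious .csv, .tsv, .xlsx, and .parquet) are included.

```julia
# Hypothetical sketch of extension-based dispatch; TidierFiles.jl's internals may differ.
const READERS = Dict(
    ".csv" => :read_csv, ".tsv" => :read_tsv, ".xlsx" => :read_xlsx,
    ".sav" => :read_sav, ".por" => :read_sav,
    ".sas7bdat" => :read_sas, ".xpt" => :read_sas,
    ".dta" => :read_dta, ".parquet" => :read_parquet,
    ".rdata" => :read_rdata, ".rds" => :read_rdata,
)

# Return the name of the bespoke reader that would handle `path`
function pick_reader(path::AbstractString)
    ext = lowercase(splitext(path)[2])   # e.g. ".SAV" becomes ".sav"
    haskey(READERS, ext) || error("unsupported file extension: $ext")
    return READERS[ext]
end

reader = pick_reader("data/survey.SAV")
```

Formats without a distinctive extension, such as delimited or fixed-width text, would need an explicit hint from the caller in a scheme like this.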

TidierCats.jl v0.1.2 was released last week

It adds 3 new functions for working with categorical variables:

  • cat_replace_missing: Replaces missing values in a categorical array with a specified replacement level.
  • cat_other: Replaces selected levels in a categorical array with the ‘other’ level.
  • cat_recode: Recodes the levels in a categorical array based on a provided mapping.

It’s been a busy summer for Tidier! We are continuing to work on packages across our ecosystem and welcome users and contributors.

(Sharing on behalf of the Tidier team)
