[ANN] (Belatedly) Announcing Tidier.jl

Great question. There is no specific origin other than the fact that tildes aren’t commonly used in data transformation but are easy to type. The $ won’t work because we need to modify the expression after interpolating it. Since $ is eagerly evaluated, we can’t tell which functions were marked with it in order to handle it differently in the auto-vectorization process.

You can think of the tilde as a flag to denote which functions not to vectorize. Once we complete the auto-vectorization step, we remove the flag.
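
To make the flag idea concrete, here is a toy sketch in plain Julia. It is not Tidier's actual implementation: autovec and @autovec are hypothetical names made up for this illustration, and the real package decides what to vectorize with more rules than this. The sketch only shows the mechanism described above: broadcast every call, skip calls flagged with a tilde, then drop the flag.

```julia
# Toy illustration only (hypothetical helper, not Tidier's code).
# Rewrites f(args...) into f.(args...) unless the call is flagged as ~f(args...).
function autovec(ex)
    ex isa Expr || return ex   # symbols and literals pass through unchanged
    if ex.head == :call && ex.args[1] == :~ && length(ex.args) == 2 &&
       ex.args[2] isa Expr && ex.args[2].head == :call
        # Flagged call: remove the tilde and leave this call un-broadcast.
        inner = ex.args[2]
        return Expr(:call, inner.args[1], map(autovec, inner.args[2:end])...)
    elseif ex.head == :call
        # Unflagged call: turn it into a broadcast call.
        return Expr(:., ex.args[1], Expr(:tuple, map(autovec, ex.args[2:end])...))
    end
    return Expr(ex.head, map(autovec, ex.args)...)
end

# Hypothetical macro wrapper so the rewrite can be applied inline.
macro autovec(ex)
    return esc(autovec(ex))
end

a = [1.0, 2.0, 3.0]
unflagged = @autovec(a / sum(a))    # rewritten to (/).(a, sum.(a))
flagged   = @autovec(a / ~sum(a))   # rewritten to (/).(a, sum(a))
```

Without the flag, sum is broadcast as well, so every element ends up divided by itself. With the flag, sum runs once over the whole vector and only the division is broadcast.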

Thanks - so basically $ was taken. It’s easier to remember when understanding the rationale behind it.


Summer Happenings for Tidier.jl

  • Introducing TidierIteration.jl
  • What’s new in TidierData.jl, TidierDB.jl, TidierFiles.jl, and TidierCats.jl

Announcing TidierIteration.jl (v1.0.0)

TidierIteration.jl is a package that makes it easier to iterate over collections, modeled after the R package purrr. It also provides functional programming tools: adverbs, composition, safe functions, and more.

Here’s a list of the supported functions. map_* functions apply a function to a collection and return a collection. walk_* functions work similarly to map_* but do not return anything; they are intended for functions that are called only for their side effects (e.g., saving output to files). modify_* functions update collections in place, and flatten_* functions convert ragged collections (e.g., JSON-style data) into non-ragged collections.

Map

map_tidy, map_values (for iterating on Dict values), map_keys (for iterating on Dict keys), map_dfr, map_dfc, map2, imap, pmap

Walk

walk, walk2, iwalk, pwalk

Modify

modify, modify!, modify_values!, modify_if, modify_if!

Keep, Discard, and Compact

keep, keep!, keep_keys, discard, discard!, compact, compact!

Predicates

is_empty, is_non_empty, every, some, none, detect_index, detect, has_element, has_key, get_value

Adverbs

compose, compose_n, negate, possibly
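
For readers new to the term, an adverb takes a function and returns a modified function. The sketch below shows the idea using only Base Julia. It does not use or reproduce TidierIteration's API: safely is a hypothetical stand-in for a possibly-style wrapper, and the package's actual signatures may differ.

```julia
add_one(x) = x + 1
double(x) = 2x

# Composition: Base's ∘ applies the right-hand function first.
add_then_double = double ∘ add_one    # x -> double(add_one(x))

# Negation: !f returns a predicate with the opposite result.
is_odd = !iseven

# Hypothetical "possibly"-style adverb: return a fallback value instead of throwing.
function safely(f, otherwise)
    return function (args...)
        try
            return f(args...)
        catch
            return otherwise
        end
    end
end

safe_parse = safely(s -> parse(Int, s), missing)
parsed = map(safe_parse, ["1", "two", "3"])   # "two" cannot be parsed, so it becomes missing
```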

Flatten

flatten, flatten_n, flatten_dfr, flatten_json, flatten_dfr_json, json_string, to_json
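
As a rough picture of what flattening a ragged, JSON-style structure means, here is a Base Julia sketch. The helper flatten_nested and its dot-joined key convention are assumptions made for illustration; the flatten_* functions in TidierIteration.jl may name keys and handle arrays differently.

```julia
# Hypothetical helper: collapse nested Dicts into one level with "a.b" style keys.
function flatten_nested(d::AbstractDict, prefix::String = "")
    out = Dict{String, Any}()
    for (k, v) in d
        key = isempty(prefix) ? string(k) : string(prefix, ".", k)
        if v isa AbstractDict
            merge!(out, flatten_nested(v, key))   # recurse into nested levels
        else
            out[key] = v
        end
    end
    return out
end

response = Dict(
    "id" => 1,
    "user" => Dict("name" => "Ana", "address" => Dict("city" => "Lima")),
)
flat = flatten_nested(response)
```

After flattening, every value sits at the top level, which is the shape a single data frame row needs.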

Why use TidierIteration.jl when Julia already has great iteration capabilities?

  • The collection is always the first argument of the map_* family of functions, which makes the functions easier to use inside of chains/pipes
  • We extend the map_* family to Julia objects which are not mapped by default, like dictionaries, for which we have map_values() and map_keys()
  • We also provide the map2, imap and pmap methods to map over 2 or n elements at the same time
  • We provide the flatten_* functions to tidy wild dictionaries (like JSON responses from APIs) and many adverbs.
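
For comparison, the points above look roughly like this in Base Julia alone. This is only a sketch of the ideas, not TidierIteration code: map_over is a hypothetical collection-first helper, and the comments name the package functions each line loosely corresponds to.

```julia
scores = Dict("ana" => 3, "bo" => 5)

# Mapping over Dict values or keys (the map_values / map_keys idea):
# in Base this is an explicit comprehension over the pairs.
doubled = Dict(k => 2v for (k, v) in scores)
shouted = Dict(uppercase(k) => v for (k, v) in scores)

# Two collections at once (the map2 idea): Base's map accepts several collections.
sums = map(+, [1, 2, 3], [10, 20, 30])

# Element together with its index (the imap idea).
labeled = [string(i, ":", x) for (i, x) in enumerate(["a", "b"])]

# Side effects only (the walk idea): foreach returns nothing.
seen = String[]
result = foreach(x -> push!(seen, x), ["a", "b"])

# Base's map takes the function first; a collection-first wrapper
# (hypothetical name) reads more naturally at the end of a pipe.
map_over(xs, f) = map(f, xs)
squares = [1, 2, 3] |> xs -> map_over(xs, x -> x^2)
```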

TidierData.jl v0.16.2 released today

The latest version brings in a bugfix and some minor improvements:

  • Bugfix: @slice_min() and @slice_max() respect the n argument
  • Adds @head as a convenience wrapper around @slice_head()
  • Adds the extra argument to @separate() and the remove argument to @unite()

We’ve also added our first round of syntax comparisons to DataFrames.jl for users who go back and forth between the two packages: Comparison to DF.jl - TidierData.jl (tidierorg.github.io)

There are a number of TidierData.jl “features” we don’t currently highlight on the comparisons, so stay tuned for further expansion of this page.

TidierDB.jl is now up to v0.3.3 and gained a number of improvements over the summer

  • The package is much lighter and relies on package extensions for:
    • Postgres, ClickHouse, MySQL, MsSQL, SQLite, Oracle, Athena, and Google BigQuery
    • Documentation (Getting Started - TidierDB.jl) updated for using these backends.
  • adds support for reading from multiple files at once as a vector of paths in db_table when using DuckDB
    • i.e., db_table(db, ["path1", "path2"])
  • adds streaming support when using DuckDB with @collect(stream = true)
  • allows user to customize file reading via db_table(db, "read_*(path, args)") when using DuckDB
  • adds @head for limiting number of collected rows
  • adds support for reading URLs in db_table with ClickHouse
  • adds support for reading from multiple files at once as a vector of urls in db_table when using ClickHouse
    • i.e., db_table(db, ["url1", "url2"])
  • Bugfix: @count updates metadata
  • adds connect() support for Microsoft SQL Server
  • adds show_tables for most backends to view existing tables
  • Docs comparing TidierDB to Python’s Ibis: TidierDB.jl vs Ibis - TidierDB.jl
  • Docs around working with larger than RAM data: Working With Larger than RAM Datasets - TidierDB.jl

TidierFiles.jl v0.1.4 introduces a general file reader/writer function

Inspired by FileIO.jl and the rio R package, TidierFiles now includes read_file() and write_file() functions that work across all tabular file types supported by the package. A single function with a consistent interface (same arguments) now covers the following file types, each of which previously required one of the bespoke functions below:

  • read_csv and write_csv
  • read_tsv and write_tsv
  • read_xlsx and write_xlsx
  • read_delim and write_delim
  • read_table and write_table
  • read_fwf
  • read_sav and write_sav (.sav and .por)
  • read_sas and write_sas (.sas7bdat and .xpt)
  • read_dta and write_dta (.dta)
  • read_arrow and write_arrow
  • read_parquet and write_parquet
  • read_rdata (.rdata and .rds)
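
One way to picture a general reader is a lookup from file extension to the format-specific function. The sketch below is an illustration in Base Julia, not how TidierFiles implements read_file(). The table and pick_reader are hypothetical, and the extensions not spelled out in the list above (such as .arrow) are assumptions.

```julia
# Hypothetical lookup table: file extension => name of the reader listed above.
readers = Dict(
    ".csv" => :read_csv, ".tsv" => :read_tsv, ".xlsx" => :read_xlsx,
    ".sav" => :read_sav, ".por" => :read_sav,
    ".sas7bdat" => :read_sas, ".xpt" => :read_sas,
    ".dta" => :read_dta,
    ".arrow" => :read_arrow, ".parquet" => :read_parquet,
    ".rdata" => :read_rdata, ".rds" => :read_rdata,
)

# Hypothetical helper: choose a reader name from the path's extension.
function pick_reader(path::AbstractString, table = readers)
    ext = lowercase(splitext(path)[2])   # ".SAV" and ".sav" are treated alike
    haskey(table, ext) || throw(ArgumentError("unsupported file extension: $ext"))
    return table[ext]
end

chosen = pick_reader("survey.SAV")
```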

TidierCats.jl v0.1.2 was released last week

It adds 3 new functions for working with categorical variables:

  • cat_replace_missing: Replaces missing values in a categorical array with a specified level.
  • cat_other: Replaces selected levels in a categorical array with the ‘other’ level.
  • cat_recode: Recodes the levels in a categorical array based on a provided mapping.

It’s been a busy summer for Tidier! We are continuing to work on packages across our ecosystem and welcome users and contributors.

(Sharing on behalf of the Tidier team)


TidierDB.jl v0.5.1 now includes

  • 65 tests demonstrating identical results between TidierData.jl and TidierDB.jl
  • ability to use TidierDB queries inside other macros, including @mutate, @filter, and @summarize:
      ok = @chain t(df_mem) @summarize(mean = mean(value));
      @eval @chain t(df_mem) begin
         @filter(value > $ok)
         @collect
      end
  • ability to join multi-step TidierDB queries on various tables together, as demonstrated in this DuckDB example recreating a chain with 6 @inner_joins, as well as cross-schema joins
  • ability to create and use SQL views for multiple backends with @create_view
  • ability to create a table from a query on a backend with @compute
  • improved date handling with dmy, ymd, mdy and ability to add intervals
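
For anyone wondering why the query-inside-a-macro example above pairs @eval with $ok: @eval splices the value of ok into the expression before the inner macro is expanded, so the macro receives the value rather than a variable name. The sketch below shows that general Julia mechanism with a hypothetical @capture macro. It says nothing about TidierDB's internals.

```julia
# Hypothetical macro that returns the expression it received, unevaluated.
macro capture(ex)
    return QuoteNode(ex)
end

threshold = 5

# Called directly, the macro sees the `$threshold` placeholder itself.
raw = @capture(value > $threshold)

# Under @eval, `$threshold` is replaced by its value first,
# so the macro sees `value > 5`.
spliced = @eval @capture(value > $threshold)
```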