[ANN] (Belatedly) Announcing Tidier.jl

Great question. There is no specific origin other than the fact that tildes aren’t commonly used in data transformation but are easy to type. The $ won’t work because we need to modify the expression after interpolating it. Since $ is eagerly evaluated, we can’t tell which functions were marked with it in order to handle it differently in the auto-vectorization process.

You can think of the tilde as a flag to denote which functions not to vectorize. Once we complete the auto-vectorization step, we remove the flag.
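
To make the flag idea concrete, here is a minimal sketch in plain Julia. It is not TidierData's actual implementation: `autovec` is a hypothetical helper that turns every call into a broadcast call, except calls prefixed with a tilde, from which it only strips the flag.

```julia
# Hypothetical helper for illustration; not TidierData's implementation.
function autovec(ex)
    ex isa Expr || return ex              # symbols and literals pass through
    if ex.head == :call
        f, args = ex.args[1], ex.args[2:end]
        # `~f(x)` parses as a call to `~` wrapping the inner call `f(x)`.
        if f == :~ && length(args) == 1 && args[1] isa Expr && args[1].head == :call
            inner = args[1]
            # Drop the flag and leave this call un-vectorized.
            return Expr(:call, inner.args[1], map(autovec, inner.args[2:end])...)
        end
        # Every other call becomes a broadcast call: f.(args...)
        return Expr(:., f, Expr(:tuple, map(autovec, args)...))
    end
    return Expr(ex.head, map(autovec, ex.args)...)
end

ex = autovec(:(x / ~sum(x)))     # the tilde protects `sum` from broadcasting
f = eval(:(x -> $ex))            # wrap the rewritten expression in a function
shares = Base.invokelatest(f, [1.0, 1.0, 2.0])
```

Here the division is broadcast over `x` while `sum(x)` is computed once on the whole vector, which is the behavior the tilde is meant to request.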

Thanks - so basically $ was taken. It’s easier to remember when understanding the rationale behind it.

Summer Happenings for Tidier.jl

  • Introducing TidierIteration.jl
  • What’s new in TidierData.jl, TidierDB.jl, TidierFiles.jl, and TidierCats.jl

Announcing TidierIteration.jl (v1.0.0)

TidierIteration.jl is a package aimed at making it easier to iterate on collections, modeled after the purrr R package. It also provides some functional programming tools: adverbs, composition, safe functions, and more.

Here’s a list of the supported functions. map_* functions apply a function on a collection and return a collection. walk_* functions work similarly to map_* but do not return anything – they are primarily intended to be used where the function produces side effects only (e.g., saving output to files). modify_* functions update collections in-place, and flatten_* functions convert ragged collections (e.g., JSON-style data) into non-ragged collections.


map_tidy, map_values (for iterating on Dict values), map_keys (for iterating on Dict keys), map_dfr, map_dfc, map2, imap, pmap


walk, walk2, iwalk, pwalk


modify, modify!, modify_values!, modify_if, modify_if!

Keep, Discard, and Compact

keep, keep!, keep_keys, discard, discard!, compact, compact!


is_empty, is_non_empty, every, some, none, detect_index, detect, has_element, has_key, get_value


compose, compose_n, negate, possibly


flatten, flatten_n, flatten_dfr, flatten_json, flatten_dfr_json, json_string, to_json
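
As a rough picture of what flattening a ragged, JSON-style collection means, here is a sketch in Base Julia. `flatten_dict` is a hypothetical helper written for this post; the flatten_* functions in TidierIteration.jl may use different signatures and key naming.

```julia
# Hypothetical helper; TidierIteration's flatten_* functions may behave differently.
function flatten_dict(d::AbstractDict, prefix::String = "")
    flat = Dict{String, Any}()
    for (k, v) in d
        key = isempty(prefix) ? string(k) : prefix * "_" * string(k)
        if v isa AbstractDict
            merge!(flat, flatten_dict(v, key))   # recurse into nested dictionaries
        else
            flat[key] = v                        # leaf value: store under the joined key
        end
    end
    return flat
end

response = Dict("id" => 1, "user" => Dict("name" => "Ada", "geo" => Dict("lat" => 51.5)))
flat = flatten_dict(response)
```

The nested keys end up as `user_name` and `user_geo_lat`, so every value sits at the top level and can be turned into a single table row.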

Why use TidierIteration.jl when Julia already has great iteration capabilities?

  • The collection is always the first argument of the map_* family of functions, which makes the functions easier to use inside of chains/pipes
  • We extend the map_* family to Julia objects which are not mapped by default, like dictionaries, for which we have map_values() and map_keys()
  • We also provide the map2, imap and pmap methods to map over 2 or n elements at the same time
  • We provide the flatten_* functions to tidy wild dictionaries (like JSON responses from APIs) and many adverbs.
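
The collection-first argument order can be illustrated with Base Julia alone. The helpers below (`map_vals`, `map_ks`, `map_pairs`) are hypothetical stand-ins written for this post, not the package's functions, and the package's own signatures may differ.

```julia
# Hypothetical collection-first helpers written with Base Julia only.
map_vals(d::AbstractDict, f) = Dict(k => f(v) for (k, v) in d)
map_ks(d::AbstractDict, f) = Dict(f(k) => v for (k, v) in d)
map_pairs(xs, ys, f) = [f(x, y) for (x, y) in zip(xs, ys)]

prices = Dict("apple" => 1.5, "pear" => 2.0)

# With the collection first, the call reads left to right inside a pipe.
doubled = prices |> d -> map_vals(d, v -> 2v)
shouted = map_ks(prices, uppercase)

# Mapping over two collections at the same time.
sums = map_pairs([1, 2, 3], [10, 20, 30], +)
```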

TidierData.jl v0.16.2 released today

The latest version brings in a bugfix and some minor improvements:

  • Bugfix: @slice_min() and @slice_max() respect the n argument
  • Adds @head as a convenience wrapper around @slice_head()
  • Adds extra argument for @separate() and remove argument for @unite()

We’ve also added our first round of syntax comparisons to DataFrames.jl for users who go back and forth between the two packages: Comparison to DF.jl - TidierData.jl (tidierorg.github.io)

There are a number of TidierData.jl “features” we don’t currently highlight on the comparisons, so stay tuned for further expansion of this page.

TidierDB.jl is now up to v0.3.3 and gained a number of improvements over the summer

  • The package is much lighter and relies on package extensions for:
    • Postgres, ClickHouse, MySQL, MsSQL, SQLite, Oracle, Athena, and Google BigQuery
    • Documentation (Getting Started - TidierDB.jl) updated for using these backends.
  • adds support for reading from multiple files at once as a vector of paths in db_table when using DuckDB
    • e.g., db_table(db, ["path1", "path2"])
  • adds streaming support when using DuckDB with @collect(stream = true)
  • allows user to customize file reading via db_table(db, "read_*(path, args)") when using DuckDB
  • adds @head for limiting number of collected rows
  • adds support for reading URLs in db_table with ClickHouse
  • adds support for reading from multiple files at once as a vector of urls in db_table when using ClickHouse
    • e.g., db_table(db, ["url1", "url2"])
  • Bugfix: @count updates metadata
  • adds connect() support for Microsoft SQL Server
  • adds show_tables for most backends to view existing tables
  • Docs comparing TidierDB to Python’s Ibis: TidierDB.jl vs Ibis - TidierDB.jl
  • Docs around working with larger than RAM data: Working With Larger than RAM Datasets - TidierDB.jl

TidierFiles.jl v0.1.4 introduces a general file reader/writer function

Inspired by FileIO.jl and the rio R package, TidierFiles now includes read_file() and write_file() functions that work across all tabular file types supported by the package. This means that you can use a consistent interface (same arguments) across the following file types with a single function, which previously required the below bespoke functions:

  • read_csv and write_csv
  • read_tsv and write_tsv
  • read_xlsx and write_xlsx
  • read_delim and write_delim
  • read_table and write_table
  • read_fwf
  • read_sav and write_sav (.sav and .por)
  • read_sas and write_sas (.sas7bdat and .xpt)
  • read_dta and write_dta (.dta)
  • read_arrow and write_arrow
  • read_parquet and write_parquet
  • read_rdata (.rdata and .rds)
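
The idea of one entry point that picks a reader from the file extension can be sketched in Base Julia. Everything below is hypothetical (`read_rows`, the two toy readers, and the `readers` lookup table) and is not how TidierFiles.jl is implemented; it only illustrates the dispatch-by-extension pattern.

```julia
# Hypothetical toy readers standing in for format-specific functions.
read_csv_rows(path) = [split(line, ',') for line in eachline(path)]
read_tsv_rows(path) = [split(line, '\t') for line in eachline(path)]

# Map each supported extension to its reader.
readers = Dict(".csv" => read_csv_rows, ".tsv" => read_tsv_rows)

# Single entry point: look at the extension, then delegate.
function read_rows(path::AbstractString)
    ext = lowercase(splitext(path)[2])
    haskey(readers, ext) || throw(ArgumentError("unsupported file type: $ext"))
    return readers[ext](path)
end

dir = mktempdir()
csv_path = joinpath(dir, "demo.csv")
write(csv_path, "a,b\n1,2\n")
rows = read_rows(csv_path)
```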

TidierCats.jl v0.1.2 was released last week

It adds 3 new functions for working with categorical variables:

  • cat_replace_missing: Replaces missing values in a categorical array with a specified level.
  • cat_other: Replaces selected levels in a categorical array with the ‘other’ level.
  • cat_recode: Recodes the levels in a categorical array based on a provided mapping.
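
As a rough illustration of the kinds of operations these names suggest (filling in missing values, collapsing chosen levels into "other", and recoding levels), here is a sketch on plain vectors. The helper names are hypothetical and the code does not use TidierCats.jl, which works on categorical arrays and may take different arguments.

```julia
# Hypothetical helpers on plain vectors; TidierCats works on categorical arrays.
replace_missing(xs, label = "missing") = [coalesce(x, label) for x in xs]
lump_levels(xs, levels; other = "other") = [x in levels ? other : x for x in xs]
recode_levels(xs, mapping) = [get(mapping, x, x) for x in xs]

sizes = ["S", missing, "M", "XL", "S"]

filled = replace_missing(sizes)                      # missing becomes the label "missing"
lumped = lump_levels(filled, ["XL", "missing"])      # selected levels become "other"
recoded = recode_levels(lumped, Dict("S" => "small", "M" => "medium"))
```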

It’s been a busy summer for Tidier! We are continuing to work on packages across our ecosystem and welcome users and contributors.

(Sharing on behalf of the Tidier team)


TidierDB.jl v0.5.1 now includes

  • 65 tests demonstrating identical results between TidierData.jl and TidierDB.jl
  • ability to use TidierDB queries inside other macros, including @mutate, @filter, @summarize
ok = @chain t(df_mem) @summarize(mean = mean(value));
@eval @chain t(df_mem) begin
   @filter(value > $ok)
end
  • ability to join multi-step TidierDB queries on various tables together as demonstrated in this DuckDB example recreating a chain with 6 @inner_joins, as well as cross schema joins
  • ability to create and use SQL views for multiple backends with @create_view
  • ability to create a table from a query on a backend with @compute
  • improved date handling with dmy, ymd, mdy and ability to add intervals
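
For readers wondering why the @eval and $ combination matters when one query is used inside another macro, here is a small Base Julia sketch. `@seen` is a hypothetical macro that just reports the expression it received; it is not part of any Tidier package.

```julia
# Hypothetical macro that reports the expression it received as a string.
macro seen(ex)
    return string(ex)
end

threshold = 5

# Without @eval, the macro only sees the symbol `threshold`.
plain = @seen(value > threshold)

# With @eval, `$threshold` is replaced by its value before the macro expands.
spliced = @eval @seen(value > $threshold)
```

In the first call the macro has no access to what `threshold` holds; in the second, the value has already been spliced into the expression by the time the macro runs.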

Announcing Tidier.jl 1.5.0

Thanks to Randy Boyes, Daniel Rizk, and everyone who submitted issues or made suggestions. The new version resolves dependency conflicts, includes several underlying package updates, and bumps the minimum Julia version to 1.10. Here are key updates to some of the underlying packages over the past ~3 months.

TidierData.jl is now up to v0.16.4

  • Bugfix: Only functions in Base, Core, and Statistics are not escaped. All other functions and callables are escaped.
  • Updated minimum Julia version to 1.10
  • Bugfix: @summary no longer errors with non-numeric columns. Instead, it only reports non-numeric summary stats on non-numeric columns. Minor changes to summary column names to be snake_case.
  • Bugfix: Reverted a bug introduced in v0.13.4, which escaped all macros. Now, string macros remain escaped (i.e., keeping it possible to work with Unitful units, e.g. u"psi"), but other macros are not escaped to allow for those macros to refer to column names within arguments.
  • Updated documentation on new preferred method of interpolation using @eval and $
  • Added documentation on using other macros inside of TidierData macros

TidierPlots.jl is now up to v0.9.0

  • Refactor to directly wrap Makie SpecAPI
  • Multiple bugfixes to restore functionality broken by refactor
  • Calculations now work in macro aes
  • Fixes numerous small issues
  • Plots and geoms can now be broadcast

TidierDB.jl is now up to v0.6.2

  • adds @intersect and @setdiff (SQL’s INTERSECT and EXCEPT, respectively), with optional all argument
  • adds support for all arg to @union (equivalent to @union_all)
  • Bumps Julia LTS to 1.10
  • Adds support for joining on multiple columns
  • Adds support for inequality joins
  • Adds support for AsOf / rolling joins
  • Equi-joins no longer duplicate key columns
  • Fixes bug to allow array columns to be mutated in
  • adds @relocate
  • bug fix when reading file paths with * wildcard with DuckDB
  • Fix edge case when creating an array column in @mutate
  • adds _by support to @mutate and @summarize for grouping within the macro call.
  • adds support for n() in @mutate
  • adds support for unnesting content in @mutate, @filter, etc. via column[key] syntax
  • db_table(db, name) now supports .geoparquet paths for DuckDB
  • support for reusing TidierDB queries inside other macros, including @mutate, @filter, @summarize
  • adds @union_all to bind all rows not just distinct rows as with @union
  • joining syntax now supports (table1, table2, col_name) when joining columns have shared name
  • if_else now has optional final argument for handling missing values to match TidierData

TidierData.jl v0.17.0 now supports logging

This has been a wishlist item ever since the first version of TidierData.jl came out.

Here’s a full list of updates in this release, along with an example showing the new logging feature in action.

  • Bugfix: @count() can now be called multiple times. If column n already exists, then the new column containing the count will be nn (and so on).
  • Bugfix: @unnest_wider() now works on data where keys are missing
  • Bugfix: Fixes @filter() involving multiple comparison operators (e.g., 3 <= a < 5), which have a :head of :comparison and are parsed differently than (3 <= a) && (a < 5)
  • Adds logging ability to track changes to data frames with TidierData_set("log", true)
  • Adds docs describing logging and code printing
julia> using TidierData
julia> using RDatasets
julia> movies = dataset("ggplot2", "movies");
julia> TidierData_set("log", true) # enable logging

julia> @chain movies begin
           @filter(Year > 2000)
           @mutate(Budget_cat = case_when(Budget > 18000 => "high",
                                          Budget > 2000  => "medium",
                                          Budget > 100 => "low",
                                          true => missing))
           @group_by(Year, Budget_cat)
           @summarize(Avg_Budget = mean(Budget), n = n())
       end

[ Info: @filter: removed 50047 rows (85.0%), 8741 rows remaining. 
[ Info: @mutate: new variable "Budget_cat" with 4 unique values and 82.0% missing. 
[ Info: @filter: removed 7129 rows (82.0%), 1612 rows remaining. 
[ Info: @group_by added groups: ["Year", "Budget_cat"]
[ Info: @summarize returned a GroupedDataFrame (20 rows, 4 columns). 
[ Info: @ungroup removed groups: ["Year"]
20×4 DataFrame
 Row │ Year   Budget_cat  Avg_Budget    n     
     │ Int32  String?     Float64       Int64 
   1 │  2005  low         2000.0            1
   2 │  2002  missing        0.0            2
   3 │  2001  missing        0.0            3
   4 │  2001  low         1425.0            4
   5 │  2005  missing        0.0            4
   6 │  2004  missing        0.0            4
   7 │  2002  low         1500.0            7
   8 │  2003  missing        0.0            7
   9 │  2005  medium      8249.94          16
  10 │  2003  low         1443.48          23
  11 │  2001  medium      9580.0           25
  12 │  2004  low         1308.0           25
  13 │  2002  medium      7815.33          30
  14 │  2003  medium      8027.28          67
  15 │  2005  high           2.0684e7      82
  16 │  2004  medium      7946.05          91
  17 │  2003  high           2.14431e7    276
  18 │  2001  high           2.13646e7    289
  19 │  2002  high           2.17604e7    320
  20 │  2004  high           1.88698e7    336
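
On the @filter() bugfix for chained comparisons mentioned above: the difference in parsing is easy to inspect in plain Julia. The snippet below only uses Base; `split_comparison` is a hypothetical helper showing one way such an expression could be rewritten into an explicit && chain, not necessarily how TidierData handles it.

```julia
chained = :(3 <= a < 5)
explicit = :((3 <= a) && (a < 5))

chained.head    # :comparison
chained.args    # Any[3, :<=, :a, :<, 5]
explicit.head   # :&&

# Hypothetical helper: rewrite a chained comparison as an explicit && chain.
function split_comparison(ex::Expr)
    ex.head == :comparison || return ex
    a = ex.args
    # Operands sit at odd positions, operators at even positions.
    parts = [Expr(:call, a[i + 1], a[i], a[i + 2]) for i in 1:2:length(a) - 2]
    return foldl((l, r) -> Expr(:&&, l, r), parts)
end

rewritten = split_comparison(chained)
```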

Tidier.jl 1.6.0 is on its way to the Julia registry!

  • It makes it possible to seamlessly work across dataframes and databases without needing to manually dispatch to the correct TidierData vs. TidierDB macro. You can even mix and match code as long as you instantiate your SQL query into a data frame (using @collect) before using TidierData macros.

  • It newly re-exports TidierIteration.jl (akin to R’s {purrr} package), which focuses on convenience functions for iterating across Julia collections.
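
The general mechanism that lets one verb work on both an in-memory table and a database query is multiple dispatch. The sketch below uses hypothetical types and a hypothetical `take_head` function purely for illustration; it is not Tidier.jl's code and says nothing about how the package's macros are implemented internally.

```julia
# Hypothetical stand-ins for an in-memory table and a lazy SQL query.
struct MemTable{T}
    rows::Vector{T}
end

struct LazyQuery
    sql::String
end

# One verb, two methods: Julia picks the method from the argument's type.
take_head(t::MemTable, n::Integer) = MemTable(first(t.rows, n))
take_head(q::LazyQuery, n::Integer) = LazyQuery(q.sql * " LIMIT $n")

mem = take_head(MemTable([(a = 1,), (a = 2,), (a = 3,)]), 2)
lazy = take_head(LazyQuery("SELECT * FROM t"), 2)
```

The caller writes `take_head(x, 2)` either way; the in-memory version slices rows immediately, while the query version only extends the SQL text and defers the work.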

Check out the Tidier.jl homepage: https://tidierorg.github.io/Tidier.jl/dev/