[ANN] (Belatedly) Announcing Tidier.jl

kdpsingh · July 30, 2024, 4:23pm

Great question. There is no specific origin other than the fact that tildes aren’t commonly used in data transformation but are easy to type. The $ won’t work because we need to modify the expression after interpolating it. Since $ is eagerly evaluated, we can’t tell which functions were marked with it in order to handle it differently in the auto-vectorization process.

You can think of the tilde as a flag to denote which functions not to vectorize. Once we complete the auto-vectorization step, we remove the flag.
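
As a rough illustration of the flag idea, here is a minimal sketch in plain Julia. `is_flagged` and `strip_flags` are hypothetical helpers written for this example and are not TidierData's internals. The sketch only shows that `~f(x)` survives parsing as an ordinary call to `~`, so a macro can find the flag and later remove it.

```julia
# Hypothetical helpers for illustration only; TidierData's real code differs.

# True if `ex` is a unary call of the form `~something`.
is_flagged(ex) = ex isa Expr && ex.head == :call &&
                 length(ex.args) == 2 && ex.args[1] == Symbol("~")

# Walk the expression tree and drop every `~` flag,
# keeping the expression that the flag wrapped.
function strip_flags(ex)
    ex isa Expr || return ex
    is_flagged(ex) && return strip_flags(ex.args[2])
    return Expr(ex.head, map(strip_flags, ex.args)...)
end

ex = :(~mean(x) + y)       # parsed as (~(mean(x))) + y
flagged = ex.args[2]       # the `~mean(x)` subexpression
cleaned = strip_flags(ex)  # same expression without the flag
```

A macro that sees `flagged` could skip vectorizing `mean` and then emit `cleaned`.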

hatmatrix · July 30, 2024, 7:36pm

Thanks - so basically $ was taken. It’s easier to remember when understanding the rationale behind it.

kdpsingh · September 3, 2024, 9:29pm

Summer Happenings for Tidier.jl

Introducing TidierIteration.jl
What’s new in TidierData.jl, TidierDB.jl, TidierFiles.jl, and TidierCats.jl

Announcing TidierIteration.jl (v1.0.0)

TidierIteration.jl is a package aimed at making it easier to iterate on collections, modeled after the purrr R package. It also provides some tools of functional programming: adverbs, composition, safe-functions and more.

Here’s a list of the supported functions. map_* functions apply a function to a collection and return a collection. walk_* functions work similarly to map_* but do not return anything; they are primarily intended for cases where the function only produces side effects (e.g., saving output to files). modify_* functions update collections in place, and flatten_* functions convert ragged collections (e.g., JSON-style data) into non-ragged collections.

Map

map_tidy, map_values (for iterating on Dict values), map_keys (for iterating on Dict keys), map_dfr, map_dfc, map2, imap, pmap

Walk

walk, walk2, iwalk, pwalk

Modify

modify, modify!, modify_values!, modify_if, modify_if!

Keep, Discard, and Compact

keep, keep!, keep_keys, discard, discard!, compact, compact!

Predicates

is_empty, is_non_empty, every, some, none, detect_index, detect, has_element, has_key, get_value

Adverbs

compose, compose_n, negate, possibly

Flatten

flatten, flatten_n, flatten_dfr, flatten_json, flatten_dfr_json, json_string, to_json

Why use TidierIteration.jl when Julia already has great iteration capabilities?

The collection is always the first argument of the map_* family of functions, which makes the functions easier to use inside of chains/pipes
We extend the map_* family to Julia objects which are not mapped by default, like dictionaries, for which we have map_values() and map_keys()
We also provide the map2, imap and pmap methods to map over 2 or n elements at the same time
We provide the flatten_* functions to tidy wild dictionaries (like JSON responses from APIs) and many adverbs.
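
To make the first two points concrete, here is a small sketch in Base Julia. `map_over` and `map_vals` are hypothetical stand-ins defined only for this example; they are not the TidierIteration.jl functions, and their signatures are assumptions. The sketch shows why a collection-first argument order reads naturally in a pipe, and what mapping over dictionary values means.

```julia
# Hypothetical stand-ins written in Base Julia; not TidierIteration's API.

# Collection first, function second (Base `map` takes the function first).
map_over(xs, f) = map(f, xs)

# Apply `f` to each value of a dictionary, keeping the keys.
map_vals(d::AbstractDict, f) = Dict(k => f(v) for (k, v) in d)

# The collection flows in from the left of the pipe.
squares = [1, 2, 3] |> (xs -> map_over(xs, x -> x^2))

scores  = Dict("a" => 1, "b" => 2)
doubled = map_vals(scores, v -> 2v)
```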

TidierData.jl v0.16.2 released today

The latest version brings in a bugfix and some minor improvements:

Bugfix: @slice_min() and @slice_max() respect the n argument
Adds @head as a convenience wrapper around @slice_head()
Adds extra argument for @separate() and remove argument for @unite()

We’ve also added our first round of syntax comparisons to DataFrames.jl for users who go back and forth between the two packages: Comparison to DF.jl - TidierData.jl (tidierorg.github.io)

There are a number of TidierData.jl “features” we don’t currently highlight on the comparisons, so stay tuned for further expansion of this page.

TidierDB.jl is now up to v0.3.3 and gained a number of improvements over the summer

The package is much lighter and relies on package extensions for:
- Postgres, ClickHouse, MySQL, MsSQL, SQLite, Oracle, Athena, and Google BigQuery
- Documentation (Getting Started - TidierDB.jl) updated for using these backends.
adds support for reading from multiple files at once as a vector of paths in db_table when using DuckDB
- i.e., db_table(db, ["path1", "path2"])
adds streaming support when using DuckDB with @collect(stream = true)
allows user to customize file reading via db_table(db, "read_*(path, args)") when using DuckDB
adds @head for limiting number of collected rows
adds support for reading URLs in db_table with ClickHouse
adds support for reading from multiple files at once as a vector of urls in db_table when using ClickHouse
- i.e., db_table(db, ["url1", "url2"])
Bugfix: @count updates metadata
adds connect() support for Microsoft SQL Server
adds show_tables for most backends to view existing tables
Docs comparing TidierDB to Python’s Ibis: TidierDB.jl vs Ibis - TidierDB.jl
Docs around working with larger than RAM data: Working With Larger than RAM Datasets - TidierDB.jl

TidierFiles.jl v0.1.4 introduces a general file reader/writer function

Inspired by FileIO.jl and the rio R package, TidierFiles now includes read_file() and write_file() functions that work across all tabular file types supported by the package. This means that a single function with a consistent interface (same arguments) now covers the following file types, each of which previously required its own bespoke function:

read_csv and write_csv
read_tsv and write_tsv
read_xlsx and write_xlsx
read_delim and write_delim
read_table and write_table
read_fwf
read_sav and write_sav (.sav and .por)
read_sas and write_sas (.sas7bdat and .xpt)
read_dta and write_dta (.dta)
read_arrow and write_arrow
read_parquet and write_parquet
read_rdata (.rdata and .rds)
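
One way to picture a general reader is a lookup from file extension to the bespoke function. The sketch below is an assumption about the general idea, not TidierFiles.jl's source. `READER_FOR_EXT` and `reader_name` are hypothetical names, and only a few of the extensions listed above are included.

```julia
# Hypothetical extension lookup for illustration; not TidierFiles' code.
# The values are the names of bespoke readers from the list above.
const READER_FOR_EXT = Dict(
    ".csv"     => :read_csv,
    ".tsv"     => :read_tsv,
    ".xlsx"    => :read_xlsx,
    ".dta"     => :read_dta,
    ".parquet" => :read_parquet,
)

# Return the name of the reader that matches the file extension of `path`.
function reader_name(path::AbstractString)
    ext = lowercase(splitext(path)[2])   # e.g. ".CSV" becomes ".csv"
    haskey(READER_FOR_EXT, ext) || error("unsupported file extension: $ext")
    return READER_FOR_EXT[ext]
end
```

Calling `reader_name("data/trial.CSV")` picks the CSV reader regardless of the extension's case.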

TidierCats.jl v0.1.2 was released last week

It adds 3 new functions for working with categorical variables:

cat_replace_missing: Replaces missing values in a categorical array with a specified replacement level.
cat_other: Replaces selected levels in a categorical array with the ‘other’ level.
cat_recode: Recodes the levels in a categorical array based on a provided mapping.

It’s been a busy summer for Tidier! We are continuing to work on packages across our ecosystem and welcome users and contributors.

(Sharing on behalf of the Tidier team)

drizk1 · November 14, 2024, 9:00pm

TidierDB.jl v0.5.1 now includes

65 tests demonstrating identical results between TidierData.jl and TidierDB.jl
ability to use TidierDB queries inside other macros, including @mutate , @filter , @summarize

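# Build a TidierDB query (not yet collected) that computes the mean of `value`
ok = @chain t(df_mem) @summarize(mean = mean(value));
# @eval lets $ok interpolate that query into the @filter call
@eval @chain t(df_mem) begin
   @filter(value > $ok)
   @collect
end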

ability to join multi-step TidierDB queries on various tables together, as demonstrated in this DuckDB example recreating a chain with 6 @inner_joins, as well as cross-schema joins
ability to create and use SQL views for multiple backends with @create_view
ability to create a table from a query on a backend with @compute
improved date handling with dmy , ymd, mdy and ability to add intervals

kdpsingh · January 11, 2025, 11:42am

Announcing Tidier.jl 1.5.0

Thanks to Randy Boyes, Daniel Rizk, and everyone who submitted issues or made suggestions. The new version resolves dependency conflicts, includes several underlying package updates, and bumps the minimum Julia version to 1.10. Here are key updates to some of the underlying packages over the past ~3 months.

TidierData.jl: is now up to v0.16.4

Bugfix: Only functions in Base, Core, and Statistics are not escaped. All other functions and callables are escaped.
Updated minimum Julia version to 1.10
Bugfix: @summary no longer errors with non-numeric columns. Instead, it only reports non-numeric summary stats on non-numeric columns. Minor changes to summary column names to be snake_case.
Bugfix: Reverted a bug introduced in v0.13.4, which escaped all macros. Now, string macros remain escaped (i.e., keeping it possible to work with Unitful units, e.g. u"psi"), but other macros are not escaped to allow for those macros to refer to column names within arguments.
Updated documentation on new preferred method of interpolation using @eval and $
Added documentation on using other macros inside of TidierData macros

TidierPlots.jl: is now up to v0.9.0

Refactor to directly wrap Makie SpecAPI
Multiple bugfixes to restore functionality broken by refactor
Calculations now work in macro aes
Fixes numerous small issues
Plots and geoms can now be broadcast

TidierDB.jl: is now up to v0.6.2

adds @intersect and @setdiff (SQL’s INTERSECT and EXCEPT, respectively), with optional all argument
adds support for all arg to @union (equivalent to @union_all)
Bumps julia LTS to 1.10
Adds support for joining on multiple columns
Adds support for inequality joins
Adds support for AsOf / rolling joins
Equi-joins no longer duplicate key columns
Fixes bug to allow array columns to be mutated in
adds @relocate
bug fix when reading file paths with * wildcard with DuckDB
Fix edge case when creating an array column in @mutate
adds _by support to @mutate and @summarize for grouping within the macro call.
adds support for n() in @mutate
adds support for unnesting content in mutate/filter etc. via column[key] syntax
db_table(db, name) now supports .geoparquet paths for DuckDB
support for reusing TidierDB queries inside other macros, including @mutate, @filter, @summarize
adds @union_all to bind all rows not just distinct rows as with @union
joining syntax now supports (table1, table2, col_name) when joining columns have shared name
if_else now has optional final argument for handling missing values to match TidierData

kdpsingh · March 25, 2025, 2:13am

TidierData.jl v0.17.0 now supports logging

This has been a wishlist item ever since the first version of TidierData.jl came out.

Here’s a full list of updates in this release, along with an example showing the new logging feature in action.

Bugfix: @count() can now be called multiple times. If column n already exists, then the new column containing the count will be nn (and so on).
Bugfix: @unnest_wider() now works on data where keys are missing
Bugfix: Fixes @filter() involving multiple comparison operators (e.g., 3 <= a < 5), which have a :head of :comparison and are parsed differently than (3 <= a) && (a < 5)
Adds logging ability to track changes to data frames with TidierData_set("log", true)
Adds docs describing logging and code printing
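
The chained-comparison bugfix in the list above is easier to follow with the parsed expressions in hand. The sketch below uses only Base Julia. `expand_comparison` is a hypothetical helper that shows one possible rewrite; it is not the fix that TidierData applies.

```julia
chained = :(3 <= a < 5)             # head is :comparison
paired  = :((3 <= a) && (a < 5))    # head is :&&

# A :comparison expression stores operands and operators interleaved:
# chained.args is Any[3, :<=, :a, :<, 5]

# Hypothetical rewrite of a chained comparison into `&&`-joined calls.
function expand_comparison(ex::Expr)
    ex.head == :comparison || return ex
    a = ex.args
    # Operands sit at odd positions, operators at even positions.
    calls = [Expr(:call, a[i + 1], a[i], a[i + 2]) for i in 1:2:length(a) - 2]
    return foldl((l, r) -> Expr(:&&, l, r), calls)
end

expanded = expand_comparison(chained)
```

Because the two spellings have different heads, code that walks expressions has to handle both shapes.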

julia> using TidierData
julia> using RDatasets
julia> movies = dataset("ggplot2", "movies");
julia> TidierData_set("log", true) # enable logging

julia> @chain movies begin
           @filter(Year > 2000)
           @mutate(Budget_cat = case_when(Budget > 18000 => "high",
                                          Budget > 2000  => "medium",
                                          Budget > 100 => "low",
                                           true => missing))
           @filter(!ismissing(Budget))
           @group_by(Year, Budget_cat)
           @summarize(Avg_Budget = mean(Budget), n = n())
           @ungroup
           @arrange(n)
       end

[ Info: @filter: removed 50047 rows (85.0%), 8741 rows remaining. 
[ Info: @mutate: new variable "Budget_cat" with 4 unique values and 82.0% missing. 
[ Info: @filter: removed 7129 rows (82.0%), 1612 rows remaining. 
[ Info: @group_by added groups: ["Year", "Budget_cat"]
[ Info: @summarize returned a GroupedDataFrame (20 rows, 4 columns). 
[ Info: @ungroup removed groups: ["Year"]

20×4 DataFrame
 Row │ Year   Budget_cat  Avg_Budget    n     
     │ Int32  String?     Float64       Int64 
─────┼────────────────────────────────────────
   1 │  2005  low         2000.0            1
   2 │  2002  missing        0.0            2
   3 │  2001  missing        0.0            3
   4 │  2001  low         1425.0            4
   5 │  2005  missing        0.0            4
   6 │  2004  missing        0.0            4
   7 │  2002  low         1500.0            7
   8 │  2003  missing        0.0            7
   9 │  2005  medium      8249.94          16
  10 │  2003  low         1443.48          23
  11 │  2001  medium      9580.0           25
  12 │  2004  low         1308.0           25
  13 │  2002  medium      7815.33          30
  14 │  2003  medium      8027.28          67
  15 │  2005  high           2.0684e7      82
  16 │  2004  medium      7946.05          91
  17 │  2003  high           2.14431e7    276
  18 │  2001  high           2.13646e7    289
  19 │  2002  high           2.17604e7    320
  20 │  2004  high           1.88698e7    336

kdpsingh · March 26, 2025, 12:52pm

Tidier.jl 1.6.0 is on its way to the Julia registry!

It makes it possible to seamlessly work across dataframes and databases without needing to manually dispatch to the correct TidierData vs. TidierDB macro. You can even mix and match code as long as you instantiate your SQL query into a data frame (using @collect) before using TidierData macros.
It newly re-exports TidierIteration.jl (akin to R’s {purrr} package), which focuses on convenience functions for iterating across Julia collections.

Check out the Tidier.jl homepage: https://tidierorg.github.io/Tidier.jl/dev/
