[ANN] (Belatedly) Announcing Tidier.jl

With the 1-year anniversary of Tidier.jl coming up soon, I wanted to put together an official (if slightly belated) package announcement describing Tidier.jl to invite you to give it a try.

Tidier.jl originally began as a package intended for working with DataFrames based on the R dplyr and tidyr packages as part of the R tidyverse ecosystem. That portion of Tidier.jl has since been split off into its own package (TiderData.jl), with Tidier.jl focused much more broadly on implementing the entire R tidyverse ecosystem in Julia.

In its current form, Tidier is a meta-package intended for generating, analyzing, transforming, and visualizing data frames. Tidier contains and re-exports the following packages:

While its origins and inspirations come from R, Tidier is designed from the ground up for Julia. It is an opinionated package in that it diverges from some of the concepts established by other macro-based data analysis packages in Julia. Tidier is different not by accident, but by design in an attempt to be user-friendly and easy to use for data analysts. Tidier does bring a bit of magic because of its reliance on macros, but this is done with an eye on usability, and we are careful to ensure that users retain the ability to override the magic.

Let me show you a quick example focused only on TidierData to introduce you to some of the key concepts in the package.

We will use the Visits to Physician Office dataset, which is abbreviated as ofp.

using TiderData, RDatasets
ofp = dataset("Ecdat", "OFP")
ofp = @clean_names(ofp)

@chain ofp begin
    @group_by(region)
    @summarize(mean_age = mean(age * 10))
end
4Γ—2 DataFrame
 Row β”‚ region   mean_age 
     β”‚ Cat…     Float64  
─────┼───────────────────
   1 β”‚ other     73.987
   2 β”‚ midwest   74.0769
   3 β”‚ noreast   73.9343
   4 β”‚ west      74.1165

Here, we are calculating the mean age for each region. We first apply the @clean_names() macro, which converts the column names to snake_case formatting for convenience. We then group by region and calculate the mean age. Because the age is stored in decades and we want the result in years, we have multiply the age by 10.

A couple things to note in this code:

  1. TidierData automatically re-exports the @chain macro from Chain.jl. This makes it easy to write data pipelines. There’s no requirement to use it, but the docs and examples make heavy use of it. It also automatically re-exports the DataFrame() function from the DataFrames package.

  2. region and age are referred to in the code as β€œbare” names rather than as symbols (i.e., :region and :age).

This is intentional because it lets you write more concise code. The majority of names we refer to in data analysis pipelines refer to columns rather than external variables. You can think of code in TidierData as being within a β€œdata frame” scope. If you want to refer to variables outside of the data frame, you can prefix the name with a !!.

For example, if you had a variable named grouping_variable that contained the value :region, you could rewrite the above code as:

grouping_variable = :region

@chain ofp begin
    @group_by(!!grouping_variable)
    @summarize(mean_age = mean(age * 10))
end

When using the !! notation, grouping_variable can also be a vector containing multiple symbols if you want to group by multiple variables. Read more about this here: https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/interpolation/

  1. Note that TidierData automatically vectorizes * so that the expression is converted to mean(age .* 10).When working with non-nested columns of data frames, most data analysis functions are usually intended to be vectorized. Rather than require users to individually vectorize each function, TidierData automatically takes care of the vectorization. While other packages provide a way to vectorize all functions, not all functions make sense to vectorize. For non-nested data, mean() should essentially never be vectorized. In other functions, such as a in b, a should be vectorized but b should not. See https://bkamins.github.io/julialang/2023/02/10/in.html for details on why this is the case. In TidierData, you can write a in b, and it will get converted to in.(a, Ref(Set(b))).

With all that said, TidierData remains fully configurable. Any function prefixed with a ~ will never get vectorized. You can also directly modify the list of functions not to vectorize (see in the link below). Any function that you vectorize manually, such as by writing mean.(), will remain vectorized. You can read more details about the behavior here: https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/autovec/

As a result of these concepts and many more syntactic sugar functions like across() and where(), very complicated multi-line code can often be reduced to something much more concise, readable, and understandable.

All of this is just scratching the surface of Tidier. It’s fully functional - it can handle all the common tasks of data analysis - pivoting, joins, nesting/unnesting, grouping, transformations, if_then/case_when logic, with full-on support for pipeable plotting and more. It’s built on top of the best-in-class Julia packages like DataFrames and Makie. If you’ve been waiting for Tidier to mature before trying it out, it’s ready for a look.

Tidier brings quite a bit of magic to data analysis in Julia – which some folks will love and others will not.

Acknowledgements: We owe a lot to the R tidyverse community for developing an API that we love and want to see more use of in Julia. We also rely heavily on DataFrames.jl (for TidierData), Makie.jl and AlgebraOfGraphics.jl (for TidierPlots), and many other packages.

60 Likes

Is it correct to say that Tidier is aimed at those coming from R? Asking as someone who had no (positive) experience with that language (:, but doing lots of data manipulation, analysis, and plotting in Julia. In particular, I see that comparisons in the docs are often done with tidyverse/ggplot/… or in that context.

2 Likes

Thanks @aplavin for the question.

I think the API will feel natural to folks coming from R’s tidyverse ecosystem, but there’s nothing in the API that is specific to R per se. There are implementations of the tidyverse API in Python (eg, Ibis, tidypolars, siuba, and others) that are also fairly nice to use. Even polars and pandas borrow some of these concepts in Python.

At this point, I’m mostly a Julia-first user. This is my invitation for other Julia users who have never used R to take a look at Tidier.jl.

7 Likes

Thanks! Then some comparisons with Julia would be useful, in addition to those with R.
Like, β€œthis is how you do a plot with AoG.jl, and here see how much easier the same plot is with DataFrames and TidierPlots” (:

3 Likes

That’s a great suggestion. We can definitely add some documentation comparing the Tidier.jl way of doing common tasks to other commonly used Julia packages.

4 Likes

This is so great. The tidyverse system in R is so amazing and it’s one of the things I keep going back to R to use for all my table like data wrangling. Now I will not have to reach for R as often as before. :pray:t2::muscle:t2::tada:

5 Likes

I understand your negativity towards R at a certain level. But what I would also say is the R’s Tidyverse was a major advance in data science. Anything that extracts the elegance, ease of use the pure, mathy-granite-like-robustness of the Tidyverse and delivers it in a hot, new language like Julia, seems like a wonderful idea!

3 Likes

It’s less β€œnegativity”, more β€œlack of positivity” (: I only used R a little bit during a uni course about a decade ago, that’s it. That just means comparisons with R are less understandable for me, and presumably many other Julia users.

A thing that I like about Julia, and that surprised me at first, is that there’s no strong need to use fancy and very specialized data structures (datatables, dataframes) for data processing. Simple arrays can take you a looong way! I personally never find myself in need of table-specialized data structures anymore…

But clearly, R/tidyverse is among the most popular data analysis tools in the world, together with Excel (definitely #1), and Python/Pandas (probably #2). So there are lots of opportunities to grab something from those ecosystems and adapt to Julia.

2 Likes

Announcing TidierData.jl v0.15.0

Inspired by DataFramesMeta.jl, TidierData.jl now supports begin-end blocks within all macros. This makes it easier to write parentheses-free code that spans multiple lines. This (optional) form also makes it easier to comment out portions of your code.

Here’s an example of a @filter with one of the conditions commented out:

@chain movies begin
  @filter begin
    # Year >= 2000
    Rating >= 9
  end
  @slice 1:5
  @select 1:5
end

5Γ—5 DataFrame

Row Title Year Length Budget Rating
String Int32 Int32 Int32? Float64
1 +1 -1 1987 7 missing 9.4
2 100 Years at the Movies 1994 9 missing 9.2
3 13 Lakes 2004 135 missing 9.0
4 2wks, 1yr 2002 104 missing 9.4
5 500 Years Later 2005 106 missing 9.3

For a bit more explanation on how this works within the larger context of piping in TidierData.jl, check out the new docs page here: Piping - TidierData.jl (tidierorg.github.io)

13 Likes

Announcing TidierPlots v0.6.0

The biggest change is under the hood in that the package now relies directly on the new-and-evolving Makie API and no longer uses AoG as an intermediary. This will help us better implement the ggplot2 interface while also making the package more compatible with Makie.

PS. Shoutout to Randy Boyes for all the work on the package.

Changelog:

  • Dependency on AlgebraOfGraphics.jl removed
  • Support for positional aes specification (e.g. @aes(x, y) instead of @aes(x = x, y = y))
  • Support for horizontal bars in geom_bar and geom_col
  • All Makie plotting arguments are supported via passthrough (as either aes or args)
  • Breaks: geom_smooth, geom_contour, facet_wrap, facet_grid, all tests
9 Likes

Announcing TidierDB.jl v0.1.0 (pending registry approval)

We are excited to release TidierOrg/TidierDB.jl, a package focused on bringing the syntax of Tidier.jl to multiple SQL backends, making it possible to analyze data directly on databases without needing to copy the entire database into memory.

Currently supported backends include:

  • DuckDB (the default) set_sql_mode(:duckdb)
  • ClickHouse set_sql_mode(:clickhouse)
  • SQLite set_sql_mode(:lite)
  • MySQL set_sql_mode(:mysql)
  • MSSQL set_sql_mode(:mssql)
  • Postgres set_sql_mode(:postgres)

TidierDB.jl supports the following top-level macros:

  • @arrange
  • @group_by
  • @filter
  • @select
  • @mutate, which supports across()
  • @summarize and @summarise, which supports across()
  • @distinct
  • @left_join, @right_join, @inner_join (slight syntax differences from TidierData.jl)
  • @count
  • @slice_min, @slice_max, @slice_sample
  • @window_order and window_frame
  • @show_query
  • @collect

Supported helper functions for most backends include:

  • across()
  • desc()
  • if_else() and case_when()
  • n()
  • starts_with(), ends_with(), and contains()
  • as_float(), as_integer(), and as_string()
  • is_missing()
  • missing_if() and replace_missing()

From TidierStrings.jl:

  • str_detect, str_replace, str_replace_all, str_remove_all, str_remove

From TidierDates.jl:

  • year, month, day, hour, min, second, floor_date, difftime

Supported aggregate functions (varying by backend):

  • mean, minimium, maximum, std, sum, cumsum, cor, cov, var
  • copy_to (for DuckDB, MySQL, SQLite)

DuckDB specifically enables copy_to to directly read in .parquet, .json, .csv, and .arrow files, including https file paths.

path = "file_path.parquet"
copy_to(conn, file_path, "table_name")

What is the recommended way to use TidierDB?

Typically, you will want to use TidierDB alongside TidierData because there are certain functionality (such as pivoting) which are only supported in TidierData and can only be performed on data frames.

Our recommended path for using TidierDB is to import the package so that there are no namespace conflicts with TidierData. Once TidierDB is integrated with Tidier, then Tidier will automatically load the packages in this fashion.

First, let’s develop and execute a query using TidierDB. Notice that all top-level macros and functions originating from TidierDB start with a DB prefix. Any functions defined within macros do not need to be prefixed within DB because they are actually pseudofunctions that are in actuality converted into SQL code.

Even though the code reads similarly to TidierData, note that no computational work actually occurs until you run DB.@collect(), which runs the SQL query and instantiates the result as a DataFrame.

using TidierData
import TidierDB as DB

mem = DB.duckdb_open(":memory:");
db = DB.duckdb_connect(mem);
path = "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv"
DB.copy_to(db, path, "mtcars");

@chain DB.db_table(db, :mtcars) begin
    DB.@filter(!starts_with(model, "M"))
    DB.@group_by(cyl)
    DB.@summarize(mpg = mean(mpg))
    DB.@mutate(mpg_squared = mpg^2, 
               mpg_rounded = round(mpg), 
               mpg_efficiency = case_when(
                                 mpg >= cyl^2 , "efficient",
                                 mpg < 15.2 , "inefficient",
                                 "moderate"))            
    DB.@filter(mpg_efficiency in ("moderate", "efficient"))
    DB.@arrange(desc(mpg_rounded))
    DB.@collect
end
2Γ—5 DataFrame
 Row β”‚ cyl     mpg       mpg_squared  mpg_rounded  mpg_efficiency 
     β”‚ Int64?  Float64?  Float64?     Float64?     String?        
─────┼────────────────────────────────────────────────────────────
   1 β”‚      4   27.3444      747.719         27.0  efficient
   2 β”‚      6   19.7333      389.404         20.0  moderate

What if we wanted to pivot the result?

We cannot do this using TidierDB. However, we can call @pivot_longer() from TidierData after the result of the query has been instantiated as a DataFrame, like this:

@chain DB.db_table(db, :mtcars) begin
    DB.@filter(!starts_with(model, "M"))
    DB.@group_by(cyl)
    DB.@summarize(mpg = mean(mpg))
    DB.@mutate(mpg_squared = mpg^2, 
               mpg_rounded = round(mpg), 
               mpg_efficiency = case_when(
                                 mpg >= cyl^2 , "efficient",
                                 mpg < 15.2 , "inefficient",
                                 "moderate"))            
    DB.@filter(mpg_efficiency in ("moderate", "efficient"))
    DB.@arrange(desc(mpg_rounded))
    DB.@collect
    @pivot_longer(everything(), names_to = "variable", values_to = "value")
end
10Γ—2 DataFrame
 Row β”‚ variable        value     
     β”‚ String          Any       
─────┼───────────────────────────
   1 β”‚ cyl             4
   2 β”‚ cyl             6
   3 β”‚ mpg             27.3444
   4 β”‚ mpg             19.7333
   5 β”‚ mpg_squared     747.719
   6 β”‚ mpg_squared     389.404
   7 β”‚ mpg_rounded     27.0
   8 β”‚ mpg_rounded     20.0
   9 β”‚ mpg_efficiency  efficient
  10 β”‚ mpg_efficiency  moderate

What SQL query does TidierDB generate for a given piece of Julia code?

We can replace DB.collect() with DB.@show_query to reveal the underlying SQL query being generated by TidierDB. To handle complex queries, TidierDB makes heavy use of Common Table Expressions (CTE), which are a useful tool to organize long queries.

@chain DB.db_table(db, :mtcars) begin
    DB.@filter(!starts_with(model, "M"))
    DB.@group_by(cyl)
    DB.@summarize(mpg = mean(mpg))
    DB.@mutate(mpg_squared = mpg^2, 
               mpg_rounded = round(mpg), 
               mpg_efficiency = case_when(
                                 mpg >= cyl^2 , "efficient",
                                 mpg < 15.2 , "inefficient",
                                 "moderate"))            
    DB.@filter(mpg_efficiency in ("moderate", "efficient"))
    DB.@arrange(desc(mpg_rounded))
    DB.@show_query
end
WITH cte_1 AS (
SELECT *
        FROM mtcars
        WHERE NOT (starts_with(model, 'M'))),
cte_2 AS (
SELECT cyl, AVG(mpg) AS mpg
        FROM cte_1
        GROUP BY cyl),
cte_3 AS (
SELECT  cyl, mpg, POWER(mpg, 2) AS mpg_squared, ROUND(mpg) AS mpg_rounded, CASE WHEN mpg >= POWER(cyl, 2) THEN 'efficient' WHEN mpg < 15.2 THEN 'inefficient' ELSE 'moderate' END AS mpg_efficiency
        FROM cte_2 ),
cte_4 AS (
SELECT *
        FROM cte_3
        WHERE mpg_efficiency in ('moderate', 'efficient'))  
SELECT *
        FROM cte_4  
        ORDER BY mpg_rounded DESC

TidierDB is already quite fully-featured, supporting advanced TidierData functions like across() for multi-column selection.

@chain DB.db_table(db, :mtcars) begin
    DB.@group_by(cyl)
    DB.@summarize(across((starts_with("a"), ends_with("s")), (mean, sum)))
    DB.@collect
end
3Γ—5 DataFrame
 Row β”‚ cyl     mean_am   mean_vs   sum_am   sum_vs  
     β”‚ Int64?  Float64?  Float64?  Int128?  Int128? 
─────┼──────────────────────────────────────────────
   1 β”‚      4  0.727273  0.909091        8       10
   2 β”‚      6  0.428571  0.571429        3        4
   3 β”‚      8  0.142857  0.0             2        0

Bang-bang !! interpolation for columns and values is also supported.

There are a few subtle but important differences from Tidier.jl outlined here.

Missing a function or backend?

You can use any existing SQL function within @mutate with the correct SQL syntax and it should just work.

But if you run into problems please open an issue, and we will be happy to take a look!

18 Likes

This is such a joy to use. I’m used to dbplyr from the R ecosystem which I frequently grabbed for when needing to do analysis on data residing in PostgreSQL. A very welcome addition to my toolbox in Julia!

2 Likes

Working on a presentation for the R User group at my work to demonstrate how awesome Tidier.jl and Julia are. I’m not a Tidyverse user myself, but Tidier.jl is such a joy to use. Thank you!

4 Likes

This is such a great package!

I had somewhat of an off-topic question when it relates to the output… I have seen in the docs and here that the output of the dataframe has syntax-highlighting or colored outputs. I love that. I am using OhMyREPL but I am not getting that in outputs. Is that something that can be done automatically in the REPL?

2 Likes

Announcing TidierFiles.jl v0.1.0 (pending registry approval)

TidierFiles brings a consistent interface to the reading and writing of tabular data, including a consistent syntax to read files locally versus from the web (http/https) and consistent keyword arguments across data formats.

It provides a wrapper around CSV.jl, XLSX.jl, and ReadStatTables.jl. It’s narrower in scope as compared to FileIO.jl but provides better synchronization of keyword arguments across wrapped packages.

Currently supported file types:

  • read_csv and write_csv
  • read_tsv and write_tsv
  • read_xlsx and write_xlsx
  • read_delim and write_delim
  • read_table and write_table
  • read_fwf
  • read_sav and write_sav (.sav and .por)
  • read_sas and write_sas (.sas7bdat and .xpt)
  • read_dta and write_dta (.dta)
6 Likes

This is a great question. The colors are because of Documenter.jl performing code and output styling for the documentation. In the REPL, you’re right that outputs aren’t colored. I wonder if someone has looked at this, but maybe it’s doable?

I wonder if @bkamins knows if there’s a way to get colored output for DataFrames.

1 Like

See Home Β· Pretty Tables

4 Likes

Hi @kdpsingh

Thanks a lot for these updates - really happy to see the progress on these packages! At every turn the decisions seem well thought-through so wanted to ask - is there a semantic origin of ~ for escaping vectorization in Julia? I’m only familiar with a similar use of $ in preventing broadcasting of a function.