[ANN] (Belatedly) Announcing Tidier.jl

With the 1-year anniversary of Tidier.jl coming up soon, I wanted to put together an official (if slightly belated) package announcement describing Tidier.jl to invite you to give it a try.

Tidier.jl originally began as a package intended for working with DataFrames based on the R dplyr and tidyr packages as part of the R tidyverse ecosystem. That portion of Tidier.jl has since been split off into its own package (TiderData.jl), with Tidier.jl focused much more broadly on implementing the entire R tidyverse ecosystem in Julia.

In its current form, Tidier is a meta-package intended for generating, analyzing, transforming, and visualizing data frames. Tidier contains and re-exports the following packages:

While its origins and inspirations come from R, Tidier is designed from the ground up for Julia. It is an opinionated package in that it diverges from some of the concepts established by other macro-based data analysis packages in Julia. Tidier is different not by accident, but by design in an attempt to be user-friendly and easy to use for data analysts. Tidier does bring a bit of magic because of its reliance on macros, but this is done with an eye on usability, and we are careful to ensure that users retain the ability to override the magic.

Let me show you a quick example focused only on TidierData to introduce you to some of the key concepts in the package.

We will use the Visits to Physician Office dataset, which is abbreviated as ofp.

using TiderData, RDatasets
ofp = dataset("Ecdat", "OFP")
ofp = @clean_names(ofp)

@chain ofp begin
    @group_by(region)
    @summarize(mean_age = mean(age * 10))
end
4×2 DataFrame
 Row │ region   mean_age 
     │ Cat…     Float64  
─────┼───────────────────
   1 │ other     73.987
   2 │ midwest   74.0769
   3 │ noreast   73.9343
   4 │ west      74.1165

Here, we are calculating the mean age for each region. We first apply the @clean_names() macro, which converts the column names to snake_case formatting for convenience. We then group by region and calculate the mean age. Because the age is stored in decades and we want the result in years, we have multiply the age by 10.

A couple things to note in this code:

  1. TidierData automatically re-exports the @chain macro from Chain.jl. This makes it easy to write data pipelines. There’s no requirement to use it, but the docs and examples make heavy use of it. It also automatically re-exports the DataFrame() function from the DataFrames package.

  2. region and age are referred to in the code as “bare” names rather than as symbols (i.e., :region and :age).

This is intentional because it lets you write more concise code. The majority of names we refer to in data analysis pipelines refer to columns rather than external variables. You can think of code in TidierData as being within a “data frame” scope. If you want to refer to variables outside of the data frame, you can prefix the name with a !!.

For example, if you had a variable named grouping_variable that contained the value :region, you could rewrite the above code as:

grouping_variable = :region

@chain ofp begin
    @group_by(!!grouping_variable)
    @summarize(mean_age = mean(age * 10))
end

When using the !! notation, grouping_variable can also be a vector containing multiple symbols if you want to group by multiple variables. Read more about this here: https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/interpolation/

  1. Note that TidierData automatically vectorizes * so that the expression is converted to mean(age .* 10).When working with non-nested columns of data frames, most data analysis functions are usually intended to be vectorized. Rather than require users to individually vectorize each function, TidierData automatically takes care of the vectorization. While other packages provide a way to vectorize all functions, not all functions make sense to vectorize. For non-nested data, mean() should essentially never be vectorized. In other functions, such as a in b, a should be vectorized but b should not. See https://bkamins.github.io/julialang/2023/02/10/in.html for details on why this is the case. In TidierData, you can write a in b, and it will get converted to in.(a, Ref(Set(b))).

With all that said, TidierData remains fully configurable. Any function prefixed with a ~ will never get vectorized. You can also directly modify the list of functions not to vectorize (see in the link below). Any function that you vectorize manually, such as by writing mean.(), will remain vectorized. You can read more details about the behavior here: https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/autovec/

As a result of these concepts and many more syntactic sugar functions like across() and where(), very complicated multi-line code can often be reduced to something much more concise, readable, and understandable.

All of this is just scratching the surface of Tidier. It’s fully functional - it can handle all the common tasks of data analysis - pivoting, joins, nesting/unnesting, grouping, transformations, if_then/case_when logic, with full-on support for pipeable plotting and more. It’s built on top of the best-in-class Julia packages like DataFrames and Makie. If you’ve been waiting for Tidier to mature before trying it out, it’s ready for a look.

Tidier brings quite a bit of magic to data analysis in Julia – which some folks will love and others will not.

Acknowledgements: We owe a lot to the R tidyverse community for developing an API that we love and want to see more use of in Julia. We also rely heavily on DataFrames.jl (for TidierData), Makie.jl and AlgebraOfGraphics.jl (for TidierPlots), and many other packages.

47 Likes

Is it correct to say that Tidier is aimed at those coming from R? Asking as someone who had no (positive) experience with that language (:, but doing lots of data manipulation, analysis, and plotting in Julia. In particular, I see that comparisons in the docs are often done with tidyverse/ggplot/… or in that context.

2 Likes

Thanks @aplavin for the question.

I think the API will feel natural to folks coming from R’s tidyverse ecosystem, but there’s nothing in the API that is specific to R per se. There are implementations of the tidyverse API in Python (eg, Ibis, tidypolars, siuba, and others) that are also fairly nice to use. Even polars and pandas borrow some of these concepts in Python.

At this point, I’m mostly a Julia-first user. This is my invitation for other Julia users who have never used R to take a look at Tidier.jl.

5 Likes

Thanks! Then some comparisons with Julia would be useful, in addition to those with R.
Like, “this is how you do a plot with AoG.jl, and here see how much easier the same plot is with DataFrames and TidierPlots” (:

3 Likes

That’s a great suggestion. We can definitely add some documentation comparing the Tidier.jl way of doing common tasks to other commonly used Julia packages.

4 Likes

This is so great. The tidyverse system in R is so amazing and it’s one of the things I keep going back to R to use for all my table like data wrangling. Now I will not have to reach for R as often as before. :pray:t2::muscle:t2::tada:

5 Likes

I understand your negativity towards R at a certain level. But what I would also say is the R’s Tidyverse was a major advance in data science. Anything that extracts the elegance, ease of use the pure, mathy-granite-like-robustness of the Tidyverse and delivers it in a hot, new language like Julia, seems like a wonderful idea!

2 Likes

It’s less “negativity”, more “lack of positivity” (: I only used R a little bit during a uni course about a decade ago, that’s it. That just means comparisons with R are less understandable for me, and presumably many other Julia users.

A thing that I like about Julia, and that surprised me at first, is that there’s no strong need to use fancy and very specialized data structures (datatables, dataframes) for data processing. Simple arrays can take you a looong way! I personally never find myself in need of table-specialized data structures anymore…

But clearly, R/tidyverse is among the most popular data analysis tools in the world, together with Excel (definitely #1), and Python/Pandas (probably #2). So there are lots of opportunities to grab something from those ecosystems and adapt to Julia.

2 Likes

Announcing TidierData.jl v0.15.0

Inspired by DataFramesMeta.jl, TidierData.jl now supports begin-end blocks within all macros. This makes it easier to write parentheses-free code that spans multiple lines. This (optional) form also makes it easier to comment out portions of your code.

Here’s an example of a @filter with one of the conditions commented out:

@chain movies begin
  @filter begin
    # Year >= 2000
    Rating >= 9
  end
  @slice 1:5
  @select 1:5
end

5×5 DataFrame

Row Title Year Length Budget Rating
String Int32 Int32 Int32? Float64
1 +1 -1 1987 7 missing 9.4
2 100 Years at the Movies 1994 9 missing 9.2
3 13 Lakes 2004 135 missing 9.0
4 2wks, 1yr 2002 104 missing 9.4
5 500 Years Later 2005 106 missing 9.3

For a bit more explanation on how this works within the larger context of piping in TidierData.jl, check out the new docs page here: Piping - TidierData.jl (tidierorg.github.io)

8 Likes

Announcing TidierPlots v0.6.0

The biggest change is under the hood in that the package now relies directly on the new-and-evolving Makie API and no longer uses AoG as an intermediary. This will help us better implement the ggplot2 interface while also making the package more compatible with Makie.

PS. Shoutout to Randy Boyes for all the work on the package.

Changelog:

  • Dependency on AlgebraOfGraphics.jl removed
  • Support for positional aes specification (e.g. @aes(x, y) instead of @aes(x = x, y = y))
  • Support for horizontal bars in geom_bar and geom_col
  • All Makie plotting arguments are supported via passthrough (as either aes or args)
  • Breaks: geom_smooth, geom_contour, facet_wrap, facet_grid, all tests
1 Like