With the 1-year anniversary of Tidier.jl coming up soon, I wanted to put together an official (if slightly belated) package announcement that describes Tidier.jl and invites you to give it a try.
Tidier.jl originally began as a package for working with DataFrames, based on the dplyr and tidyr packages from the R tidyverse ecosystem. That portion of Tidier.jl has since been split off into its own package (TidierData.jl), with Tidier.jl focused much more broadly on implementing the entire R tidyverse ecosystem in Julia.
In its current form, Tidier is a meta-package intended for generating, analyzing, transforming, and visualizing data frames. Tidier contains and re-exports the following packages:
- TidierData: analyzing and transforming DataFrames
- TidierPlots: plotting data frames
- TidierCats: working with CategoricalArrays
- TidierDates: working with Dates
- TidierStrings: working with strings
- TidierText: text analysis within data frames
- TidierVest: harvesting websites and converting them into data frames
While its origins and inspirations come from R, Tidier is designed from the ground up for Julia. It is an opinionated package in that it diverges from some of the concepts established by other macro-based data analysis packages in Julia. Tidier is different not by accident, but by design in an attempt to be user-friendly and easy to use for data analysts. Tidier does bring a bit of magic because of its reliance on macros, but this is done with an eye on usability, and we are careful to ensure that users retain the ability to override the magic.
Let me show you a quick example focused only on TidierData to introduce you to some of the key concepts in the package.
We will use the Visits to Physician Office dataset, which is abbreviated as ofp.
using TidierData, RDatasets
ofp = dataset("Ecdat", "OFP")
ofp = @clean_names(ofp)
@chain ofp begin
    @group_by(region)
    @summarize(mean_age = mean(age * 10))
end
4×2 DataFrame
 Row │ region   mean_age
     │ Cat…     Float64
─────┼───────────────────
   1 │ other     73.987
   2 │ midwest   74.0769
   3 │ noreast   73.9343
   4 │ west      74.1165
Here, we are calculating the mean age for each region. We first apply the @clean_names() macro, which converts the column names to snake_case formatting for convenience. We then group by region and calculate the mean age. Because the age is stored in decades and we want the result in years, we multiply the age by 10.
A few things to note in this code:
- TidierData automatically re-exports the @chain macro from Chain.jl. This makes it easy to write data pipelines. There's no requirement to use it, but the docs and examples make heavy use of it. It also automatically re-exports the DataFrame() function from the DataFrames package.
- region and age are referred to in the code as "bare" names rather than as symbols (i.e., :region and :age).
This is intentional because it lets you write more concise code. The majority of names we refer to in data analysis pipelines refer to columns rather than external variables. You can think of code in TidierData as being within a "data frame" scope. If you want to refer to variables outside of the data frame, you can prefix the name with a !!.
For example, if you had a variable named grouping_variable that contained the value :region, you could rewrite the above code as:
grouping_variable = :region
@chain ofp begin
    @group_by(!!grouping_variable)
    @summarize(mean_age = mean(age * 10))
end
When using the !! notation, grouping_variable can also be a vector containing multiple symbols if you want to group by multiple variables. Read more about this here: https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/interpolation/
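As a rough sketch of the multi-variable case, the example below groups by two columns at once. The small data frame and its gender column are made up for illustration, and the @ungroup() at the end is only there so the result comes back as a plain data frame. It assumes a TidierData version in which !! interpolates global variables as described above.

```julia
using TidierData

# Hypothetical toy data; age is in decades, as in the ofp example
df = DataFrame(
    region = ["west", "west", "other", "other"],
    gender = ["male", "female", "male", "male"],
    age    = [7.0, 7.4, 6.9, 7.1],
)

# A vector of symbols defined outside the data frame scope
grouping_variables = [:region, :gender]

result = @chain df begin
    @group_by(!!grouping_variables)        # interpolates both column names
    @summarize(mean_age = mean(age * 10))  # one row per region/gender pair
    @ungroup()                             # return an ungrouped data frame
end
```

The toy data has three distinct region/gender combinations, so the summary should contain one row for each of them.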
- Note that TidierData automatically vectorizes * so that the expression is converted to mean(age .* 10).
When working with non-nested columns of data frames, most data analysis functions are usually intended to be vectorized. Rather than require users to individually vectorize each function, TidierData automatically takes care of the vectorization. While other packages provide a way to vectorize all functions, not all functions make sense to vectorize. For non-nested data, mean() should essentially never be vectorized. In other functions, such as a in b, a should be vectorized but b should not. See https://bkamins.github.io/julialang/2023/02/10/in.html for details on why this is the case. In TidierData, you can write a in b, and it will get converted to in.(a, Ref(Set(b))).
With all that said, TidierData remains fully configurable. Any function prefixed with a ~ will never get vectorized. You can also directly modify the list of functions not to vectorize (see the link below). Any function that you vectorize manually, such as by writing mean.(), will remain vectorized. You can read more details about the behavior here: https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/autovec/
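To make the conversions above concrete, here is a minimal sketch in plain Julia (no TidierData involved) of what the rewritten expressions do. The vectors age, region, and allowed are made-up stand-ins for data frame columns and an outside collection; the point is only to show which parts are broadcast and which are not.

```julia
using Statistics

# Made-up stand-ins for two data frame columns
age = [6.9, 7.4, 7.1, 8.0]                      # decades, as in the ofp example
region = ["west", "other", "midwest", "west"]

# mean(age * 10) becomes mean(age .* 10):
# `*` is broadcast over the column, while `mean` sees the whole column at once
mean_age = mean(age .* 10)

# region in allowed becomes in.(region, Ref(Set(allowed))):
# broadcast over `region`, but treat `allowed` as one collection via Ref
allowed = ["west", "midwest"]
is_allowed = in.(region, Ref(Set(allowed)))     # [true, false, true, true]

# Broadcasting over both sides, in.(region, allowed), would instead try to
# pair elements up and throw a DimensionMismatch here (lengths 4 and 2)
```

Wrapping the right-hand side in Ref() is what keeps it from being iterated during broadcasting, and converting it to a Set makes each membership check a hash lookup rather than a scan of the vector.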
As a result of these concepts and many more syntactic sugar functions like across() and where(), very complicated multi-line code can often be reduced to something much more concise, readable, and understandable.
All of this is just scratching the surface of Tidier. It's fully functional - it can handle all the common tasks of data analysis - pivoting, joins, nesting/unnesting, grouping, transformations, if_then/case_when logic, with full-on support for pipeable plotting and more. It's built on top of the best-in-class Julia packages like DataFrames and Makie. If you've been waiting for Tidier to mature before trying it out, it's ready for a look.
Tidier brings quite a bit of magic to data analysis in Julia, which some folks will love and others will not.
Acknowledgements: We owe a lot to the R tidyverse community for developing an API that we love and want to see more use of in Julia. We also rely heavily on DataFrames.jl (for TidierData), Makie.jl and AlgebraOfGraphics.jl (for TidierPlots), and many other packages.