[ANN] (Belatedly) Announcing Tidier.jl

kdpsingh · January 24, 2024, 10:24pm

With the 1-year anniversary of Tidier.jl coming up soon, I wanted to put together an official (if slightly belated) package announcement describing Tidier.jl to invite you to give it a try.

Tidier.jl originally began as a package intended for working with DataFrames based on the R dplyr and tidyr packages as part of the R tidyverse ecosystem. That portion of Tidier.jl has since been split off into its own package (TiderData.jl), with Tidier.jl focused much more broadly on implementing the entire R tidyverse ecosystem in Julia.

In its current form, Tidier is a meta-package intended for generating, analyzing, transforming, and visualizing data frames. Tidier contains and re-exports the following packages:

TidierData: analyzing and transforming DataFrames
TidierPlots: plotting data frames
TidierCats: working with CategoricalArrays
TidierDates: working with Dates
TidierStrings: working with strings
TidierText: text analysis within data frames
TidierVest: harvesting websites and converting them into data frames

While its origins and inspirations come from R, Tidier is designed from the ground up for Julia. It is an opinionated package in that it diverges from some of the concepts established by other macro-based data analysis packages in Julia. Tidier is different not by accident, but by design in an attempt to be user-friendly and easy to use for data analysts. Tidier does bring a bit of magic because of its reliance on macros, but this is done with an eye on usability, and we are careful to ensure that users retain the ability to override the magic.

Let me show you a quick example focused only on TidierData to introduce you to some of the key concepts in the package.

We will use the Visits to Physician Office dataset, which is abbreviated as ofp.

using TiderData, RDatasets
ofp = dataset("Ecdat", "OFP")
ofp = @clean_names(ofp)

@chain ofp begin
    @group_by(region)
    @summarize(mean_age = mean(age * 10))
end

4×2 DataFrame
 Row │ region   mean_age 
     │ Cat…     Float64  
─────┼───────────────────
   1 │ other     73.987
   2 │ midwest   74.0769
   3 │ noreast   73.9343
   4 │ west      74.1165

Here, we are calculating the mean age for each region. We first apply the @clean_names() macro, which converts the column names to snake_case formatting for convenience. We then group by region and calculate the mean age. Because the age is stored in decades and we want the result in years, we have multiply the age by 10.

A couple things to note in this code:

TidierData automatically re-exports the @chain macro from Chain.jl. This makes it easy to write data pipelines. There’s no requirement to use it, but the docs and examples make heavy use of it. It also automatically re-exports the DataFrame() function from the DataFrames package.
region and age are referred to in the code as “bare” names rather than as symbols (i.e., :region and :age).

This is intentional because it lets you write more concise code. The majority of names we refer to in data analysis pipelines refer to columns rather than external variables. You can think of code in TidierData as being within a “data frame” scope. If you want to refer to variables outside of the data frame, you can prefix the name with a !!.

For example, if you had a variable named grouping_variable that contained the value :region, you could rewrite the above code as:

grouping_variable = :region

@chain ofp begin
    @group_by(!!grouping_variable)
    @summarize(mean_age = mean(age * 10))
end

When using the !! notation, grouping_variable can also be a vector containing multiple symbols if you want to group by multiple variables. Read more about this here: https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/interpolation/

Note that TidierData automatically vectorizes * so that the expression is converted to mean(age .* 10).When working with non-nested columns of data frames, most data analysis functions are usually intended to be vectorized. Rather than require users to individually vectorize each function, TidierData automatically takes care of the vectorization. While other packages provide a way to vectorize all functions, not all functions make sense to vectorize. For non-nested data, mean() should essentially never be vectorized. In other functions, such as a in b, a should be vectorized but b should not. See https://bkamins.github.io/julialang/2023/02/10/in.html for details on why this is the case. In TidierData, you can write a in b, and it will get converted to in.(a, Ref(Set(b))).

With all that said, TidierData remains fully configurable. Any function prefixed with a ~ will never get vectorized. You can also directly modify the list of functions not to vectorize (see in the link below). Any function that you vectorize manually, such as by writing mean.(), will remain vectorized. You can read more details about the behavior here: https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/autovec/

As a result of these concepts and many more syntactic sugar functions like across() and where(), very complicated multi-line code can often be reduced to something much more concise, readable, and understandable.

All of this is just scratching the surface of Tidier. It’s fully functional - it can handle all the common tasks of data analysis - pivoting, joins, nesting/unnesting, grouping, transformations, if_then/case_when logic, with full-on support for pipeable plotting and more. It’s built on top of the best-in-class Julia packages like DataFrames and Makie. If you’ve been waiting for Tidier to mature before trying it out, it’s ready for a look.

Tidier brings quite a bit of magic to data analysis in Julia – which some folks will love and others will not.

Acknowledgements: We owe a lot to the R tidyverse community for developing an API that we love and want to see more use of in Julia. We also rely heavily on DataFrames.jl (for TidierData), Makie.jl and AlgebraOfGraphics.jl (for TidierPlots), and many other packages.

aplavin · January 24, 2024, 11:08pm

Is it correct to say that Tidier is aimed at those coming from R? Asking as someone who had no (positive) experience with that language (:, but doing lots of data manipulation, analysis, and plotting in Julia. In particular, I see that comparisons in the docs are often done with tidyverse/ggplot/… or in that context.

kdpsingh · January 24, 2024, 11:27pm

Thanks @aplavin for the question.

I think the API will feel natural to folks coming from R’s tidyverse ecosystem, but there’s nothing in the API that is specific to R per se. There are implementations of the tidyverse API in Python (eg, Ibis, tidypolars, siuba, and others) that are also fairly nice to use. Even polars and pandas borrow some of these concepts in Python.

At this point, I’m mostly a Julia-first user. This is my invitation for other Julia users who have never used R to take a look at Tidier.jl.

aplavin · January 24, 2024, 11:33pm

Thanks! Then some comparisons with Julia would be useful, in addition to those with R.
Like, “this is how you do a plot with AoG.jl, and here see how much easier the same plot is with DataFrames and TidierPlots” (:

kdpsingh · January 24, 2024, 11:56pm

That’s a great suggestion. We can definitely add some documentation comparing the Tidier.jl way of doing common tasks to other commonly used Julia packages.

DoktorMike · January 25, 2024, 5:57pm

This is so great. The tidyverse system in R is so amazing and it’s one of the things I keep going back to R to use for all my table like data wrangling. Now I will not have to reach for R as often as before.

Tobe_Freeman · January 30, 2024, 10:24am

I understand your negativity towards R at a certain level. But what I would also say is the R’s Tidyverse was a major advance in data science. Anything that extracts the elegance, ease of use the pure, mathy-granite-like-robustness of the Tidyverse and delivers it in a hot, new language like Julia, seems like a wonderful idea!

aplavin · January 30, 2024, 6:36pm

It’s less “negativity”, more “lack of positivity” (: I only used R a little bit during a uni course about a decade ago, that’s it. That just means comparisons with R are less understandable for me, and presumably many other Julia users.

A thing that I like about Julia, and that surprised me at first, is that there’s no strong need to use fancy and very specialized data structures (datatables, dataframes) for data processing. Simple arrays can take you a looong way! I personally never find myself in need of table-specialized data structures anymore…

But clearly, R/tidyverse is among the most popular data analysis tools in the world, together with Excel (definitely #1), and Python/Pandas (probably #2). So there are lots of opportunities to grab something from those ecosystems and adapt to Julia.

kdpsingh · February 26, 2024, 4:35pm

Announcing TidierData.jl v0.15.0

Inspired by DataFramesMeta.jl, TidierData.jl now supports begin-end blocks within all macros. This makes it easier to write parentheses-free code that spans multiple lines. This (optional) form also makes it easier to comment out portions of your code.

Here’s an example of a @filter with one of the conditions commented out:

@chain movies begin
  @filter begin
    # Year >= 2000
    Rating >= 9
  end
  @slice 1:5
  @select 1:5
end

5×5 DataFrame

Row	Title	Year	Length	Budget	Rating
	String	Int32	Int32	Int32?	Float64
1	+1 -1	1987	7	missing	9.4
2	100 Years at the Movies	1994	9	missing	9.2
3	13 Lakes	2004	135	missing	9.0
4	2wks, 1yr	2002	104	missing	9.4
5	500 Years Later	2005	106	missing	9.3

For a bit more explanation on how this works within the larger context of piping in TidierData.jl, check out the new docs page here: Piping - TidierData.jl (tidierorg.github.io)

kdpsingh · February 28, 2024, 1:06am

Announcing TidierPlots v0.6.0

The biggest change is under the hood in that the package now relies directly on the new-and-evolving Makie API and no longer uses AoG as an intermediary. This will help us better implement the ggplot2 interface while also making the package more compatible with Makie.

PS. Shoutout to Randy Boyes for all the work on the package.

Changelog:

Dependency on AlgebraOfGraphics.jl removed
Support for positional aes specification (e.g. @aes(x, y) instead of @aes(x = x, y = y))
Support for horizontal bars in geom_bar and geom_col
All Makie plotting arguments are supported via passthrough (as either aes or args)
Breaks: geom_smooth, geom_contour, facet_wrap, facet_grid, all tests

kdpsingh · April 7, 2024, 11:28pm

Announcing TidierDB.jl v0.1.0 (pending registry approval)

We are excited to release TidierOrg/TidierDB.jl, a package focused on bringing the syntax of Tidier.jl to multiple SQL backends, making it possible to analyze data directly on databases without needing to copy the entire database into memory.

Currently supported backends include:

DuckDB (the default) set_sql_mode(:duckdb)
ClickHouse set_sql_mode(:clickhouse)
SQLite set_sql_mode(:lite)
MySQL set_sql_mode(:mysql)
MSSQL set_sql_mode(:mssql)
Postgres set_sql_mode(:postgres)

TidierDB.jl supports the following top-level macros:

@arrange
@group_by
@filter
@select
@mutate, which supports across()
@summarize and @summarise, which supports across()
@distinct
@left_join, @right_join, @inner_join (slight syntax differences from TidierData.jl)
@count
@slice_min, @slice_max, @slice_sample
@window_order and window_frame
@show_query
@collect

Supported helper functions for most backends include:

across()
desc()
if_else() and case_when()
n()
starts_with(), ends_with(), and contains()
as_float(), as_integer(), and as_string()
is_missing()
missing_if() and replace_missing()

From TidierStrings.jl:

str_detect, str_replace, str_replace_all, str_remove_all, str_remove

From TidierDates.jl:

year, month, day, hour, min, second, floor_date, difftime

Supported aggregate functions (varying by backend):

mean, minimium, maximum, std, sum, cumsum, cor, cov, var
copy_to (for DuckDB, MySQL, SQLite)

DuckDB specifically enables copy_to to directly read in .parquet, .json, .csv, and .arrow files, including https file paths.

path = "file_path.parquet"
copy_to(conn, file_path, "table_name")

What is the recommended way to use TidierDB?

Typically, you will want to use TidierDB alongside TidierData because there are certain functionality (such as pivoting) which are only supported in TidierData and can only be performed on data frames.

Our recommended path for using TidierDB is to import the package so that there are no namespace conflicts with TidierData. Once TidierDB is integrated with Tidier, then Tidier will automatically load the packages in this fashion.

First, let’s develop and execute a query using TidierDB. Notice that all top-level macros and functions originating from TidierDB start with a DB prefix. Any functions defined within macros do not need to be prefixed within DB because they are actually pseudofunctions that are in actuality converted into SQL code.

Even though the code reads similarly to TidierData, note that no computational work actually occurs until you run DB.@collect(), which runs the SQL query and instantiates the result as a DataFrame.

using TidierData
import TidierDB as DB

mem = DB.duckdb_open(":memory:");
db = DB.duckdb_connect(mem);
path = "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv"
DB.copy_to(db, path, "mtcars");

@chain DB.db_table(db, :mtcars) begin
    DB.@filter(!starts_with(model, "M"))
    DB.@group_by(cyl)
    DB.@summarize(mpg = mean(mpg))
    DB.@mutate(mpg_squared = mpg^2, 
               mpg_rounded = round(mpg), 
               mpg_efficiency = case_when(
                                 mpg >= cyl^2 , "efficient",
                                 mpg < 15.2 , "inefficient",
                                 "moderate"))            
    DB.@filter(mpg_efficiency in ("moderate", "efficient"))
    DB.@arrange(desc(mpg_rounded))
    DB.@collect
end

2×5 DataFrame
 Row │ cyl     mpg       mpg_squared  mpg_rounded  mpg_efficiency 
     │ Int64?  Float64?  Float64?     Float64?     String?        
─────┼────────────────────────────────────────────────────────────
   1 │      4   27.3444      747.719         27.0  efficient
   2 │      6   19.7333      389.404         20.0  moderate

What if we wanted to pivot the result?

We cannot do this using TidierDB. However, we can call @pivot_longer() from TidierData after the result of the query has been instantiated as a DataFrame, like this:

@chain DB.db_table(db, :mtcars) begin
    DB.@filter(!starts_with(model, "M"))
    DB.@group_by(cyl)
    DB.@summarize(mpg = mean(mpg))
    DB.@mutate(mpg_squared = mpg^2, 
               mpg_rounded = round(mpg), 
               mpg_efficiency = case_when(
                                 mpg >= cyl^2 , "efficient",
                                 mpg < 15.2 , "inefficient",
                                 "moderate"))            
    DB.@filter(mpg_efficiency in ("moderate", "efficient"))
    DB.@arrange(desc(mpg_rounded))
    DB.@collect
    @pivot_longer(everything(), names_to = "variable", values_to = "value")
end

10×2 DataFrame
 Row │ variable        value     
     │ String          Any       
─────┼───────────────────────────
   1 │ cyl             4
   2 │ cyl             6
   3 │ mpg             27.3444
   4 │ mpg             19.7333
   5 │ mpg_squared     747.719
   6 │ mpg_squared     389.404
   7 │ mpg_rounded     27.0
   8 │ mpg_rounded     20.0
   9 │ mpg_efficiency  efficient
  10 │ mpg_efficiency  moderate

What SQL query does TidierDB generate for a given piece of Julia code?

We can replace DB.collect() with DB.@show_query to reveal the underlying SQL query being generated by TidierDB. To handle complex queries, TidierDB makes heavy use of Common Table Expressions (CTE), which are a useful tool to organize long queries.

@chain DB.db_table(db, :mtcars) begin
    DB.@filter(!starts_with(model, "M"))
    DB.@group_by(cyl)
    DB.@summarize(mpg = mean(mpg))
    DB.@mutate(mpg_squared = mpg^2, 
               mpg_rounded = round(mpg), 
               mpg_efficiency = case_when(
                                 mpg >= cyl^2 , "efficient",
                                 mpg < 15.2 , "inefficient",
                                 "moderate"))            
    DB.@filter(mpg_efficiency in ("moderate", "efficient"))
    DB.@arrange(desc(mpg_rounded))
    DB.@show_query
end

WITH cte_1 AS (
SELECT *
        FROM mtcars
        WHERE NOT (starts_with(model, 'M'))),
cte_2 AS (
SELECT cyl, AVG(mpg) AS mpg
        FROM cte_1
        GROUP BY cyl),
cte_3 AS (
SELECT  cyl, mpg, POWER(mpg, 2) AS mpg_squared, ROUND(mpg) AS mpg_rounded, CASE WHEN mpg >= POWER(cyl, 2) THEN 'efficient' WHEN mpg < 15.2 THEN 'inefficient' ELSE 'moderate' END AS mpg_efficiency
        FROM cte_2 ),
cte_4 AS (
SELECT *
        FROM cte_3
        WHERE mpg_efficiency in ('moderate', 'efficient'))  
SELECT *
        FROM cte_4  
        ORDER BY mpg_rounded DESC

TidierDB is already quite fully-featured, supporting advanced TidierData functions like `across()` for multi-column selection.

@chain DB.db_table(db, :mtcars) begin
    DB.@group_by(cyl)
    DB.@summarize(across((starts_with("a"), ends_with("s")), (mean, sum)))
    DB.@collect
end

3×5 DataFrame
 Row │ cyl     mean_am   mean_vs   sum_am   sum_vs  
     │ Int64?  Float64?  Float64?  Int128?  Int128? 
─────┼──────────────────────────────────────────────
   1 │      4  0.727273  0.909091        8       10
   2 │      6  0.428571  0.571429        3        4
   3 │      8  0.142857  0.0             2        0

Bang-bang !! interpolation for columns and values is also supported.

There are a few subtle but important differences from Tidier.jl outlined here.

Missing a function or backend?

You can use any existing SQL function within @mutate with the correct SQL syntax and it should just work.

But if you run into problems please open an issue, and we will be happy to take a look!

DoktorMike · April 8, 2024, 5:44am

This is such a joy to use. I’m used to dbplyr from the R ecosystem which I frequently grabbed for when needing to do analysis on data residing in PostgreSQL. A very welcome addition to my toolbox in Julia!

mrufsvold · April 8, 2024, 4:20pm

Working on a presentation for the R User group at my work to demonstrate how awesome Tidier.jl and Julia are. I’m not a Tidyverse user myself, but Tidier.jl is such a joy to use. Thank you!

co1emi11er · April 8, 2024, 9:31pm

This is such a great package!

I had somewhat of an off-topic question when it relates to the output… I have seen in the docs and here that the output of the dataframe has syntax-highlighting or colored outputs. I love that. I am using OhMyREPL but I am not getting that in outputs. Is that something that can be done automatically in the REPL?

kdpsingh · April 12, 2024, 6:53am

Announcing TidierFiles.jl v0.1.0 (pending registry approval)

TidierFiles brings a consistent interface to the reading and writing of tabular data, including a consistent syntax to read files locally versus from the web (http/https) and consistent keyword arguments across data formats.

It provides a wrapper around CSV.jl, XLSX.jl, and ReadStatTables.jl. It’s narrower in scope as compared to FileIO.jl but provides better synchronization of keyword arguments across wrapped packages.

Currently supported file types:

read_csv and write_csv
read_tsv and write_tsv
read_xlsx and write_xlsx
read_delim and write_delim
read_table and write_table
read_fwf
read_sav and write_sav (.sav and .por)
read_sas and write_sas (.sas7bdat and .xpt)
read_dta and write_dta (.dta)

kdpsingh · April 12, 2024, 2:34pm

This is a great question. The colors are because of Documenter.jl performing code and output styling for the documentation. In the REPL, you’re right that outputs aren’t colored. I wonder if someone has looked at this, but maybe it’s doable?

I wonder if @bkamins knows if there’s a way to get colored output for DataFrames.

bkamins · April 12, 2024, 2:43pm

See Home · Pretty Tables

hatmatrix · June 9, 2024, 9:22pm

Hi @kdpsingh

Thanks a lot for these updates - really happy to see the progress on these packages! At every turn the decisions seem well thought-through so wanted to ask - is there a semantic origin of ~ for escaping vectorization in Julia? I’m only familiar with a similar use of $ in preventing broadcasting of a function.

Tamas_Papp · July 30, 2024, 12:17pm

Reading the source of some TidierOrg packages, I have noticed that frequently docstrings are defined in a separate source file, not at the function definition. I am just curious why this is done.

kdpsingh · July 30, 2024, 4:11pm

Early on, we spent a lot of time making sure that the docstrings were consistent across functions, and that examples were reused across macros wherever appropriate. Having the docstrings all in one place made this easier to do.

I had come across other packages that also did this, so it’s not unique to Tidier. Once the base package used this style, the other ones also adopted it.

Topic		Replies	Views
Trying to understand macro scope (Tidier package) General Usage	20	1723	September 11, 2023
What's the latest and greatest in data in Julia Data	29	2131	August 15, 2024
What's the current (spring 2024) canonical approach to data science in Julia? General Usage dataframes	34	4186	April 8, 2024
[ANN] SQLCollections.jl – use Julia data manipulation functions for databases Package Announcements query , database	31	1280	October 28, 2024
Common API for tabular data backends Data	44	2656	August 28, 2020