I appreciate the comments here and wanted to share my perspective on some of the design decisions as the creator of Tidier.jl (for which the data analysis portion now sits within TidierData.jl).
I’ll talk primarily about the use of bare column names (instead of symbols), how vectorization is handled by Tidier.jl, and touch on the issues related to scoping and why they exist (for now). I also touch on some of these points in my JuliaCon talk, which is posted as part of a live stream but not as an individual talk yet. That may shed some light on at least some of these design decisions, but I’ll provide some additional context here.
Why bare column names and not symbols?
The short answer is that R’s tidyverse uses bare column names, which in R is referred to as “non-standard evaluation.” The purpose of Tidier.jl is to implement tidyverse syntax as a domain-specific language within Julia, so we stick with bare column names.
While the use of bare column names isn’t idiomatic in the context of existing data frames packages, I would argue it’s not unusual either. For example, when you define a data frame, you write:
df = DataFrame(a = 1:10, b = 11:20)
Here, we refer to the column names as bare column names. Certainly, there are ways to use symbols when defining data frames, but my point is that bare names are not that unusual overall. They only seem unusual because other macro-based packages use symbols.
The use of bare column names opens up some cool syntax. For example, in Tidier.jl, if you want to calculate a mean across multiple columns, you can write:
@chain df begin
@summarize(across(a:d, mean))
end
This is concise and allows you to refer to column names almost as if they were unit ranges. This code will calculate a mean across all the columns between `a` and `d` (inclusive of `a` and `d`). If `a` had to be referred to as a symbol, this kind of range selection would need a different syntax altogether, and I made the design decision that we would stick with the tidyverse syntax so we could support this pseudo-unit-range-like syntax and other related shorthand.
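To make the range idea concrete, here is a minimal sketch in plain Julia of how a macro could read `a:d` as a span of columns. This is not TidierData.jl's actual implementation; `expand_range` and the `cols` vector are hypothetical names invented for this illustration.

```julia
# A macro receives `a:d` as an unevaluated expression: a call to `:` with
# the two bare names as arguments. That is enough to select columns by position.

# Hypothetical helper: expand a bare `first:last` expression into the column
# names it spans, given the table's column order.
function expand_range(ex::Expr, names::Vector{Symbol})
    ex.head == :call && ex.args[1] == :(:) || error("expected first:last")
    i = findfirst(==(ex.args[2]), names)
    j = findfirst(==(ex.args[3]), names)
    (i === nothing || j === nothing) && error("unknown column")
    return names[i:j]
end

cols = [:id, :a, :b, :c, :d, :e]      # assumed column order
selected = expand_range(:(a:d), cols) # the columns from a through d
```

With symbols, `:a::d` would not parse as a range at all, which is why a symbol-based syntax would need a different spelling for the same selection.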
Additionally, most operations in data frames refer to columns of data (adding them together, calculating summary statistics on them, etc.), so in my experience it’s a much more pleasant experience when typing to not have to constantly add `:` before each column name. Again, this is a personal preference but is a convention common in R and SQL.
The use of symbols can also present a problem because functions can take symbols as arguments. When a function takes a symbol as an argument, it can create ambiguity to someone reading the code as to whether the symbol refers to a column name or to a symbol being provided to the function. I haven’t tested this issue in other macro-based packages, so I’m not saying it’s a bug — just that the use of symbols also comes with some ambiguity when reading code. Neither is a perfect solution.
How Tidier.jl handles vectorization
The statement that Tidier.jl converts all functions to row-wise operations is not correct, and Tidier.jl gives you full control of which functions to run row-wise vs. column-wise. However, it’s absolutely true that Tidier.jl implements “auto-vectorization” that converts certain functions to happen row-wise.
This behavior is documented here: Auto-vectorization - TidierData.jl
The general principle is that certain functions are typically conducted row-wise (such as adding two columns together), so Tidier.jl automatically converts `+` into `.+` inside of all macros except for `@summarize` and its alias `@summarise`.
This means that if you write the following code, the `a + b` gets “auto-vectorized” into `a .+ b` by Tidier.jl.
@chain df begin
@mutate(c = a + b)
end
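As a rough illustration of what such a rewrite involves, the sketch below walks a Julia expression and replaces arithmetic operators with their dotted forms. It is a toy covering only four operators, not TidierData.jl's parser; `auto_vectorize` and `OPERATORS` are hypothetical names.

```julia
# Operators this sketch rewrites; the real package covers far more cases.
const OPERATORS = (:+, :-, :*, :/)

# Hypothetical helper: recursively turn `a + b` into `a .+ b`.
function auto_vectorize(ex)
    ex isa Expr || return ex                      # symbols and literals pass through
    args = Any[auto_vectorize(x) for x in ex.args] # rewrite nested expressions first
    if ex.head == :call && args[1] in OPERATORS
        args[1] = Symbol(".", args[1])            # e.g. :+ becomes :.+
    end
    return Expr(ex.head, args...)
end

rewritten = auto_vectorize(:(c = a + b))
```

Function calls that are not in the operator list, such as `mean(a)`, pass through this sketch unchanged, and an operator that is already dotted is left alone.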
On the other hand, what if you wanted to subtract a variable by its mean value?
If you wrote `@mutate(b = a - mean(a))`, the `-` would get auto-vectorized, but the `mean()` would not. This is because within the context of a transform, you’d almost never want to vectorize the `mean()` function. The row-wise mean is just the original value of each row, which wouldn’t make any sense to calculate.
So what if you wanted to vectorize `mean`? You could write `@mutate(b = a - mean.(a))`, and the `mean` will be vectorized. If you explicitly indicate you want to vectorize something, Tidier.jl will not interfere with it.
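The difference between the two spellings is easy to check on a plain vector, outside of any macro. The sketch below uses only the `Statistics` standard library; the variable names are made up for the example.

```julia
using Statistics

a = [1.0, 2.0, 3.0, 4.0]

# Column-wise mean, element-wise subtraction: centers the column.
centered = a .- mean(a)

# Broadcasting mean over the elements: the mean of a single number is the
# number itself, so every entry cancels out.
rowwise = a .- mean.(a)
```

Here `centered` holds each value's deviation from the column mean, while `rowwise` is all zeros, which is why vectorizing `mean` is rarely what you want in a transform.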
What if you don’t want to vectorize a function? There are two ways to handle this. You can prefix the function with a tilde (`~`), which marks that function for Tidier.jl as one that should not be vectorized. You can also add any user-defined functions to the “do-not-vectorize” array by pushing those functions to the array (see line 29 of the TidierData.jl file for details). I’m planning to expose this capability through a function and to document it, though the use of the tilde is already documented.
This auto-vectorization behaves the same across all macros except for `@summarize`, which never does any auto-vectorization. This part is the same decision made by the DataFrameMacros.jl package, which coincidentally matches the behavior of tidyverse.
Both R and SQL use essentially the same defaults for what is vectorized and what isn’t. All we’ve done in Tidier.jl is implement those behaviors as defaults. This leads to more concise code for exploratory data analysis while still giving you the ability to change the underlying behavior by marking functions that you want to ensure are not vectorized.
Not sure how Tidier.jl is vectorizing your code? You can use `TidierData_set("code", true)`, and the generated DataFrames.jl code will be printed to the REPL for you to examine.
Scoping
Not everything in Tidier.jl is scoped differently. For example, you can refer to any user-defined function as long as it is in the scope that can be seen by the macro. It doesn’t have to be in the global scope.
However, if you are referring to a value that is not a column in the data frame, then you have to mark it by prefixing it with `!!`. This comes from the interpolation syntax in tidyverse. We don’t use the default Julia interpolation syntax because of the additional syntactic sugar we need to add before the expression is evaluated. For example, if you have `@mutate(a = a + pi)`, that will assume that both `a` and `pi` refer to column names. If you are instead referring to the value of `pi` and not a column name, then you can either write `@mutate(a = a + !!pi)` or `@mutate(a = a + Main.pi)`. Again, the `+` gets auto-vectorized by Tidier.jl to ensure that the addition happens element-wise.
Right now, if you wanted to refer to a variable in the local scope, there isn’t a way to do it without defining it as a global variable. This is a limitation (for now), and it’s mentioned right in the documentation: Interpolation - TidierData.jl. This is essentially the only place in the entire package where we rely on `eval()`; all of the other functionality within TidierData.jl is accomplished through pure interpolation. I think I have a fix that will resolve this problem (and remove the need to use `eval()`), so this likely isn’t going to be a permanent limitation.
tl;dr
Tidier.jl is an opinionated domain-specific language. For folks who’ve used tidyverse or SQL, I think the design decisions actually make code more convenient, concise, and readable. I fully acknowledge that in doing so it behaves differently than pure Julia code, but Tidier.jl gives you control if you want to write pure Julia code — any function you vectorize will not be un-vectorized by TidierData.jl, and anything you mark as not being vectorized will not be vectorized.
This post is in no way a commentary on DataFramesMeta.jl. I think DFMeta is great. I just wanted to communicate that the design decisions in Tidier.jl are intentional and not a result of an accident or a lack of knowledge around the related packages in this space.
The package is also 7 months old, so a lot of development has focused on getting this package feature-complete and speedy (by minimizing extra ops). There is still room to further evolve, and I think you’ll see further improvements to the parsing engine in the coming months.