Trying to understand macro scope (Tidier package)

Very excited about the direction of the Tidier package as I am coming from R. However, I’m not understanding some of the errors I’m getting and I believe it has to do with the scope of some macros and variables/functions not being found.

I am using these packages.

using DataFrames
using CategoricalArrays
using Tidier

Example data frame.

df = DataFrame(x = 1:4, y = [1,4,3,2], z = ["a", "b", "c", "c"])

Example 1. Subsetting

This works.

value_ = 1
df2 = subset(df, :x => x -> x .== value_)

However, Tidier’s @filter macro cannot find value_.

df3 = @chain df begin
    @filter(x .== value_)
end

ArgumentError: column name :value_ not found in the data frame

Example 2. Creating categorical variables.

This works.

df2 = copy(df)
df2.z = categorical(df2.z, levels = ["a", "b", "c"])

However, using @mutate in Tidier does not.

df3 = @chain df begin
    @mutate(z2 = categorical(z, levels = ["a", "b", "c"]))
end

MethodError: no method matching categorical(::String; levels::Vector{String})

I think in both cases it’s not recognizing the existence of variables or functions in the global scope or in a module.

I’m still getting familiar with Julia - what am I doing wrong?


For the first problem, Tidier.jl treats unquoted symbols such as x as column names by default. To refer to a variable that is not a column, you need !! interpolation, i.e. @filter(df, x .== !!value_)

As a side note, this only works in global scope. You can’t actually do this inside a function, since Tidier.jl uses eval at the end of the day to execute the expression, rather than obeying normal macro semantics. So if you were in a function, you would have to do @eval(Main, value_ = $value_) in order to use this value.
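To illustrate why an eval-based implementation cannot see local variables, here is a minimal sketch using only Base Julia. The helper eval_in_main and all variable names are hypothetical; the code only mimics the behavior described above and is not Tidier.jl's actual implementation.

```julia
# Hypothetical helper: evaluate an expression in Main's global scope,
# roughly what an eval-based macro ends up doing.
eval_in_main(ex) = Core.eval(Main, ex)

function lookup_without_copy()
    local_threshold = 2              # exists only inside this function
    try
        eval_in_main(:(local_threshold + 1))
    catch err
        err                          # eval cannot see the local variable
    end
end

function lookup_with_copy()
    local_threshold = 2
    # Workaround from the post above: copy the local into a global first.
    @eval Main threshold_copy = $local_threshold
    eval_in_main(:(threshold_copy + 1))
end

failed = lookup_without_copy()   # an UndefVarError
worked = lookup_with_copy()      # 3
```

The workaround functions, but it leaves a global behind in Main, which is one reason eval inside macros is usually discouraged.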

For the second problem, Tidier.jl runs operations by row, not at the column level. I’m not sure how to do a column-level operation in Tidier.jl.

Not to de-rail the conversation, but DataFramesMeta.jl, which I maintain, does not have these characteristics.

julia> df = DataFrame(x = 1:4, y = [1,4,3,2], z = ["a", "b", "c", "c"]);

julia> value_ = 1;

julia> @chain df begin
           @rsubset :x == value_
       end
1×3 DataFrame
 Row │ x      y      z      
     │ Int64  Int64  String 
─────┼──────────────────────
   1 │     1      1  a

julia> @chain df begin
           @transform :z2 = categorical(:z, levels = ["a", "b", "c"])
       end
4×4 DataFrame
 Row │ x      y      z       z2   
     │ Int64  Int64  String  Cat… 
─────┼────────────────────────────
   1 │     1      1  a       a
   2 │     2      4  b       b
   3 │     3      3  c       c
   4 │     4      2  c       c

Thanks - given that the scoping rules are different for Tidier macros vs others… it seems tricky to use at the moment. It’s strange that operations would be row-wise given that arrays are column-major - I thought operations would be faster when applied to whole columns (which is often the desired type of operation).

Anyway, maybe I should give DataFramesMeta another look. There’s a bit of a learning curve coming from R/pandas so I was hoping for a quicker transition, but it seems better integrated with the rest of Julia.


This isn’t particularly true. Loops in Julia are fast, as is broadcasting.

I have a tutorial for users coming from dplyr here, emulating the penguins tutorial that is popular. But Tidier.jl has the advantage of sticking to the dplyr API very closely. If it works well for you other than this issue, I would encourage you to continue using it.


I do prefer the dplyr convention but I now see the value of operating on symbols.

I’m not sure about the row-wise operators in DataFramesMeta though. Even the DataFrames documentation refers to them, but I don’t know when you need them - is it for string operations? Coming from R, string operations are a pain point since everything is vectorized (hence operates column-wise) there.

It’s interesting you say you prefer column-wise operations. Most people are the opposite.

Yes, it’s so you can work with strings and other scalars more easily. You don’t have to worry about broadcasting.

More subtly,

  1. There are actually some under-the-hood performance improvements to ByRow and specialization.
  2. It makes working with many columns at once easier, for example, a row-sum of a subset of columns
  3. It can make handling missing values easier, since you can put @passmissing in front of an expression and it forwards missing values when called row-wise.
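The missing-value point can be sketched in plain Julia. The pass_missing helper below is a hand-written, hypothetical stand-in for the idea of forwarding missing values; it is not the @passmissing macro itself.

```julia
# Hypothetical helper: wrap a scalar function so that missing passes through.
pass_missing(f) = x -> ismissing(x) ? missing : f(x)

labels = ["ada", missing, "grace"]

# Applied row by row, i.e. broadcast over the elements of the column.
# Calling uppercase on the missing entry directly would throw a MethodError.
shouted = pass_missing(uppercase).(labels)
```

Because the wrapped function only ever sees one element at a time, the missing check stays a simple scalar test.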

Tidier really shouldn’t be using eval in macros. That’s almost always a really bad idea.


I appreciate the comments here and wanted to share my perspective on some of the design decisions as the creator of Tidier.jl (for which the data analysis portion now sits within TidierData.jl).

I’ll talk primarily about the use of bare column names (instead of symbols), how vectorization is handled by Tidier.jl, and touch on the issues related to scoping and why they exist (for now). I also touch on some of these points in my JuliaCon talk, which is posted as part of a live stream but not as an individual talk yet. That may shed some light on at least some of these design decisions, but I’ll provide some additional context here.

Why bare column names and not symbols?

The short answer is that R’s tidyverse uses bare column names, which in R is referred to as “non-standard evaluation.” The purpose of Tidier.jl is to implement tidyverse syntax as a domain-specific language within Julia, so we stick with bare column names.

While the use of bare column names isn’t idiomatic in the context of existing data frames packages, I would argue it’s not unusual either. For example, when you define a data frame, you write:

df = DataFrame(a = 1:10, b = 11:20)

Here, we refer to the column names as bare column names. Certainly, there are ways to use symbols when defining data frames, but I just point out that it’s not that unusual overall — it’s unusual because other macro-based packages use symbols.

The use of bare column names opens up some cool syntax. For example, in Tidier.jl, if you want to calculate a mean across multiple columns, you can write:

@chain df begin
    @summarize(across(a:d, mean))
end

This is concise and allows you to refer to column names almost as if they were unit ranges. This code will calculate a mean across all the columns between a and d (inclusive of a and d). If a had to be referred to as a symbol, this kind of shorthand would require a different syntax altogether, and I made the design decision that we would stick with the tidyverse syntax so we could support this pseudo-unit-range-like syntax and other related shorthand.
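As a rough illustration of why bare names make a range like a:d possible: Julia parses a:d into a call expression that a macro can inspect before anything is evaluated. The function columns_between below is hypothetical and written only for this sketch; it is not how TidierData.jl is implemented.

```julia
# Hypothetical helper: resolve a bare `first:last` expression against column names.
function columns_between(column_names::Vector{Symbol}, range_expr::Expr)
    # :(a:d) is parsed as the call Expr(:call, :(:), :a, :d)
    @assert range_expr.head == :call && range_expr.args[1] == :(:)
    start_index = findfirst(==(range_expr.args[2]), column_names)
    stop_index = findfirst(==(range_expr.args[3]), column_names)
    return column_names[start_index:stop_index]
end

selected = columns_between([:a, :b, :c, :d, :e], :(a:d))   # [:a, :b, :c, :d]
```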

Additionally, most operations in data frames refer to columns of data (adding them together, calculating summary statistics on them, etc), so in my experience it’s a much more pleasant experience when typing to not have to constantly add :s before each column name. Again, this is a personal preference but is a convention common in R and SQL.

The use of symbols can also present a problem because functions can take symbols as arguments. When a function takes a symbol as an argument, it can create ambiguity to someone reading the code as to whether the symbol refers to a column name or to a symbol being provided to the function. I haven’t tested this issue in other macro-based packages, so I’m not saying it’s a bug — just that the use of symbols also comes with some ambiguity when reading code. Neither is a perfect solution.

How Tidier.jl handles vectorization

The statement that Tidier.jl converts all functions to row-wise operations is not correct, and Tidier.jl gives you full control of which functions to run row-wise vs. column-wise. However, it’s absolutely true that Tidier.jl implements “auto-vectorization” that converts certain functions to happen row-wise.

This behavior is documented here: Auto-vectorization - TidierData.jl

The general principle is that certain functions are typically conducted row-wise (such as adding two columns together), so Tidier.jl automatically converts + into .+ inside of all macros except for @summarize and its alias @summarise.

This means that if you write the following code, the a + b gets “auto-vectorized” into a .+ b by Tidier.jl.

@chain df begin
    @mutate(c = a + b)
end
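The rewriting idea can be sketched as a small expression walker in Base Julia. Everything here is an assumption made for illustration (the NOT_VECTORIZED list, the autovectorize function, and its rules); TidierData.jl's real parser is more involved.

```julia
# Hypothetical list of functions that should see whole columns.
const NOT_VECTORIZED = [:mean, :categorical]

# Toy rewriter: dot every call unless the function is in NOT_VECTORIZED
# or is already a dotted operator such as .+
function autovectorize(ex)
    ex isa Expr || return ex                        # symbols and literals pass through
    args = map(autovectorize, ex.args)
    ex.head == :call || return Expr(ex.head, args...)
    fn, rest = args[1], args[2:end]
    if !(fn isa Symbol) || fn in NOT_VECTORIZED || startswith(string(fn), ".")
        return Expr(:call, fn, rest...)             # leave the call untouched
    elseif Base.isoperator(fn)
        return Expr(:call, Symbol(".", fn), rest...)    # a + b  ->  a .+ b
    else
        return Expr(:., fn, Expr(:tuple, rest...))      # f(a)   ->  f.(a)
    end
end

sum_expr = autovectorize(:(a + b))             # :(a .+ b)
centered_expr = autovectorize(:(a - mean(a)))  # :(a .- mean(a))
```

An explicitly dotted call such as mean.(a) has a different expression head (:. rather than :call), so this toy walker leaves it as written.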

On the other hand, what if you wanted to subtract a variable by its mean value?

If you wrote @mutate(b = a - mean(a)), the - would get auto-vectorized, but the mean() would not. This is because within the context of a transform, you’d almost never want to vectorize the mean() function. The row-wise mean is just the same as the original value of each row, which wouldn’t make any sense to calculate.

So what if you wanted to vectorize mean? You could write @mutate(b = a - mean.(a)), and the mean will be vectorized. If you explicitly indicate you want to vectorize something, Tidier.jl will not interfere with it.
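The difference between the two forms is easy to check on a plain vector, outside of any data frame macro. This sketch uses only the Statistics standard library; the variable names are illustrative.

```julia
using Statistics

a = [1.0, 2.0, 3.0]

centered = a .- mean(a)    # mean sees the whole column: [-1.0, 0.0, 1.0]
rowwise = a .- mean.(a)    # mean of each scalar is the scalar itself: [0.0, 0.0, 0.0]
```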

What if you don’t want to vectorize a function? There are two ways to handle this. You can either prefix the function with a tilde (~), which marks that function for Tidier.jl as one that should not be vectorized. You can also add any user-defined functions to the “do-not-vectorize” array by pushing those functions to the array (see line 29 of the TidierData.jl file for details). I’m planning to expose this capability through a function and to document it, though the use of the tilde is already documented.

This auto-vectorization behaves the same across all macros except for @summarize, which never does any auto-vectorization. This part is the same decision made by the DataFrameMacros.jl package, which coincidentally matches the behavior of tidyverse.

Both R and SQL use essentially the same defaults for what is vectorized and what isn’t. All we’ve done in Tidier.jl is implement those behaviors as defaults. This leads to more concise code for exploratory data analysis while still giving you the ability to change the underlying behavior by marking functions that you want to ensure are not vectorized.

Not sure how Tidier.jl is vectorizing your code? You can use TidierData_set("code", true), and the generated DataFrames.jl code will be printed to the REPL for you to examine.

Scoping

Not everything in Tidier.jl is scoped differently. For example, you can refer to any user-defined function as long as it is in the scope that can be seen by the macro. It doesn’t have to be in the global scope.

However, if you are referring to a value that is not a column in the data frame, then you have to mark it by prefixing it with a !!. This comes from interpolation syntax from tidyverse. We don’t use the default Julia interpolation syntax because of the additional syntactic sugar we need to add before the expression is evaluated. For example, if you have @mutate(a = a + pi), that will assume that both a and pi refer to column names. If you are instead referring to the value of pi and not a column name, then you can either write @mutate(a = a + !!pi) or @mutate(a = a + Main.pi). Again, the + gets auto-vectorized by Tidier.jl to ensure that the addition happens element-wise.
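To make the column-versus-variable distinction concrete, here is a toy macro in Base Julia that treats bare symbols as columns of a NamedTuple and treats !!name as a variable from the caller's scope. The names rewrite_columns, is_bang_bang, @with_columns, and keep_matching are all hypothetical, and the sketch resolves !! with esc rather than eval, so it is not a description of how TidierData.jl itself works.

```julia
# Toy rewriter: bare symbols become column lookups, `!!name` becomes an
# escaped reference to a variable in the caller's scope.
function rewrite_columns(ex, table)
    if ex isa Symbol
        return Expr(:call, :getproperty, table, QuoteNode(ex))
    elseif ex isa Expr && ex.head == :call
        if is_bang_bang(ex)
            return esc(ex.args[2].args[2])
        end
        rewritten = [rewrite_columns(arg, table) for arg in ex.args[2:end]]
        return Expr(:call, ex.args[1], rewritten...)
    end
    return ex   # literals are left alone
end

# `!!name` parses as !(!(name)), i.e. two nested calls to `!`.
is_bang_bang(ex) =
    ex.args[1] == :! && length(ex.args) == 2 &&
    ex.args[2] isa Expr && ex.args[2].head == :call &&
    ex.args[2].args[1] == :! && length(ex.args[2].args) == 2 &&
    ex.args[2].args[2] isa Symbol

macro with_columns(table, ex)
    return rewrite_columns(ex, esc(table))
end

function keep_matching(table, value_)
    mask = @with_columns(table, x .== !!value_)   # value_ is a local here
    return table.x[mask]
end

kept = keep_matching((x = 1:4, y = [1, 4, 3, 2]), 1)   # [1]
```

Because the interpolated name is escaped instead of evaluated globally, the toy macro works inside a function without any global variables.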

Right now, if you wanted to refer to a variable in the local scope, there isn’t a way to do it without defining it as a global variable. This is a limitation (for now), and it’s mentioned right in the documentation: Interpolation - TidierData.jl. This is essentially the only place in the entire package where we rely on eval(); all of the other functionality within TidierData.jl is accomplished through pure interpolation. I think I have a fix that will resolve this problem (and remove the need to use eval()), so this likely isn’t going to be a permanent limitation.

tl;dr

Tidier.jl is an opinionated domain-specific language. For folks who’ve used tidyverse or SQL, I think the design decisions actually make code more convenient, concise, and readable. I fully acknowledge that in doing so it behaves differently than pure Julia code, but Tidier.jl gives you control if you want to write pure Julia code — any function you vectorize will not be un-vectorized by TidierData.jl, and anything you mark as not being vectorized will not be vectorized.

This post is in no way a commentary on DataFramesMeta.jl. I think DFMeta is great. I just wanted to communicate that the design decisions in Tidier.jl are intentional and not a result of an accident or a lack of knowledge around the related packages in this space.

The package is also 7 months old, so a lot of development has focused on getting this package feature-complete and speedy (by minimizing extra ops). There is still room to further evolve, and I think you’ll see further improvements to the parsing engine in the coming months.


Interesting - it’s a different way of thinking I have to get used to. In that everything is a vector in R, and an R data frame is a list of vectors, many of the operations you describe (handling of missing values, summing across rows) are handled on column vectors of the data frame so I don’t think of it as row-wise operations.

Thanks for providing Tidier and these detailed answers - this is certainly an attractive option coming from years with R/tidyverse. Many decisions made there I either like a lot or have gotten used to, so Tidier is a promising bridge into Julia (I’ve also been teaching with a lot of R/tidyverse materials so I was originally planning to convert them with Tidier - this could certainly cut down on the time for translation).

I see that the symbol/bare column names decision has merits on both sides so I guess I don’t feel too strongly there.

Not sure if you prefer that I ask more details here or on the Tidier git repo, but is there a way to make CategoricalArrays.categorical() work within the @mutate macro? It seems like I want it vectorized but it is trying to convert it to row-wise if I understand correctly. Using ~ is for the opposite case when I want to prevent autovectorization, but is there a way to force it?


@hatmatrix, you have hit on a really important point. Whereas nearly all functions in R are written to directly handle vectors as arguments (since everything in R is a vector), most functions in Julia are written to work on scalars. To make them work on vectors, you have to vectorize them.

For example, addition only works on scalars. If you want to add vectors together in an element-wise fashion, you normally have to vectorize the + function to .+ to make it work on vectors. If you have a user-defined function named please_add() which is defined to add scalar values, you can convert this to please_add.() to make it work on vectors. The addition of the . isn’t just syntactic sugar; it works on any function, and the vectorization is handled by the Julia compiler.
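A short sketch of the please_add() case in plain Julia. The Number type annotations are an assumption added here so that the function really is scalar-only.

```julia
# Written with scalars in mind; the Number annotations are an added assumption.
please_add(x::Number, y::Number) = x + y

left = [1, 2, 3]
right = [10, 20, 30]

# The dot broadcasts the scalar function over both vectors,
# equivalent to broadcast(please_add, left, right).
elementwise = please_add.(left, right)   # [11, 22, 33]
```

Calling please_add(left, right) without the dot would raise a MethodError, since no method accepts vectors.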

There are some functions that are inherently designed to work on vectors. One of those is mean() and another is CategoricalArrays.categorical(). The reason for categorical() is fairly straightforward. The function needs to see the full vector so it can infer what the levels are in the categorical array — the same way that factor() and as_factor() work in R.
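The MethodError from the original question can be reproduced in miniature. The function level_codes below is a hypothetical stand-in for a column-level function like categorical(); it needs the whole vector to work out the levels, so broadcasting it over individual strings fails in the same way.

```julia
# Hypothetical stand-in for a column-level function such as categorical():
# it must see the whole column to infer the levels.
function level_codes(column::AbstractVector)
    levels = sort(unique(column))
    return [findfirst(==(value), levels) for value in column]
end

z = ["a", "b", "c", "c"]

codes = level_codes(z)           # whole column at once: [1, 2, 3, 3]

# Broadcasting hands the function one String at a time, which has no method.
broadcast_error = try
    level_codes.(z)
catch err
    err
end
```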

So the question you asked was, is there a way to let categorical() work inside of @mutate().

The answer is yes. For right now, you can prefix it with a tilde. I thought that you didn’t need a tilde because categorical() should be on the list of functions not to vectorize. However, when I checked line 29 of this file (https://github.com/TidierOrg/TidierData.jl/blob/56c41425177d83758cb9e1ba679a8d806afd0f04/src/TidierData.jl#L29), I see that as_categorical() and is_categorical() are in the not_vectorized array, but categorical() currently is not. I’m going to open an issue to fix this. If you don’t want to use a tilde, you can use as_categorical() instead, which comes from the TidierCats.jl package (which is similar to the R forcats package). See the documentation for TidierCats.jl here: GitHub - TidierOrg/TidierCats.jl: 100% Julia implementation of the forcats R package. TidierCats is automatically re-exported by Tidier.jl.

So this should work if a is a string.

@chain df begin
    @mutate(a = ~categorical(a))
end

We will fix it so the tilde isn’t required.

This also works…


@chain df begin
    @mutate(a = as_categorical(a))
end

And to answer your last question, you don’t generally need to force auto-vectorization. Unless a function is specifically listed in that not_vectorized array, it is automatically vectorized by Tidier. And if you did want to vectorize a function that is listed in the not_vectorized array (e.g., mean()), you can explicitly vectorize it by adding a period at the end (e.g., mean.()), and Tidier will respect the vectorization.

Hope that makes sense. Happy to clarify. Feel free to respond here or to open an issue on TidierData.jl if you run into roadblocks or get stuck.

Also, don’t forget to check out our other helper packages that bring additional R tidyverse functionality into Julia.

For lubridate, stringr, forcats, and rvest, we’ve implemented TidierDates.jl, TidierStrings.jl, TidierCats.jl, and TidierVest.jl, respectively.


I also teach a tidyverse course (ML4LHS Lab - LHS 610) and plan to eventually convert that to Tidier as well.


I’m not fully sure what you intend to express with this phrase, but it seems useful to note that dplyr and SQL have fairly different semantics regarding any formal concept of “vectorization”. Some relevant contrasts:

  1. Vectorization exists as an implicitly defined core concept in R, but does not exist at all as a concept in SQL. For example, the words “vectorized” and “vectorization” occur zero times in the SQL 2003 standard, but do occur in the R language definition despite never being formally defined.
  2. In dplyr, expressions mostly behave as if you operated on entire columns as atomic vectors. In SQL, expressions mostly behave as if you operated in isolation on each row.
  3. When the SQL community uses the word “vectorization”, they almost always use it to refer to exploiting CPU-level vectorization primitives like SIMD.
  4. In SQL, any given function is explicitly defined as either an aggregation function or not, so there is no ambiguity whether aggregation across rows is intended because the function definition itself specifies this. (This is a tricky concept to map into Julia because of the function/method distinction in Julia that SQL lacks.)

To point out why these distinctions matter, let’s combine your b = a - mean(a) example with the assumption that we’re operating on a table in which the column a has elements of array type. Imagine the following situation in Julia:

using DataFrames

df = DataFrame(a = [[1, 2, 3], [4, 5, 6]])

What could b = a .- mean(a) mean here?

  1. You first compute the mean across all rows (producing a 3-element array), then subtract each row’s vector against that 3-element array. So something like broadcast(-, a, Ref(mean(a))) if a is a top-level vector.
  2. You compute the mean in each row in isolation across the 3 elements of each array, then subtract the single resulting number per row from each element in that row’s vector. So something like map(r -> r .- mean(r), a) if a is a top-level vector.
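Both readings can be run on a plain vector of vectors to see how much they differ. This sketch uses only the Statistics standard library and takes the two expressions from the list above; the variable names are illustrative.

```julia
using Statistics

a = [[1, 2, 3], [4, 5, 6]]

# Interpretation 1: mean across rows (a 3-element vector), subtracted from each row.
across_rows = broadcast(-, a, Ref(mean(a)))
# [[-1.5, -1.5, -1.5], [1.5, 1.5, 1.5]]

# Interpretation 2: mean within each row, subtracted from that row's elements.
within_rows = map(r -> r .- mean(r), a)
# [[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]]
```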

I push on this kind of contrast between dplyr and SQL because I suspect you will need to deal with this at some point in Tidier.jl given that Julia’s more expressive type system will lead users to create tables like the one shown above, whereas dplyr will not need to confront the same problem given R’s less flexible type system.


Great point, @johnmyleswhite. You can ignore my reference to SQL — I was primarily referring to vectorization in the R way, where R functions usually operate on vectors rather than scalars (because there are no scalars).

In dplyr, I think we would first unnest the data before running a line equivalent to a .- mean(a). We haven’t implemented the @unnest_* macros in Tidier.jl yet but this is on the roadmap (and there are open issues related to this).

Tidier.jl aims to implement the dplyr approach to most things, including nested data. This inherently means that it will be limited in some ways, but as someone who has used dplyr for years, I think it covers many, if not most, use cases.

Again, it’s a matter of personal preference, and as data structures get more complex and nested, R users would normally resort to using the purrr package, which we have not yet implemented in Tidier.jl (and probably won’t implement).


Big update thanks to some initial code in a PR from @vchuravy:

  • Variable scoping for interpolation has been fixed and no longer relies on eval(). This means that interpolation works smoothly inside functions and within loops and no longer relies on global variables. See the updated documentation for examples here: Interpolation - TidierData.jl
  • categorical() is now marked as a non-vectorized function so you no longer need to prefix it with a tilde.

Bumped the version for TidierData.jl to 0.12.0, and a release is on its way to the Julia registry.


@kdpsingh, Thanks a lot.

So I’ve gotten the original example to work with the ~ while I wait for the next version.

using DataFrames
using Tidier
using CategoricalArrays
using Printf

df = DataFrame(a = [.1, .2, .1, .2], b = ["x", "y", "z", "x"])

## this works
df2 = @chain df begin
    @mutate(b = ~categorical(b, levels = ["x", "y", "z"]))
end

However, I have encountered a new problem.

## this doesn't work
df2 = @chain df begin
    @mutate(a_label = ~categorical(map(x -> @sprintf("Var = %.1f", x), a)))
end

I’m still not sure about the scoping rules, since the last block of code returns:

LoadError: UndefVarError: @sprintf not defined

I can’t seem to use “interpolation” (!!) on the anonymous function or macro to get @mutate to recognize it. Outside of @chain/mutate it works:

julia> categorical(map(x -> @sprintf("Var = %.1f", x), df.a))
4-element CategoricalArray{String,1,UInt32}:
 "Var = 0.1"
 "Var = 0.2"
 "Var = 0.1"
 "Var = 0.2" 

Anyone have an idea how to make this work here? Thanks!


Very nice course! My course focuses less on programming itself and more on having students generate output that they can interpret, but the main advantage of Julia would be combining data analysis with scientific modeling tools so I can cover these topics with a single language. Of course Python resides in this space currently, but the data analysis dimension is less straightforward compared to R, and Julia seems closer to R in this regard (in terms of clarity of ideas through syntax).

This is a 2 language problem that is rarely mentioned, but it’s the reason I’m here!


Indeed! I also frame it as the other 2-language problem. This article was linked on Hacker News recently, and I thought this would also be a good place for Julia to step in (in the pedagogy of implementing physical models).

Thanks for sharing an example. TidierData.jl struggles here because it has difficulty escaping and vectorizing the @sprintf macro. The inability to escape it is why the error says the macro can’t be found. However, if you wrap this macro inside a function, everything works as expected.

temp_fn(x) = @sprintf("Var = %.1f", x)

@chain df begin
  @mutate(a_label = ~categorical(temp_fn(a)))
end

And if you update your version of TidierData.jl to 0.12.0 (which is already on the registry), then you can remove the tilde prefix from categorical.

Under the hood, what’s happening here is that temp_fn() is being converted into temp_fn.() (the vectorized form), so you don’t need to use map() to iterate over it. And categorical() is being left as-is because it is prefixed by a tilde (and because in the latest version, TidierData.jl knows it shouldn’t be vectorized).
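Outside of any Tidier macro, the equivalence between the broadcast form and the original map() can be checked directly. This sketch uses only the Printf standard library and the values from the example above.

```julia
using Printf

# Wrapping the macro call in a function gives something that can be broadcast.
temp_fn(x) = @sprintf("Var = %.1f", x)

a = [0.1, 0.2, 0.1, 0.2]

labels_broadcast = temp_fn.(a)                         # the dotted form of temp_fn(a)
labels_map = map(x -> @sprintf("Var = %.1f", x), a)    # the original formulation
```

Both produce the same four strings, which is why the map() wrapper is no longer needed once the function is vectorized.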