Tidier !! interpolation of external variables

Invoking the @filter function in the TidierData package, I’m trying to use the !! syntax to refer to a variable not in the data frame, as described here:

If you want to refer to an object a that is defined outside of the data frame, then you can write !!a, which we refer to as “bang-bang interpolation.”

Trying this out, it works with some variables but not when the variable is a column of another data frame.

using DataFrames
using Tidier

First case - regular variable. The following example works.

julia> df = DataFrame(x = 1:5)
5×1 DataFrame
 Row │ x     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4
   5 │     5
julia> y = 5
5

julia> @filter(df, x != !!y)
4×1 DataFrame
 Row │ x     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4

Second case - the external variable is another data frame. This doesn’t work.

julia> df2 = DataFrame(y = 5)
1×1 DataFrame
 Row │ y     
     │ Int64 
─────┼───────
   1 │     5
julia> @filter(df, x != !!df2.y)
ERROR: UndefVarError: `df2` not defined in `TidierData`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
  [1] (::var"#22#24")(x::Vector{Int64})
    @ Main ~/.julia/packages/TidierData/Dw20l/src/parsing.jl:141
  [2] (::DataFrames.var"#610#611"{var"#22#24"})(x::Vector{Int64})
    @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/abstractdataframe/subset.jl:66
  [3] _transformation_helper(df::DataFrame, col_idx::Int64, ::Base.RefValue{Any})
    @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:562
  [4] select_transform!(::Base.RefValue{Any}, df::DataFrame, newdf::DataFrame, transformed_cols::Set{Symbol}, copycols::Bool, allow_resizing_newdf::Base.RefValue{Bool}, column_to_copy::BitVector)
    @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:805
  [5] _manipulate(df::DataFrame, normalized_cs::Vector{Any}, copycols::Bool, keeprows::Bool)
    @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1783
  [6] manipulate(df::DataFrame, cs::Any; copycols::Bool, keeprows::Bool, renamecols::Bool)
    @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1703
  [7] select(df::DataFrame, args::Any; copycols::Bool, renamecols::Bool, threads::Bool)
    @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1303
  [8] _get_subset_conditions(df::DataFrame, ::Base.RefValue{Any}, skipmissing::Bool, threads::Bool)
    @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/abstractdataframe/subset.jl:113
  [9] subset(df::DataFrame, args::Any; skipmissing::Bool, view::Bool, threads::Bool)
    @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/abstractdataframe/subset.jl:284
 [10] macro expansion
    @ ~/.julia/packages/TidierData/Dw20l/src/TidierData.jl:424 [inlined]
 [11] top-level scope
    @ REPL[8]:1

Is there something I’m missing?

Overall, it’s really intuitive coming from R and a nice set of packages here.

Thanks in advance.


I’ll reply a bit later with a more complete response. The short answer is that we have updated documentation on how to use native Julia interpolation here, which has the added benefit of covering all edge cases: Interpolation - TidierData.jl

We need to update the README to match this guidance.


In the latest version of TidierData.jl, we now recommend using native Julia interpolation rather than the !! interpolation recommended in earlier versions. This is because native Julia interpolation handles all of the edge cases, whereas the !! syntax is, for various parsing reasons, hard to make work in every situation.

Native Julia interpolation in expressions usually works via simple prefixing with $. However, TidierData.jl relies on “non-standard evaluation,” which means that column names are bare and not symbols (i.e., we refer to the column as x rather than :x). As a result, we can’t simply interpolate a symbol into an expression; we have to evaluate the symbol. The main takeaway is that in TidierData.jl, any expression that includes a $ also needs to be preceded by an @eval.
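To make the role of @eval concrete, here is a minimal sketch in plain Julia with no Tidier involved. The `@seen` macro is hypothetical and exists only for illustration: it returns the expression it receives as a string, so you can compare what a macro gets with and without @eval. It is not how TidierData.jl is implemented.

```julia
# Hypothetical macro, for illustration only: it returns the expression it
# receives as a string, so we can inspect what a macro sees at expansion time.
macro seen(ex)
    return string(ex)
end

y = 5

# Without @eval, the macro receives the bare name `y`, not its value.
without_eval = @seen(x != y)

# With @eval, `$y` is replaced by the value of `y` before @seen is expanded.
with_eval = @eval @seen(x != $y)

println(without_eval)  # x != y
println(with_eval)     # x != 5
```

The same mechanism is what lets `$y` reach a tidy-evaluation macro as a value rather than as something that looks like a column name.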

using Tidier

df = DataFrame(x = 1:5)
y = 5
df2 = DataFrame(y = 5)

@eval @filter(df, x != $y)
4×1 DataFrame
 Row │ x     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4

This also works with df2.y.

@eval @filter(df, x != $df2.y)
4×1 DataFrame
 Row │ x     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4

And if you’re using a chain, you need to prefix the entire chain with an @eval. For example:

@eval @chain df @filter(x != $df2.y)
4×1 DataFrame
 Row │ x     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4

The @eval does make the whole expression a bit verbose, but that’s the cost of our using non-standard evaluation.


@kdpsingh thanks for the fast response!

Leveraging Julia’s interpolation sounds good.

I wonder if at some point you might consider a version of Tidier (something like Tidier2) with Julia semantics, e.g., quoted variable names, $ for functions without @eval (if possible), etc. That would seem like a minor change, not in terms of adapting past code, but in terms of the learning curve for new Julia users.

In any case, coming from R many things have “just worked” and I’m very happy that I started learning Julia when Tidier was maturing.


I thought a lot about this early on but ultimately decided to stick with non-standard evaluation (which I also call “tidy evaluation”).

Using tidy eval makes it easy to do concise column selection, e.g., @select(df, 1:4, a:d) to select columns 1 through 4 and a through d, and this pattern works throughout the package. Had we gone with symbols, the a:d would end up looking uglier and longer to type (e.g., (:a):(:d)), and even then the ::d might get misread as a data type. Since this is such a commonly used pattern, I felt this was a higher priority to get right than the issue of interpolation.

Using symbols doesn’t completely resolve ambiguity because symbols are a valid argument type in Julia. So if you call a function within a macro that uses a symbol as an argument rather than to refer to a column, you still have to implement a mechanism to “escape” the symbol (e.g., DataFramesMeta.jl uses ^ as a prefix to escape symbols).
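As a rough illustration of the column-range point, plain Julia's parser can show how the two spellings look to a macro. This is only a sketch using `Meta.parse`; it says nothing about how TidierData.jl processes the result.

```julia
# Bare range, as tidy evaluation allows inside a macro call.
bare = Meta.parse("a:d")

# The same range written with symbols; the parentheses keep each symbol separate.
quoted = Meta.parse("(:a):(:d)")

# Both parse as a call to the `:` range operator with two arguments.
# In the bare form the arguments are plain names; in the quoted form they
# are QuoteNodes wrapping the symbols.
println(bare.head, " ", bare.args)
println(quoted.head, " ", quoted.args)
```

Either way, the macro receives a call to `:` and has to decide what the two endpoints mean, so the bare form costs nothing extra to support and is shorter to type.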

I’m glad to hear “it just works.” I think that has been our North Star when developing Tidier.


In this example, the following workaround seems to work as well:

julia> using Tidier

julia> df = DataFrame(x = 1:5);

julia> df2 = DataFrame(y = 5)
1×1 DataFrame
 Row │ y     
     │ Int64 
─────┼───────
   1 │     5

julia> @filter(df, x != !!esc(df2.y))
4×1 DataFrame
 Row │ x     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4

Using @eval has the huge disadvantage that it can’t be fully compiled ahead of time, i.e., it requires compilation at runtime on every call:

julia> using BenchmarkTools

julia> fun(df, y) = @filter(df, x != !!y)
fun (generic function with 1 method)

julia> gun(df, y) = @eval @filter(df, x != $y)
gun (generic function with 1 method)

julia> @btime fun($df, 4);
  55.023 μs (138 allocations: 6.43 KiB)

julia> @btime gun($df, 4);
  112.752 ms (120061 allocations: 8.31 MiB)

julia> @time fun(df, 4);
  0.000359 seconds (138 allocations: 6.430 KiB)

julia> @time gun(df, 4);
  0.162312 seconds (120.06 k allocations: 8.312 MiB, 91.04% compilation time)

# and again, i.e., compiles every time ...
julia> @time gun(df, 4);
  0.161877 seconds (120.06 k allocations: 8.312 MiB, 91.00% compilation time)
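A related caveat, sketched here in plain Julia without Tidier: @eval evaluates its expression in the module's global scope, so any name that is not prefixed with $ is looked up as a global, not as a function argument. The names `scale`, `f`, and `g` below are made up for the illustration.

```julia
scale = 10  # a global that happens to share its name with the function argument

# No `$`: the expression `scale * 2` is evaluated in global scope,
# so it uses the global `scale`, not the argument.
f(scale) = @eval scale * 2

# With `$`: the argument's value is spliced into the expression first.
g(scale) = @eval $scale * 2

println(f(1))  # 20, because the global scale = 10 is used
println(g(1))  # 2, because the argument value 1 is spliced in
```

By the same rule, in a definition like gun(df, y) above, only $y refers to the argument; a bare df inside the @eval would resolve to whatever global df exists.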

Thanks @bertschi. This is good motivation for us to take another look at tweaking our interpolation (or documentation) to handle more of these cases through our parsing engine before resorting to @eval.

Great point. fwiw, this also works without the !!. But I’d love to get the parsing right so that the !! also just works. I’m sure this is fixable.

julia> @filter(df, x != esc(df2.y))
4×1 DataFrame
 Row │ x     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4
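For anyone curious why the two !! cases can differ at the parsing level, here is a small sketch in plain Julia. It only shows how Julia's own parser reads the two forms; how TidierData.jl's parsing code treats them is not shown and is not assumed here.

```julia
# `!!` is not a single operator in Julia; it parses as two nested calls to `!`.
simple = Meta.parse("!!y")
dotted = Meta.parse("!!df2.y")

# Strip the two `!` calls to get at what is being interpolated.
innermost(ex) = ex.args[2].args[2]

println(innermost(simple))          # y
println(typeof(innermost(simple)))  # Symbol
println(innermost(dotted))          # df2.y
println(typeof(innermost(dotted)))  # Expr
```

So !!y wraps a plain Symbol, while !!df2.y wraps a field-access expression, and a macro that wants to support both has to handle both shapes.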

Ok, fair point. Julia already raises an error for the : in ternary expressions when it isn’t surrounded by spaces, precisely to avoid ambiguity, so it would be possible to add a rule like that. But you’ve obviously considered many more of these scenarios, so I’m rolling with it. I just saw that you included Julian conventions (following those of Makie) in TidierPlots, so I thought you might be considering something similar for TidierData. But having too many conventions also becomes a problem, both to maintain and for users to choose between.
