Evaluate a formula inside a dataframes to create a new column

I have a dataset that includes a column which has a formula.
The formula is a recipe to get a new column from existing columns.

Essentially the data looks like the following

> df = DataFrame(l=1:8, a=[1, 1, 1, 2, 2, 1, 1, 2],
   b=repeat([2, 1], outer=[4]),c=1:8,
   formula = [repeat([":b+:c"], 3); repeat([":a+:c"], 5)])
   
8×5 DataFrame
 Row │ l      a      b      c      formula
     │ Int64  Int64  Int64  Int64  String
─────┼─────────────────────────────────────
   1 │     1      1      2      1  :b+:c
   2 │     2      1      1      2  :b+:c
   3 │     3      1      2      3  :b+:c
   4 │     4      2      1      4  :a+:c
   5 │     5      2      2      5  :a+:c
   6 │     6      1      1      6  :a+:c
   7 │     7      1      2      7  :a+:c
   8 │     8      2      1      8  :a+:c

I am trying to create a new column which would have either the value a b+c or a+c as given by the column named formula.
This column is a string and depending on preprocessing looks like a+b or :a+:b (could also be :a+:b-:c).

The result I am looking for is something like:

df = DataFrame(l=1:8, a=[1, 1, 1, 2, 2, 1, 1, 2],
           b=repeat([2, 1], outer=[4]),c=1:8,
           formula = [repeat([":b+:c"], 3); repeat([":a+:c"], 5)],
           value = [3,3,5,6,7,7,8,10])
8×6 DataFrame
 Row │ l      a      b      c      formula  value
     │ Int64  Int64  Int64  Int64  String   Int64
─────┼────────────────────────────────────────────
   1 │     1      1      2      1  :b+:c        3
   2 │     2      1      1      2  :b+:c        3
   3 │     3      1      2      3  :b+:c        5
   4 │     4      2      1      4  :a+:c        6
   5 │     5      2      2      5  :a+:c        7
   6 │     6      1      1      6  :a+:c        7
   7 │     7      1      2      7  :a+:c        8
   8 │     8      2      1      8  :a+:c       10

I have tried to use metaprogramming after splitting the table groupby(df, :formula), but I run into the issue that the scope of eval is global, so the formula and the subdataframes are not properly evaluated.

This is related to this SO post.

Thanks for the help!

ifelse.(dr.formula.== Symbol("a+b"), df.a .+ df.b, df.a .+ df.c)

Thanks!
Technically this could work but I am dealing with something like 50 different formulas which is why I am trying to use macros.

I don’t understand how macros could be of help here, the formulas are values but macros operate on expressions?

Do the formulas have some structure or could they be anything?

The formulas use existing column names represented as Strings or Symbols as the operands and simple binary operators (+, -,etc), as I understand it.

1 Like

The equivalent in R using data.table would be to do:

R> dt = data.table(l=1:8, a=c(1, 1, 1, 2, 2, 1, 1, 2),
   b=rep(c(2, 1), 4), c=1:8,
   formula = c(rep("b+c", 3), rep("a+c", 5)))
   
R > dt[, value:=eval(parse(text=formula)), by = .(l)]   

where the formula is parsed and then evaluated.

My initial idea was to group the data.frame by formulas, parse the formula and create a macro to create the new value, but alas the subdataframe could not be evaluated in the macro because it was in local scope.

As your formula depends on runtime information, you will need to use eval one way or the other. Instead of trying to evaluate the formula code directly – which would happen in the global environment – you can build a function and call that later on. Here is a simple example (assuming that you know which data frame columns to use before hand):

using DataFrames, MacroTools
df = ...  # copy your example here
function makeFun(formula)
    expr = MacroTools.postwalk(e -> if e isa QuoteNode; e.value else e end, Meta.parse(formula))
    eval(:((a, b, c) -> $expr))
end
# Now use as follows
funs = makeFun.(df.formula)
zip(df.a, df.b, df.c) .|> splat.(funs)

Obviously, you can extend the approach by further meta-programming to take the arguments from a given data frame …

2 Likes

Thanks!
I think this works. I have adjusted it so that I don’t have to reference the arguments directly and can select specific columns that show up in the formulas.

using DataFrames, MacroTools
df = DataFrame(l=1:8, a=[1, 1, 1, 2, 2, 1, 1, 2],
   b=repeat([2, 1], outer=[4]),c=1:8,
   formula = [repeat([":b+:c"], 3); repeat([":a+:c"], 5)])

# some form of column selector that includes at least the columns that will show up in the formulas
col_arguments_idx = 1:4; 

function makeFun(formula)
    expr = MacroTools.postwalk(e -> if e isa QuoteNode; e.value else e end, Meta.parse(formula))
    eval(Meta.parse("(" * join(names(df)[col_arguments_idx], ",") * ") -> $expr"))
end

funs = makeFun.(df.formula)
df.value .= eachrow(df[:, 1:4]) .|> Tuple .|> splat.(funs);

I wonder if this task can be done similarly to what DataFrameMacros uses, when using expressions like
@transform(df, :d=:a+:b) to create new columns.
If you want to use different expressions, you can proceed to a partition first, using the groupby() function.

Would a brutal shortcut like this be acceptable for your cases?

function makeFun1(formula)
    expr = replace(formula, ':'=>"")
    eval(Meta.parse("(" * join(names(df)[col_arguments_idx], ",") * ") -> $expr"))
end

Good that you could make it work. Just a few remarks:

  1. Please don’t use string manipulation for constructing expressions, Julia has dedicated data types for this purpose. Here is a classic why meta-programming should not be based on strings: Why Lisp macros are cool, a Perl perspective

  2. Don’t have makeFun depend on global state, better pass all required information explicitly.

Here is how I would extend the function:

function makeFun(argnames, formula)
    expr = MacroTools.postwalk(e -> if e isa QuoteNode; e.value else e end,
                               Meta.parse(formula))
    args = Expr(:tuple, Symbol.(argnames)...)
    eval(:($args -> $expr))
end

col_arguments_idx = 1:4
funs = makeFun.(Ref(names(df)[col_arguments_idx]), df.formula)
eachrow(df[:, col_arguments_idx]) .|> Tuple .|> splat.(funs);
1 Like

A simple way:

df.out = [eval(Meta.parse(replace(df.formula[i], ":"=>"df[$i,:]."))) for i in 1:nrow(df)]

Yes, but again manipulating strings is brittle:

df = DataFrame(a = 1:2, b = 3:4, formula = "sum(:a:(:b+1))")
[eval(Meta.parse(replace(df.formula[i], ":"=>"df[$i,:]."))) for i in 1:nrow(df)]
[eval(MacroTools.postwalk(e -> if e isa QuoteNode; :(getindex(df, $i, $e)) else e end, Meta.parse(df.formula[i]))) for i in 1:nrow(df)]

Yes, they are suitable for simple minds like mine :sweat:

Thank you.

I guess this is a little mysterious to me.
I am wondering how I can learn/understand this code to reapply it in different context. I find the macro documentation not particularly enlightening.

TBH, I learned macros in Lisp where they are slightly easier as the data structures involved are simpler, i.e., reducing the impedance mismatch between the code you see and its internal representation. If you want to go this route, On Lisp is still a great read.
Anyways, let’s start with the famous REPL, i.e., read-eval-print-loop which quite accurately describes what is happening when you interact with Julia:

  1. Your code is read from the terminal (as text) and parsed into an internal data structure
  2. This data (representing your code) is then evaluated – which in Julia involves compiling and then executing the code
  3. The result is printed to you on the terminal, i.e., converted into a textual representation again.
  4. Now loop, i.e., go back to step 1

A macro is basically a function that can act between steps 1 and 2, i.e., it can transform the data structure representing code before it is passed for evaluation. Let’s have a closer look at this using the Julia REPL itself:

julia> txt = "a = b + 1"  # code as text
"a = b + 1"
julia> expr = Meta.parse(txt)  # parse into data
:(a = b + 1)
julia> Meta.dump(expr)  # Look at data in more detail => its a tree containing Expr nodes, Symbols, Numbers etc
Expr
  head: Symbol call
  args: Array{Any}((3,))
    1: Symbol +
    2: Symbol a
    3: Expr
      head: Symbol call
      args: Array{Any}((3,))
        1: Symbol *
        2: Symbol b
        3: Int64 2
julia> using MacroTools
julia> MacroTools.postwalk(e -> @show(e), expr)  # walk that tree in post-order
e = :a
e = :+
e = :b
e = 1
e = :(b + 1)
e = :(a = b + 1)
:(a = b + 1)
# Now let's write a simple transformation which inserts code for printing whenever hitting a call expression
julia> function _tracecall(e)
           if e isa Expr && e.head == :call
               Expr(:block,
                    Expr(:println, "Calling ", string(e)),
                    e)
           else
               e
           end
       end
julia> _tracecall(expr)  # not a call expr
:(a = b + 1)
julia> _tracecall(expr.args[2])  # but this one is
quote
    println("Calling ", "b + 1")
    b + 1
end
julia> MacroTools.postwalk(_tracecall, expr)  # Transform every call in the tree
:(a = begin
          println("Calling ", "b + 1")
          b + 1
      end)
# Turn this into a macro
julia> macro tracecalls(e) MacroTools.postwalk(_tracecall, e) end
@tracecalls (macro with 1 method)
julia> @tracecalls a = b + 1
Calling b + 1
11

Now, fix this macro to work properly on multiple nested calls and simplify the construction of the expression using Interpolation

2 Likes