Complex DataFrames.jl transform by row

I’m a bit rusty on data table programming and I am getting stuck using the DataFrames.jl package when trying to apply a complex function to every row of a DataFrame. For example, I would like to take an entire row and use a bunch of different columns (imagine all of them) in a complex calculation to produce a new column.
I have tried:
transform!(groupby(data, :uniqueID), complex_calc) and I think it does what I want, adding a new column called x1, but if I try to pass a name using:
transform!(groupby(data, :uniqueID), complex_calc => "newname") it gives the error:

ArgumentError: invalid index: var"#complex_calc#254"{String}("wrapped_function_arg") => "newname" of type Pair{var"#complex_calc#254"{String}, String}

I have also tried:
transform(data, :, :, complex_calc) which gives incorrect results

Probably you want:

transform(groupby(data, :uniqueID), AsTable(All()) => complex_calc => "newname")

if you want to work with all the columns.
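As a minimal sketch of that call: the data and the body of complex_calc below are made up for illustration (the original post does not show them). With AsTable(All()) and no ByRow, the function receives one NamedTuple per group whose fields are that group's column vectors, and a scalar result is repeated across the rows of the group.

```julia
using DataFrames

# Hypothetical data: two rows per uniqueID (not from the original post)
data = DataFrame(uniqueID = [1, 1, 2, 2], a = [1, 2, 3, 4], b = [10, 20, 30, 40])

# Hypothetical stand-in for complex_calc: `nt` is a NamedTuple of the
# group's columns, so nt.a and nt.b are vectors here
complex_calc(nt) = sum(nt.a) + sum(nt.b)

# The scalar returned for each group is repeated for every row of that group
result = transform(groupby(data, :uniqueID), AsTable(All()) => complex_calc => "newname")
```

Using transform rather than transform! leaves data untouched and returns a new data frame.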

However, note that in this case you can also just do:

[complex_calc(x) for x in groupby(data, :uniqueID)]

(in which case the result will be a vector, not a data frame, but sometimes you might prefer that).
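A small sketch of the comprehension form, again with made-up data and a hypothetical complex_calc. Here each x is a SubDataFrame for one group, so the function has to accept that type; property access such as x.a works on it.

```julia
using DataFrames

# Hypothetical data (not from the original post)
data = DataFrame(uniqueID = [1, 1, 2, 2], a = [1, 2, 3, 4], b = [10, 20, 30, 40])

# Hypothetical stand-in for complex_calc: `sdf` is the SubDataFrame of one group
complex_calc(sdf) = sum(sdf.a) + sum(sdf.b)

# One element per group, collected into a plain Vector
per_group = [complex_calc(x) for x in groupby(data, :uniqueID)]
```

The result has one entry per group rather than one per row, which is the main practical difference from the transform version.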


There are a number of ways to do this depending on how you have constructed complex_calc.

Some comments:

  1. I think the grouping isn’t doing anything for you if each group is a single row.
  2. The first element in the Pair should be the columns to pass to complex_calc. Then the second element is complex_calc, and the third is the new column name.
  3. Remember that columns are passed as vectors to complex_calc. If you instead want the elements of the row passed as scalar arguments, then you need to wrap complex_calc in ByRow inside the transform.
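  4. AsTable can be used to pass the selected columns as a single NamedTuple argument, but the fields of that NamedTuple are still whole column vectors unless you use ByRow too, in which case the function receives one NamedTuple per row.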
using DataFrames

complex_calc1(x, y, z) = x * y + z
complex_calc2(row) = row.x * row.y + row.z

df = DataFrame(id = "Row " .* string.(1:3), x = 1:3, y = 4:6, z = 7:9)

transform!(df, Not(:id) => ByRow(complex_calc1) => "Positional Argument Method")
transform!(df, AsTable(:) => ByRow(complex_calc2) => "Row Method")
df."Direct Argument Method" = complex_calc1.(df.x, df.y, df.z)
df."Direct Row Method" = complex_calc2.(eachrow(df))
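To make comments 3 and 4 concrete, here is a small sketch contrasting the same columns passed with and without ByRow. The anonymous functions and the column names total and value are illustrative only.

```julia
using DataFrames

df = DataFrame(x = 1:3, y = 4:6, z = 7:9)

# Without ByRow: x, y, z arrive as whole column vectors, so the function
# must broadcast itself; a scalar result is repeated on every row
whole = transform(df, [:x, :y, :z] => ((x, y, z) -> sum(x .* y .+ z)) => :total)

# With ByRow: x, y, z arrive as scalars, one call per row
per_row = transform(df, [:x, :y, :z] => ByRow((x, y, z) -> x * y + z) => :value)

# AsTable without ByRow: one NamedTuple whose fields are column vectors
tbl = transform(df, AsTable(:) => (nt -> nt.x .* nt.y .+ nt.z) => :value)
```

The last two calls produce the same column; they differ only in whether the broadcasting happens inside the function or is handled by ByRow.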

A side note: the first three are going to be very fast. The last one will be slow, because eachrow yields DataFrameRow objects whose column types are not known to the compiler.


I got “unknown algorithm” errors when trying to use the 2nd and 4th methods. Any idea why?

Did you get that error running my Minimum Working Example (MWE) above?

I’m not familiar with that error. I imagine it is something inside your complex_calc function which is not defined to take a NamedTuple (Method 2) or DataFrameRow (Method 4). Note that I defined two different complex_calc functions and used a different transformation syntax depending on how I defined complex_calc. If your existing complex_calc works with Methods 1 and 3, then just use those. If you are set on using Method 2 or 4, you could add a new method complex_calc(row) = complex_calc(row.x, row.y, row.z) to teach complex_calc how to handle a single row argument.
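A sketch of that last suggestion, reusing the toy x * y + z calculation from the MWE as a stand-in for the real complex_calc (whose body is not shown in this thread). The one-argument method forwards to the three-argument one, so the same function name works with both the NamedTuple and the DataFrameRow forms.

```julia
using DataFrames

# Stand-in for the existing positional definition
complex_calc(x, y, z) = x * y + z

# Extra method: accepts anything with x, y and z properties,
# e.g. a NamedTuple (Method 2) or a DataFrameRow (Method 4)
complex_calc(row) = complex_calc(row.x, row.y, row.z)

df = DataFrame(id = "Row " .* string.(1:3), x = 1:3, y = 4:6, z = 7:9)

transform!(df, AsTable(:) => ByRow(complex_calc) => "Row Method")
df."Direct Row Method" = complex_calc.(eachrow(df))
```

If the real function uses different column names, the forwarding method needs to name those columns instead.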