I’m a bit rusty on data table programming and I am getting stuck using the DataFrames.jl package when trying to apply a complex function to every row of a DataFrame. For example, I would like to take the entirety of a row and use a bunch of different columns (imagine, all of them) in a complex calculation to calculate a new column.
I have tried:
transform!(groupby(data, :uniqueID), complex_calc)
and I think it does what I want, adding a new column called x1,
but if I try to pass a name using:
transform!(groupby(data, :uniqueID), complex_calc => "newname")
it gives the error:
ArgumentError: invalid index: var"#complex_calc#254"{String}("wrapped_function_arg") => "newname" of type Pair{var"#complex_calc#254"{String}, String}
I have also tried:
transform(data, :, :, complex_calc)
which gives incorrect results
Probably you want:
transform(groupby(data, :uniqueID), AsTable(All()) => complex_calc => "newname")
if you want to work with all the columns.
However, note that in this case you can also just do:
[complex_calc(x) for x in groupby(data, :uniqueID)]
(in which case the result will be a vector, not a data frame, but sometimes you might prefer that).
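To make the two options above concrete, here is a minimal sketch. The data frame, its column names, and the body of complex_calc are made up for illustration (your real function will differ); the only assumption is that complex_calc accepts an object with x and y fields holding the group's columns.

```julia
using DataFrames

# Hypothetical data: each uniqueID identifies exactly one row.
data = DataFrame(uniqueID = ["a", "b", "c"], x = 1:3, y = 4:6)

# Hypothetical stand-in for complex_calc. With AsTable it receives a
# NamedTuple whose fields are the group's columns (vectors), and with the
# comprehension it receives a SubDataFrame; both support nt.x and nt.y.
complex_calc(nt) = sum(nt.x) + sum(nt.y)

# Option 1: keep a data frame and name the new column.
result = transform(groupby(data, :uniqueID),
                   AsTable(All()) => complex_calc => "newname")

# Option 2: collect one value per group into a plain vector.
per_group = [complex_calc(g) for g in groupby(data, :uniqueID)]
```

In both cases the function is called once per group, so it has to cope with receiving columns rather than scalars, even when every group holds a single row.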
There are a number of ways to do this depending on how you have constructed complex_calc.
Some comments:
- I think the grouping isn’t doing anything for you if each group is a single row.
- The first element in the Pair should be the columns to pass to complex_calc. Then the second element is complex_calc, and the third is the new column name.
- Remember that columns are passed as vectors to complex_calc. If you instead want the elements of the row passed as scalar arguments, then you need to wrap complex_calc in ByRow inside the transform.
- AsTable can be used to pass an entire row as one argument, but the columns of the table are still vectors unless you use ByRow too.
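using DataFrames
# Takes the values of one row as separate scalar arguments.
complex_calc1(x, y, z) = x * y + z
# Takes one row as a single object with fields x, y and z.
complex_calc2(row) = row.x * row.y + row.z
df = DataFrame(id = "Row " .* string.(1:3), x = 1:3, y = 4:6, z = 7:9)
# Method 1: pass the selected columns positionally, one row at a time.
transform!(df, Not(:id) => ByRow(complex_calc1) => "Positional Argument Method")
# Method 2: pass each row as a NamedTuple.
transform!(df, AsTable(:) => ByRow(complex_calc2) => "Row Method")
# Method 3: broadcast directly over the column vectors.
df."Direct Argument Method" = complex_calc1.(df.x, df.y, df.z)
# Method 4: broadcast over DataFrameRow objects.
df."Direct Row Method" = complex_calc2.(eachrow(df))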
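To illustrate the point about vectors versus scalars, here is a small sketch of what the function receives with and without ByRow. The data frame and the anonymous functions are made up for illustration and are not part of the example above.

```julia
using DataFrames

df = DataFrame(x = 1:3, y = 4:6)

# Without ByRow: the function is called once and receives whole columns,
# so the body has to broadcast explicitly.
whole = transform(df, [:x, :y] => ((x, y) -> x .* y) => :prod_cols)

# With ByRow: the function is called once per row and receives scalars.
rowwise = transform(df, [:x, :y] => ByRow((x, y) -> x * y) => :prod_rows)

# AsTable without ByRow: one NamedTuple whose fields are column vectors.
tbl = transform(df, AsTable(:) => (nt -> nt.x .+ nt.y) => :sum_cols)

# AsTable with ByRow: one NamedTuple of scalars per row.
tbl_row = transform(df, AsTable(:) => ByRow(nt -> nt.x + nt.y) => :sum_rows)
```

Each pair of calls produces the same values; the only difference is whether the broadcasting happens inside your function or is handled by ByRow.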
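A side note: the first three are going to be very fast. The last one, which broadcasts over eachrow(df), will be slow, because a DataFrameRow does not carry its column types and the call is therefore type-unstable.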
I got “unknown algorithm” errors when trying to use the 2nd and 4th methods. Any idea why?
Did you get that error running my Minimum Working Example (MWE) above?
I’m not familiar with that error. I imagine it is something inside your complex_calc function which is not defined to take a NamedTuple (Method 2) or DataFrameRow (Method 4). Note that I defined two different complex_calc functions and used a different transformation syntax depending on how I defined complex_calc. If your existing complex_calc works with Methods 1 and 3, then just use those. If you are set on using Method 2 or 4, you could add a new method complex_calc(row) = complex_calc(row.x, row.y, row.z) to teach complex_calc how to handle a single row argument.
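As a rough sketch of that last suggestion, assuming your complex_calc takes three positional arguments and your columns are named x, y and z (both are placeholders for whatever your real function and data look like):

```julia
using DataFrames

# Hypothetical existing definition taking scalar arguments.
complex_calc(x, y, z) = x * y + z

# Extra one-argument method: unpack the row and forward to the method above.
# It works for anything with x, y and z fields (NamedTuple or DataFrameRow).
complex_calc(row) = complex_calc(row.x, row.y, row.z)

df = DataFrame(x = 1:3, y = 4:6, z = 7:9)

# Method 2 style: each row arrives as a NamedTuple.
by_namedtuple = transform(df, AsTable(:) => ByRow(complex_calc) => :result)

# Method 4 style: each row arrives as a DataFrameRow.
by_dfrow = complex_calc.(eachrow(df))
```

Because Julia dispatches on the number of arguments, the same function name now works with all four call styles.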