Transform in DataFrames

I’m struggling with transformations in DataFrames.

To make it simple, let’s start with a very basic transformation. Suppose that I want to add columns :b and :c in the following DataFrame:

df = DataFrame(a=repeat([1,2], outer=3), b=repeat([1,2,3], outer=2), c=1:6)

6×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      1
   2 │     2      2      2
   3 │     1      3      3
   4 │     2      1      4
   5 │     1      2      5
   6 │     2      3      6

gdf = groupby(df, :a)
transform(gdf, [:b, :c] => ((b, c) -> b + c) => :b_plus_c)

6×4 DataFrame
 Row │ a      b      c      b_plus_c
     │ Int64  Int64  Int64  Int64
─────┼───────────────────────────────
   1 │     1      1      1         2
   2 │     2      2      2         4
   3 │     1      3      3         6
   4 │     2      1      4         5
   5 │     1      2      5         7
   6 │     2      3      6         9

Now, suppose that, instead of adding two columns, I want to multiply them:

transform(gdf, [:b, :c] => ((b, c) -> b * c) => :b_times_c)

Now I get:

MethodError: no method matching *(::SubArray{Int64, 1, Vector{Int64}, Tuple{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}, false}, ::SubArray{Int64, 1, Vector{Int64}, Tuple{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}, false})

Closest candidates are:
*(::Any, ::Any, ::Any, ::Any...)
@ Base operators.jl:578
*(::AbstractVector, ::LinearAlgebra.AbstractRotation)
@ LinearAlgebra C:\Users\USUARIO\AppData\Local\Programs\Julia-1.9.2\share\julia\stdlib\v1.9\LinearAlgebra\src\givens.jl:19
*(::LinearAlgebra.Diagonal, ::AbstractVector)
@ LinearAlgebra C:\Users\USUARIO\AppData\Local\Programs\Julia-1.9.2\share\julia\stdlib\v1.9\LinearAlgebra\src\diagonal.jl:242

If instead of multiplying, I want to calculate the minimum of :b and :c:

transform(gdf, [:b, :c] => ((b, c) -> minimum(b,c)) => :minimum_b_c)

… I get…

MethodError: objects of type SubArray{Int64, 1, Vector{Int64}, Tuple{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}, false} are not callable
Use square brackets for indexing an Array.

My conclusion is that transform can only be used to add or subtract two columns, but it is unable to multiply or divide them.
Perhaps I am doing something wrong, but I cannot figure out what it is.

Well, the function passed to transform gets handed whole column vectors, not individual values. With a GroupedDataFrame such as gdf, that means the slice of each column belonging to one group (the SubArray types in your error messages), so in your example b and c are bound to vectors when ((b, c) -> <do something on b and c>) is executed. Thus, your function should be designed to operate on vectors and return a vector containing the result of your transformation.

  1. b + c works, because + happens to be defined for vectors (vectors of equal length form a vector space, where addition makes sense).

  2. b * c is not defined for two vectors, as there is no single unambiguous product on (finite) vector spaces. To apply * element-wise, use explicit broadcasting, i.e., write b .* c.

  3. minimum usually takes a single argument and reduces its argument to a single minimal value. To compute the elementwise min, you need to broadcast the min function, i.e., write min.(b, c).
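The three points above can be checked on plain vectors, without DataFrames at all. This is only a minimal sketch: the values of b and c are picked to match one group of the example (the rows where a == 1), and the variable names are illustrative.

```julia
# Illustrative vectors: the b and c values for the rows where a == 1.
b = [1, 3, 2]
c = [1, 3, 5]

sum_bc  = b + c        # + is defined for two vectors of equal length
prod_bc = b .* c       # element-wise product via broadcasting
min_bc  = min.(b, c)   # element-wise minimum via broadcasting
lowest  = minimum(c)   # reduces one collection to a single value

# Plain b * c throws for two vectors, as in the question.
plain_product_throws = try
    b * c
    false
catch
    true
end
```

Any of the vector-returning expressions can be used as the body of the anonymous function passed to transform.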

Thanks, @bertschi
I have checked your solution on my PC.

b .* c works for me.
but min.(b, c) does not

I get the following error:

transform(gdf, [:b, :c] => ((b, c) -> min.(b, c)) => :minimum_b_c)

MethodError: objects of type Tuple{Int64, Int64, Int64} are not callable

Hmm, it does work for me (and I cannot reproduce the exact error when trying some variants).

I’m running Julia Ver 1.9.2 and DataFrames Ver 1.6.1

Could this explain why it is not working for me?

Maybe, but broadcasting is very deeply ingrained in Julia and should work in any version.
Can you post the precise interaction you typed into the REPL, including the code line and the full stacktrace?

You need to look at the ByRow function.

using DataFrames

df = DataFrame(a=repeat([1,2], outer=3), b=repeat([1,2,3], outer=2), c=1:6)

transform(df,["b","c"]=>ByRow(+))

transform(df,["b","c"]=>ByRow(*))

transform(df,["b","c"]=>ByRow(min))
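To relate ByRow to the broadcasting advice earlier in the thread, here is a small sketch that runs both forms on the grouped data frame and compares them. The output column name min_b_c is chosen for illustration; for a row-wise function like min, the two forms should give the same result.

```julia
using DataFrames

df = DataFrame(a=repeat([1, 2], outer=3), b=repeat([1, 2, 3], outer=2), c=1:6)
gdf = groupby(df, :a)

# ByRow wraps min so that it is called once per row with scalar arguments.
by_row = transform(gdf, [:b, :c] => ByRow(min) => :min_b_c)

# The anonymous function receives each group's vectors and broadcasts min itself.
broadcasted = transform(gdf, [:b, :c] => ((b, c) -> min.(b, c)) => :min_b_c)

same_result = by_row == broadcasted
```

ByRow is the shorter spelling when the function is an existing row-wise one; the explicit broadcast is handy when the anonymous function mixes row-wise and whole-group operations.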

Probably easiest is DataFramesMeta:

using DataFrames, DataFramesMeta;

df = DataFrame(a=repeat([1,2], outer=3), b=repeat([1,2,3], outer=2), c=1:6);

gdf = groupby(df, :a);

@rtransform(gdf, :bcmin = min(:b, :c))

Have you accidentally done something like this?

# julia> transform(gdf, [:b, :c] => ((b, c) -> (c...,)(b)) => :minimum_b_c)
# ERROR: MethodError: objects of type Tuple{Int64, Int64, Int64} are not callable

In case you need it, know that a function that transforms your input vectors into a scalar is fine too.

transform(gdf, [:b, :c] => ((b, c) -> b'*c) => :b_dot_c)
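As a sketch of what happens with a scalar result: on a grouped data frame, transform repeats the value returned for a group on every row of that group. The column name b_dot_c comes from the line above; the rest reuses the thread's example data.

```julia
using DataFrames

df = DataFrame(a=repeat([1, 2], outer=3), b=repeat([1, 2, 3], outer=2), c=1:6)
gdf = groupby(df, :a)

# b' * c is the dot product of the group's b and c values: one number per group.
result = transform(gdf, [:b, :c] => ((b, c) -> b' * c) => :b_dot_c)

# Group a == 1 has b = [1, 3, 2], c = [1, 3, 5], so 1*1 + 3*3 + 2*5 = 20.
# Group a == 2 has b = [2, 1, 3], c = [2, 4, 6], so 2*2 + 1*4 + 3*6 = 26.
dot_column = result.b_dot_c
```

The rows keep their original order, so the two group values alternate down the column.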

In case you’re just starting with Julia DataFrames, I think it’s important to distinguish between the operations you’re trying to perform.

When you’re operating with pairs of columns, like with +, you don’t need to group the data or apply transform. You should simply do:


dff = DataFrame(a=repeat([1,2], outer=3), b=repeat([1,2,3], outer=2), c=1:6)

dff.b_plus_c = dff.b + dff.c

The same holds for min, if what you want is the element-wise minimum of two columns:

dff.min_b_c = min.(dff.b, dff.c)

Instead, use transform when you want to operate on whole columns rather than on pairs of values taken row by row.

In that case, you probably want to decide what counts as a "whole vector", typically by grouping the data so that the function sees one group's values at a time.

For example, if you want to take the minimum of c for each specific group of a and return the result, then you need:

gdf = groupby(dff, :a)

transform(gdf, :c => minimum => :min_c)

transform(gdf, :c => (c -> minimum(c)) => :min_c) # equivalent: this is what the previous line does

Also, use transform! if you want to update the original dff in place rather than create a new data frame. Plain transform returns its result as a new data frame, so the new column is lost unless you assign that result to a variable.
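A short sketch of the difference between transform and transform!, reusing dff and the min_c column from above (the helper variable names are only illustrative):

```julia
using DataFrames

dff = DataFrame(a=repeat([1, 2], outer=3), b=repeat([1, 2, 3], outer=2), c=1:6)
gdf = groupby(dff, :a)

# transform returns a new data frame and leaves dff untouched.
result = transform(gdf, :c => minimum => :min_c)
in_dff_after_transform = "min_c" in names(dff)

# transform! adds the column to the parent data frame of gdf, which is dff.
transform!(gdf, :c => minimum => :min_c)
in_dff_after_transform_bang = "min_c" in names(dff)
```

Which one to use is mostly a matter of whether you want to keep the original table around unchanged.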


If this syntax is more familiar, the TidierData.jl package makes the broadcasting invisible:

using TidierData
@chain df begin
  @mutate(b_plus_c = b + c, b_times_c = b * c, bc_min = min(b,c))
end

Here is what I tried, as per your suggestion:

df = DataFrame(a=repeat([1,2], outer=3), b=repeat([1,2,3], outer=2), c=1:6)
gdf=groupby(df, :a)
transform(gdf, [:b, :c] => ((b, c) -> min.(b, c)) => :min_b_c)

And this is what I’m getting:

MethodError: objects of type Tuple{Int64, Int64, Int64} are not callable

Stacktrace:
[1] _combine(gd::GroupedDataFrame{DataFrame}, cs_norm::Vector{Any}, optional_transform::Vector{Bool}, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
@ DataFrames C:\Users\USUARIO.julia\packages\DataFrames\58MUJ\src\groupeddataframe\splitapplycombine.jl:755
[2] _combine_prepare_norm(gd::GroupedDataFrame{DataFrame}, cs_vec::Vector{Any}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
@ DataFrames C:\Users\USUARIO.julia\packages\DataFrames\58MUJ\src\groupeddataframe\splitapplycombine.jl:87
[3] _combine_prepare(gd::GroupedDataFrame{DataFrame}, ::Base.RefValue{Any}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
@ DataFrames C:\Users\USUARIO.julia\packages\DataFrames\58MUJ\src\groupeddataframe\splitapplycombine.jl:52
[4] _combine_prepare
@ C:\Users\USUARIO.julia\packages\DataFrames\58MUJ\src\groupeddataframe\splitapplycombine.jl:26 [inlined]
[5] select(::GroupedDataFrame{DataFrame}, ::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat}, ::Vararg{Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat}}; copycols::Bool, keepkeys::Bool, ungroup::Bool, renamecols::Bool, threads::Bool)
@ DataFrames C:\Users\USUARIO.julia\packages\DataFrames\58MUJ\src\groupeddataframe\splitapplycombine.jl:892
[6] transform(gd::GroupedDataFrame{DataFrame}, args::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat}; copycols::Bool, keepkeys::Bool, ungroup::Bool, renamecols::Bool, threads::Bool)
@ DataFrames C:\Users\USUARIO.julia\packages\DataFrames\58MUJ\src\groupeddataframe\splitapplycombine.jl:917
[7] transform(gd::GroupedDataFrame{DataFrame}, args::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat})
@ DataFrames C:\Users\USUARIO.julia\packages\DataFrames\58MUJ\src\groupeddataframe\splitapplycombine.jl:912
[8] top-level scope
@ In[202]:3

That code works fine for me. There must be some other code you are running that's broken.

Bumping the DataFramesMeta.jl solution though (as the maintainer of the package). It’s pretty easy to use imo.

Thanks for posting the details. Unfortunately, the example runs fine for me as well.
The only way I can reproduce this error is by shadowing min, i.e.,

julia> let min = (1,2,3)
           transform(gdf, [:b, :c] => ((b, c) -> min.(b, c)) => :min_b_c)
       end
ERROR: MethodError: objects of type Tuple{Int64, Int64, Int64} are not callable

Are you running this line inside a function with an argument named min, or have you otherwise managed to redefine it?
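If shadowing does turn out to be the cause, one possible workaround is to refer to the function by its qualified name, Base.min. The sketch below uses a hypothetical helper, demo_shadowing, and plain vectors instead of the data frame, only to show the effect in isolation.

```julia
# demo_shadowing is a hypothetical helper written for this illustration.
function demo_shadowing(b, c)
    min = (1, 2, 3)  # this local name hides Base.min inside the function

    # Broadcasting now tries to call the tuple, which is not callable.
    shadowed_call_fails = try
        min.(b, c)
        false
    catch err
        err isa MethodError
    end

    # The qualified name still reaches the real function.
    qualified = Base.min.(b, c)
    return shadowed_call_fails, qualified
end

shadowed_call_fails, qualified = demo_shadowing([1, 3, 2], [1, 3, 5])
```

In a notebook session, restarting the kernel or renaming the offending variable would be the cleaner fix; the qualified name is just a quick way to confirm the diagnosis.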
