Groupby and aggregate a dataframe with custom function that return a vector

Hi,

It is my understanding that I can have a dataframe column with array/vector. Indeed I can do

julia> df = DataFrame(A = ["M", "F"], B = [[1,4], [2,3]])
2×2 DataFrame
 Row │ A       B      
     │ String  Array… 
─────┼────────────────
   1 │ M       [1, 4]
   2 │ F       [2, 3]

Assuming my initial dataframe is

julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
4×2 DataFrame
 Row │ A      B      
     │ Int64  String 
─────┼───────────────
   1 │     1  M
   2 │     2  F
   3 │     3  F
   4 │     4  M

and I have this custom function

function myfun(p)
    result = Vector{eltype(p)}()
    for i = 1:length(p)
        push!(result, p[i])
    end
    println(result) #just to see the result
    result
end

why this does not work?

julia> combine(groupby(df, :B), [:A] => ((a) -> (a=myfun(a))) => AsTable)
[1, 4]
[2, 3]
4×2 DataFrame        
 Row │ B       x1    
     │ String  Int64 
─────┼───────────────
   1 │ M           1 
   2 │ M           4 
   3 │ F           2 
   4 │ F           3 

DataFrames has a function flatten which will help here.

julia> using DataFrames

julia> df = DataFrame(A = ["M", "F"], B = [[1,4], [2,3]])
2×2 DataFrame
 Row │ A       B      
     │ String  Array… 
─────┼────────────────
   1 │ M       [1, 4]
   2 │ F       [2, 3]

julia> flatten(df, :B)
4×2 DataFrame
 Row │ A       B     
     │ String  Int64 
─────┼───────────────
   1 │ M           1
   2 │ M           4
   3 │ F           2
   4 │ F           3
1 Like

It’s not clear what @Pisistrato means by “not working” but I think the desired output is a vector for each group.

The problem with combine(groupby(df, :B), [:A] => ((a) -> (a=myfun(a))) => AsTable) is that (a=...) is a simple assignment, not a named tuple so you need to write (a=...,), and the second problem is that returning a vector means returning a column. But you want a vector for each row in the result! So you need to return a vector of vectors :slight_smile:

A simple adaptation of the original code is:

combine(groupby(df, :B), [:A] => ((a) -> (a=[myfun(a)],)) => AsTable)

which can be simplified to

combine(groupby(df, :B), :A => Base.vect∘myfun => :a)

2×2 DataFrame
 Row │ B       a      
     │ String  Array… 
─────┼────────────────
   1 │ M       [1, 4]
   2 │ F       [2, 3]

Edit (thanks @pdeffebach): one can also “protect” the result with Ref:

combine(groupby(df, :B), :A => Ref∘myfun => :a)

Oh, sorry for not reading carefully. OP needs to use Ref. Here is a solution

julia> df
4×2 DataFrame
 Row │ A       B     
     │ String  Int64 
─────┼───────────────
   1 │ M           1
   2 │ M           4
   3 │ F           2
   4 │ F           3

julia> combine(groupby(df, :A), :B => Ref => :B)
2×2 DataFrame
 Row │ A       B         
     │ String  SubArray… 
─────┼───────────────────
   1 │ M       [1, 4]
   2 │ F       [2, 3]

julia> using DataFramesMeta;

julia> @by df :A (:B = Ref(:B))
2×2 DataFrame
 Row │ A       B         
     │ String  SubArray… 
─────┼───────────────────
   1 │ M       [1, 4]
   2 │ F       [2, 3]
2 Likes

Would you mind if you dropped a few lines explaining what Ref() does in that case or simply redirect us to a place with similar examples and an explanation? The manual is definitely clear to CS people but not to me… Based on the manual definition below, I cannot see the connection. I seem to always miss the point with Ref().

Ref{T}
An object that safely references data of type T. This type is guaranteed to point to valid, Julia-allocated memory of the correct type. The underlying data is protected from freeing by the garbage collector as long as the Ref itself is referenced.

Following @Erik_Engheim 's post here, Ref() is creating a pointer-like object, if I understood right.

So, there is definitely something I am missing here, since I don’t see why the :B => Ref => :B works. I find it pretty often in code repos and it is a saver in many situations, but if I don’t understand how it works, I will not be able to use it. Any hint is more than welcome!

I want to emphasize, the vast majority of people on this website are not CS people. We are mostly scientists / data scientists with amateur computer science knowledge.

Further down in the docs give a (short, unsatisfying) answer

Use in broadcasting

Ref is sometimes used in broadcasting in order to treat the referenced values as a scalar.

Basically DataFrames.jl tries to “expand” vectors to fill the full column in combine. Ref stops that. It’s in the docs here, again not a very complete explanation

In all of these cases, function can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref or a 0 -dimensional AbstractArray are unwrapped and then treated as a single row.

Ref is basically a confusing way to treat an object as a scalar during broadcasting. It might be nice if Base Julia had a better name for this, like Scalar. Another alternative which might be more intuitive is to wrap a collection in a length-one tuple in order to treat that collection as a scalar:

julia> [1, 2] .∈ Ref([1, 3, 5])
2-element BitVector:
 1
 0

julia> [1, 2] .∈ ([1, 3, 5], )
2-element BitVector:
 1
 0

There is some discussion of the Ref trick in the documentation for in.

1 Like

Thanks for the answer!

It is relieving to hear that there are many people with similar CS background.

So, in other words, if I understood it right, in the case of :B => Ref => :B what Ref does is that it takes the column :B and keeps its contents as an array for every group of :A. In that sense, in the context of broadcasting, Ref functions as a kind of “delaying” mechanism.

Thanks again!

Thank you for the example and the short explanation @CameronBieganek . It definitely helped! I will also visit the discussion.