Groupby and aggregate a dataframe with custom function that return a vector

Hi,

It is my understanding that I can have a dataframe column with array/vector. Indeed I can do

julia> df = DataFrame(A = ["M", "F"], B = [[1,4], [2,3]])
2×2 DataFrame
 Row │ A       B      
     │ String  Array… 
─────┼────────────────
   1 │ M       [1, 4]
   2 │ F       [2, 3]

Assuming my initial dataframe is

julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
4×2 DataFrame
 Row │ A      B      
     │ Int64  String 
─────┼───────────────
   1 │     1  M
   2 │     2  F
   3 │     3  F
   4 │     4  M

and I have this custom function

function myfun(p)
    result = Vector{eltype(p)}()
    for i = 1:length(p)
        push!(result, p[i])
    end
    println(result) #just to see the result
    result
end

why this does not work?

julia> combine(groupby(df, :B), [:A] => ((a) -> (a=myfun(a))) => AsTable)
[1, 4]
[2, 3]
4×2 DataFrame        
 Row │ B       x1    
     │ String  Int64 
─────┼───────────────
   1 │ M           1 
   2 │ M           4 
   3 │ F           2 
   4 │ F           3 

DataFrames has a function flatten which will help here.

julia> using DataFrames

julia> df = DataFrame(A = ["M", "F"], B = [[1,4], [2,3]])
2×2 DataFrame
 Row │ A       B      
     │ String  Array… 
─────┼────────────────
   1 │ M       [1, 4]
   2 │ F       [2, 3]

julia> flatten(df, :B)
4×2 DataFrame
 Row │ A       B     
     │ String  Int64 
─────┼───────────────
   1 │ M           1
   2 │ M           4
   3 │ F           2
   4 │ F           3

It’s not clear what @Pisistrato means by “not working” but I think the desired output is a vector for each group.

The problem with combine(groupby(df, :B), [:A] => ((a) -> (a=myfun(a))) => AsTable) is that (a=...) is a simple assignment, not a named tuple so you need to write (a=...,), and the second problem is that returning a vector means returning a column. But you want a vector for each row in the result! So you need to return a vector of vectors :slight_smile:

A simple adaptation of the original code is:

combine(groupby(df, :B), [:A] => ((a) -> (a=[myfun(a)],)) => AsTable)

which can be simplified to

combine(groupby(df, :B), :A => Base.vect∘myfun => :a)

2×2 DataFrame
 Row │ B       a      
     │ String  Array… 
─────┼────────────────
   1 │ M       [1, 4]
   2 │ F       [2, 3]

Edit (thanks @pdeffebach): one can also “protect” the result with Ref:

combine(groupby(df, :B), :A => Ref∘myfun => :a)

Oh, sorry for not reading carefully. OP needs to use Ref. Here is a solution

julia> df
4×2 DataFrame
 Row │ A       B     
     │ String  Int64 
─────┼───────────────
   1 │ M           1
   2 │ M           4
   3 │ F           2
   4 │ F           3

julia> combine(groupby(df, :A), :B => Ref => :B)
2×2 DataFrame
 Row │ A       B         
     │ String  SubArray… 
─────┼───────────────────
   1 │ M       [1, 4]
   2 │ F       [2, 3]

julia> using DataFramesMeta;

julia> @by df :A (:B = Ref(:B))
2×2 DataFrame
 Row │ A       B         
     │ String  SubArray… 
─────┼───────────────────
   1 │ M       [1, 4]
   2 │ F       [2, 3]

Would you mind if you dropped a few lines explaining what Ref() does in that case or simply redirect us to a place with similar examples and an explanation? The manual is definitely clear to CS people but not to me… Based on the manual definition below, I cannot see the connection. I seem to always miss the point with Ref().

Ref{T}
An object that safely references data of type T. This type is guaranteed to point to valid, Julia-allocated memory of the correct type. The underlying data is protected from freeing by the garbage collector as long as the Ref itself is referenced.

Following @Erik_Engheim 's post here, Ref() is creating a pointer-like object, if I understood right.

So, there is definitely something I am missing here, since I don’t see why the :B => Ref => :B works. I find it pretty often in code repos and it is a saver in many situations, but if I don’t understand how it works, I will not be able to use it. Any hint is more than welcome!

I want to emphasize, the vast majority of people on this website are not CS people. We are mostly scientists / data scientists with amateur computer science knowledge.

Further down in the docs give a (short, unsatisfying) answer

Use in broadcasting

Ref is sometimes used in broadcasting in order to treat the referenced values as a scalar.

Basically DataFrames.jl tries to “expand” vectors to fill the full column in combine. Ref stops that. It’s in the docs here, again not a very complete explanation

In all of these cases, function can return either a single row or multiple rows. As a particular rule, values wrapped in a Ref or a 0 -dimensional AbstractArray are unwrapped and then treated as a single row.

Ref is basically a confusing way to treat an object as a scalar during broadcasting. It might be nice if Base Julia had a better name for this, like Scalar. Another alternative which might be more intuitive is to wrap a collection in a length-one tuple in order to treat that collection as a scalar:

julia> [1, 2] .∈ Ref([1, 3, 5])
2-element BitVector:
 1
 0

julia> [1, 2] .∈ ([1, 3, 5], )
2-element BitVector:
 1
 0

There is some discussion of the Ref trick in the documentation for in.

Thanks for the answer!

It is relieving to hear that there are many people with similar CS background.

So, in other words, if I understood it right, in the case of :B => Ref => :B what Ref does is that it takes the column :B and keeps its contents as an array for every group of :A. In that sense, in the context of broadcasting, Ref functions as a kind of “delaying” mechanism.

Thanks again!

Thank you for the example and the short explanation @CameronBieganek . It definitely helped! I will also visit the discussion.