NamedTuple as a single value

I would like to report the following results and get comments on how to interpret them.
The problem is not artificial.
It derives from the use case treated here, where, among other things, the error description is slightly different. Why?

using DataFrames


df=DataFrame(x=rand(1:5,10),y=[(nt=rand(1:5),) for _ in 1:10])

combine(groupby(df,:x), :y=>last)

# julia> combine(groupby(df,:x), :y=>last)
# ERROR: ArgumentError: a single value or vector result is required (got NamedTuple{(:nt,), Tuple{Int64}})
julia> cdf=combine(groupby(df,:x), :y=>Ref∘ last)
4×2 DataFrame
 Row │ x      y_Ref_last       
     │ Int64  NamedTupl…
   1 │     1  (f1 = 5, f2 = 4)
   2 │     2  (f1 = 4, f2 = 4)
   3 │     3  (f1 = 3, f2 = 2)
   4 │     5  (f1 = 5, f2 = 3)

julia> combine(groupby(df,:x), :y=>last=>AsTable)
4×3 DataFrame
 Row │ x      f1     f2    
     │ Int64  Int64  Int64
   1 │     1      5      4
   2 │     2      4      4
   3 │     3      3      2
   4 │     5      5      3

Can you please state the specific question you have? Everything you present in the post above works as expected (apart from the fact that you seem to have reported results obtained with a different df from the one you create).

If this error message is what is confusing you, the reason for the error is the following. Your operation returns a named tuple, which is treated as a multi-column result. Returning a multi-column result is allowed only if AsTable or a list of column names is passed as the target column names.
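To make the rule concrete, here is a minimal sketch of the three options. The table `df_demo` and the output names `:a`, `:b` and `:y_last` are made up for illustration; they are not the df from the post above.

```julia
using DataFrames

# Small deterministic table (hypothetical data, chosen only for illustration)
df_demo = DataFrame(x = [1, 1, 2],
                    y = [(f1 = 1, f2 = 2), (f1 = 3, f2 = 4), (f1 = 5, f2 = 6)])
gdf = groupby(df_demo, :x)

# 1. Expand the named tuple into columns named after its fields
expanded = combine(gdf, :y => last => AsTable)

# 2. Expand it into columns with explicitly chosen names
renamed = combine(gdf, :y => last => [:a, :b])

# 3. Keep the named tuple in a single cell by wrapping it in Ref
single = combine(gdf, :y => Ref∘last => :y_last)
```

In the first two calls the target tells combine that a multi-column result is intended; in the third, the Ref wrapper marks the named tuple as one value, so nothing is left to guess.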

Now this question and your previous question show the approach we take in DataFrames.jl (as opposed to R). We want to make sure that the user gets a correct result. If something is ambiguous, we throw an error. This differs from R, which tries to guess what the user wanted in case of ambiguity. We chose the “safety first” approach, as it is preferred in production applications (where you do not want to silently get a wrong result).

OK. The df involved is the following:

df=DataFrame(x=rand(1:5,10),y=[(f1=rand(1:5),f2=rand(1:5)) for _ in 1:10])

As with push I can insert a namedtuple as the value of a cell, I would have expected the result of a function inside combine (in this case, last applied to a vector of named tuples) to be treated similarly.
I know of the other situations where a cell holding a named tuple has to be “expanded”.
I don’t know whether it is possible, or even useful, to make the two situations coexist.

This is exactly the ambiguity :smile:.

As with push I can insert a namedtuple as the value of a cell

You cannot. If you push a NamedTuple, it will always get expanded to multiple columns. If you want to push a NamedTuple into a single cell, you need to wrap it (e.g. in a vector).

This statement confuses me.
I am referring to the following expressions, which give the expected result.

df = DataFrame(A=[1,2,3], B=[:x,:y,:z], C=[(1,3), (-2,1),(4,4)])
dfh=DataFrame(s=0,id=0,a=missings(NamedTuple,1),r=missings(NamedTuple,1), m=missings(NamedTuple,1))

push!(dfh,(s=1,id=1,a=copy(df[1,:]),r=missing, m=missing))
push!(dfh,(s=2,id=2,a=copy(df[2,:]),r=missing, m=missing))
push!(dfh,(s=3,id=3,a=copy(df[3,:]),r=missing, m=missing))

This is exactly what I stated:

  • the (s=1,id=1,a=copy(df[1,:]),r=missing, m=missing) named tuple is always expanded to multiple columns, as it is top-level (not wrapped in anything);
  • the copy(df[1,:]) named tuple is not expanded and is treated as a single column, because it is wrapped (in this case, in another named tuple).
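The two bullets can be reproduced on a smaller table. This is only a sketch: the table `dfp` and its columns `id` and `payload` are hypothetical names chosen for the illustration.

```julia
using DataFrames

# Hypothetical table with one column that stores whole named tuples
dfp = DataFrame(id = Int[], payload = NamedTuple[])

# The outer named tuple is the row: its fields are matched to the columns id and payload.
# The inner named tuple is nested, so it is stored in the single cell payload.
push!(dfp, (id = 1, payload = (f1 = 5, f2 = 4)))

# Wrapping in a vector also works: the vector is the row and its elements are the cells
push!(dfp, [2, (f1 = 3, f2 = 2)])
```

In both calls only the outermost container is interpreted as a row; anything nested inside it is kept as a single cell value.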