Namedtuple as a single value

rocco_sprmnt21 · November 4, 2022, 8:35am

I wanted to report the following results and ask for comments on how to interpret them.
The problem is not artificial.
It comes from the use case treated here, where, among other things, the error description is slightly different. Why?

using DataFrames

df=DataFrame(x=rand(1:5,10),y=1:10)

combine(groupby(df,:x), :y=>last)

df=DataFrame(x=rand(1:5,10),y=[(nt=rand(1:5),) for _ in 1:10])

combine(groupby(df,:x), :y=>last)

# julia> combine(groupby(df,:x), :y=>last)
# ERROR: ArgumentError: a single value or vector result is required (got NamedTuple{(:nt,), Tuple{Int64}})
julia> cdf=combine(groupby(df,:x), :y=>Ref∘last)
4×2 DataFrame
 Row │ x      y_Ref_last       
     │ Int64  NamedTupl…
─────┼─────────────────────────
   1 │     1  (f1 = 5, f2 = 4)
   2 │     2  (f1 = 4, f2 = 4)
   3 │     3  (f1 = 3, f2 = 2)
   4 │     5  (f1 = 5, f2 = 3)

julia> combine(groupby(df,:x), :y=>last=>AsTable)
4×3 DataFrame
 Row │ x      f1     f2    
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      5      4
   2 │     2      4      4
   3 │     3      3      2
   4 │     5      5      3

bkamins · November 4, 2022, 8:55am

Can you please state the specific question you have? Everything you present in the post above works as expected (apart from the fact that you seem to have reported results from a different df than the one you create).

bkamins · November 4, 2022, 9:00am

If this error message is what is confusing you, the reason for the error is the following. Your operation returns a named tuple, which is a multi-column result. Returning a multi-column result is allowed only if AsTable or a list of column names is passed as the target column names.

Now this question and your previous question show the approach we take in DataFrames.jl (as opposed to R). We want to make sure that the user gets a correct result. If something is ambiguous, we throw an error. This is different from R, which tries to guess what the user wanted in case of ambiguity. We chose the “safety first” approach, as it is preferred in production applications (where you do not want to silently get a wrong result).
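
As a minimal sketch of the rule described above, the following uses a small fixed data frame (made up for illustration, not the one from the thread) and shows the three ways of telling combine how to treat a named tuple result. The variable names and the target names :y_last, :a, and :b are assumptions chosen for the example.

```julia
using DataFrames

# Illustrative data with fixed values instead of rand, so results are predictable.
df = DataFrame(x = [1, 1, 2, 2],
               y = [(f1 = 1, f2 = 2), (f1 = 3, f2 = 4),
                    (f1 = 5, f2 = 6), (f1 = 7, f2 = 8)])
gdf = groupby(df, :x)

# No target information: the named tuple result is ambiguous, so an error is thrown.
err = try
    combine(gdf, :y => last)
    nothing
catch e
    e
end

# Option 1: keep the named tuple as a single cell by wrapping it in Ref.
single = combine(gdf, :y => Ref ∘ last => :y_last)

# Option 2: expand the fields into columns named after the fields.
expanded = combine(gdf, :y => last => AsTable)

# Option 3: expand the fields into columns with explicitly chosen names.
renamed = combine(gdf, :y => last => [:a, :b])
```

In each case the caller states the intent explicitly, which is the point of the “safety first” approach: nothing is guessed from the shape of the returned value.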

rocco_sprmnt21 · November 4, 2022, 9:30am

OK. The df involved is the following:

df=DataFrame(x=rand(1:5,10),y=[(f1=rand(1:5),f2=rand(1:5)) for _ in 1:10])

As with push I can insert a namedtuple as the value of a cell, I would have expected that the result of a particular function (last (array of namedtuple) in this case) inside combine would also be treated similarly.
PS
I know of the other situations where the cell with a named tuple has to be “expanded”.
I don’t know if it is possible or even useful to make the two situations coexist

bkamins · November 4, 2022, 10:18am

This is exactly the ambiguity.

“As with push I can insert a namedtuple as the value of a cell”

You cannot. If you push a NamedTuple, it will always get expanded to multiple columns. If you want to push a NamedTuple to a single cell, you need to wrap it (e.g. in a vector).
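
A small sketch of the difference, using two hypothetical data frames (wide and cells are names invented for this example): a top-level named tuple is matched field by field to columns, while a vector is read as a row whose elements are the cell values.

```julia
using DataFrames

# Top-level named tuple: each field goes to the column with the same name.
wide = DataFrame(f1 = Int[], f2 = Int[])
push!(wide, (f1 = 5, f2 = 4))

# Wrapped in a vector: the vector is the row, and the named tuple is its only cell.
cells = DataFrame(y = NamedTuple[])
push!(cells, [(f1 = 5, f2 = 4)])
```

After these calls, wide has one row and two Int columns, while cells has one row and a single column holding the whole named tuple.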

rocco_sprmnt21 · November 4, 2022, 10:23am

This statement confuses me.
I refer to the following expressions, which give the expected result.


df = DataFrame(A=[1,2,3], B=[:x,:y,:z], C=[(1,3), (-2,1),(4,4)])
dfh=DataFrame(s=0,id=0,a=missings(NamedTuple,1),r=missings(NamedTuple,1), m=missings(NamedTuple,1))


push!(dfh,(s=1,id=1,a=copy(df[1,:]),r=missing, m=missing))
push!(dfh,(s=2,id=2,a=copy(df[2,:]),r=missing, m=missing))
push!(dfh,(s=3,id=3,a=copy(df[3,:]),r=missing, m=missing))

bkamins · November 4, 2022, 10:39am

This is exactly what I state:

the (s=1,id=1,a=copy(df[1,:]),r=missing, m=missing) named tuple is always expanded to multiple columns, as it is top-level (not wrapped in anything);
the copy(df[1,:]) named tuple is not expanded and is treated as a single cell value, because it is wrapped (in this case in another named tuple).
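
The two cases above can be checked with a reduced version of the example. This is only a sketch: src and history are hypothetical names, and the columns are cut down to id and a.

```julia
using DataFrames

src = DataFrame(A = [1, 2], B = [:x, :y])

# copy of a DataFrameRow gives a plain NamedTuple.
row = copy(src[1, :])

history = DataFrame(id = Int[], a = NamedTuple[])

# The outer named tuple is top-level, so its fields id and a map to columns.
# The inner named tuple is a field value, so it is stored whole in column a.
push!(history, (id = 1, a = row))
```

The result has exactly the columns id and a; the fields A and B of the inner named tuple do not become columns.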
