DataFrames: Turning a column of ints into an array of ints per row dependent upon grouping of matching inputs

Hi all,

I want to turn this DataFrame (see first image)

df = DataFrame(Year = [1,2,3,1,2,3], Name=["Jack","Jack","Jack","Jill","Jill","Jill"], Gender=["M","M","M","F","F","F"], Cash=[10.0, 14.0, 12.0, 20.5, 21.5, 22.5], Grade=[1, 2, 3, 2, 4, 3])

into this DataFrame (see second image)

intended_df = DataFrame(Name = ["Jack", "Jill"], Gender=["M", "F"], Cash=[[10.0, 14.0, 12.0], [20.5, 21.5, 22.5]], Grade=[[1,2,3],[2,4,3]])

Basically, I’m taking the observed output values for the “Cash” and “Grade” column across years 1-3 on the same input columns (“Name” and “Gender”), and turning them into an array of values in order from years 1-3 such that I can drop the “Year” column from the DataFrame. Does anybody have a suggested command for me to do this please? In my actual DataFrame there are going to be more inputs columns and more output columns than the 2 (Name, Gender) and 2 (Cash, Grade) that I have there, but I imagine the principle will be the same still.

Thanks for your time, I hope I’ve explained the question in a sensible manner (should hopefully be a straightforward scenario). If you need any more information from me please feel free to comment and let me know.

What about this:

using DataFrames
df = DataFrame(Year=[1,2,3,1,2,3],
               Name=["Jack","Jack","Jack","Jill","Jill","Jill"],
               Gender=["M","M","M","F","F","F"],
               Cash=[10.0, 14.0, 12.0, 20.5, 21.5, 22.5],
               Grade=[1, 2, 3, 2, 4, 3])

combine(groupby(df, [:Name, :Gender]),
        [:Cash, :Grade] .=> Ref, renamecols=false)


2×4 DataFrame
 Row │ Name    Gender  Cash                Grade     
     │ String  String  SubArray…           SubArray… 
─────┼───────────────────────────────────────────────
   1 │ Jack    M       [10.0, 14.0, 12.0]  [1, 2, 3]
   2 │ Jill    F       [20.5, 21.5, 22.5]  [2, 4, 3]

(replace Ref with Ref∘copy if you want to have copies of the data in the new data frame, so that further changes won’t change df)

Explanation: groupby makes the groups that you want (one group for each row of result), and combine transforms each group into a single row: [:Cash, :Grade] .=> Ref means that each of the Cash and Grade column gets transformed by the Ref function. The Ref adds a small wrapper around the column so that it gets treated as a single value (to have the whole vector as a row value).

3 Likes

copy isn’t needed here because you can’t actually copy a single number, right? I.e.

julia> df1 = combine(groupby(df, [:Name, :Gender]),
               [:Cash, :Grade] .=> Ref, renamecols=false)
2×4 DataFrame
 Row │ Name    Gender  Cash                Grade
     │ String  String  SubArray…           SubArray…
─────┼───────────────────────────────────────────────
   1 │ Jack    M       [10.0, 14.0, 12.0]  [1, 2, 3]
   2 │ Jill    F       [20.5, 21.5, 22.5]  [2, 4, 3]

julia> df.Cash .= 1;

julia> df1
2×4 DataFrame
 Row │ Name    Gender  Cash                Grade
     │ String  String  SubArray…           SubArray…
─────┼───────────────────────────────────────────────
   1 │ Jack    M       [10.0, 14.0, 12.0]  [1, 2, 3]
   2 │ Jill    F       [20.5, 21.5, 22.5]  [2, 4, 3]

julia> df
6×5 DataFrame
 Row │ Year   Name    Gender  Cash   Grade
     │ Int64  String  String  Int64  Int64
─────┼─────────────────────────────────────
   1 │     1  Jack    M           1      1
   2 │     2  Jack    M           1      2
   3 │     3  Jack    M           1      3
   4 │     1  Jill    F           1      2
   5 │     2  Jill    F           1      4
   6 │     3  Jill    F           1      3

or do you mean in the more general case where Cash could be some mutable container?

No it is required (combine will call Ref or copy with the whole column as argument):

julia> df1 = combine(groupby(df, [:Name, :Gender]),
               [:Cash, :Grade] .=> Ref, renamecols=false)
2×4 DataFrame
 Row │ Name    Gender  Cash                Grade     
     │ String  String  SubArray…           SubArray… 
─────┼───────────────────────────────────────────────
   1 │ Jack    M       [10.0, 14.0, 12.0]  [1, 2, 3]
   2 │ Jill    F       [20.5, 21.5, 22.5]  [2, 4, 3]

julia> df.Cash[1] = 0
0

julia> df1
2×4 DataFrame
 Row │ Name    Gender  Cash                Grade     
     │ String  String  SubArray…           SubArray… 
─────┼───────────────────────────────────────────────
   1 │ Jack    M       [0.0, 14.0, 12.0]   [1, 2, 3]
   2 │ Jill    F       [20.5, 21.5, 22.5]  [2, 4, 3]

I think you got bitten by the special behavior of .= in DataFrames: it’s not in-place, it creates a new vector for the result (I also find it confusing).

1 Like

Hi,

Thank you both for your help. The solution with .=> Ref worked. I knew it’d be some combination of the groupby() and combine() function with something, so appreciate finding out that Ref was what I needed.

Thanks again.

Hi there,

Back on this again. If I wanted to revert the operation such that I get df from df1, what command could I use?

julia> flatten(df1, [:Cash, :Grade])
6×4 DataFrame
 Row │ Name    Gender  Cash     Grade
     │ String  String  Float64  Int64
─────┼────────────────────────────────
   1 │ Jack    M          10.0      1
   2 │ Jack    M          14.0      2
   3 │ Jack    M          12.0      3
   4 │ Jill    F          20.5      2
   5 │ Jill    F          21.5      4
   6 │ Jill    F          22.5      3
2 Likes