Dataframe to nested tuple?

I'd like to transform this:

DataFrame( GroupLetter = ['A','A','B','B'] , GroupID = [1,1,2,2], Col1 = [1,2,3,4], Col2 = [5,6,7,8] )

 │ GroupLetter │ GroupID │ Col1  │ Col2  │
 │ 'A'         │ 1       │ 1     │ 5     │
 │ 'A'         │ 1       │ 2     │ 6     │
 │ 'B'         │ 2       │ 3     │ 7     │
 │ 'B'         │ 2       │ 4     │ 8     │

Into this:

data = ( 
    A = (GroupID = 1, DF = DataFrame(Col1 = [1,2], Col2 = [5,6]) ),
    B = (GroupID = 2, DF = DataFrame(Col1 = [3,4], Col2 = [7,8]) )
)

So I can write

data.A.GroupID 

and

data.A.DF

giving

│ Col1  │ Col2  │
│ 1     │ 5     │
│ 2     │ 6     │

Is there an easy way to do this?
Also, can the nested tuple structure be written to and read back from a file?

julia> data = DataFrame( GroupLetter = ['A','A','B','B'] , GroupID = [1,1,2,2], Col1 = [1,2,3,4], Col2 = [5,6,7,8] )
4×4 DataFrame
│ Row │ GroupLetter │ GroupID │ Col1  │ Col2  │
│     │ Char        │ Int64   │ Int64 │ Int64 │
├─────┼─────────────┼─────────┼───────┼───────┤
│ 1   │ 'A'         │ 1       │ 1     │ 5     │
│ 2   │ 'A'         │ 1       │ 2     │ 6     │
│ 3   │ 'B'         │ 2       │ 3     │ 7     │
│ 4   │ 'B'         │ 2       │ 4     │ 8     │

julia> gdf = groupby(data, ["GroupLetter", "GroupID"])
GroupedDataFrame with 2 groups based on keys: GroupLetter, GroupID
First Group (2 rows): GroupLetter = 'A', GroupID = 1
│ Row │ GroupLetter │ GroupID │ Col1  │ Col2  │
│     │ Char        │ Int64   │ Int64 │ Int64 │
├─────┼─────────────┼─────────┼───────┼───────┤
│ 1   │ 'A'         │ 1       │ 1     │ 5     │
│ 2   │ 'A'         │ 1       │ 2     │ 6     │
⋮
Last Group (2 rows): GroupLetter = 'B', GroupID = 2
│ Row │ GroupLetter │ GroupID │ Col1  │ Col2  │
│     │ Char        │ Int64   │ Int64 │ Int64 │
├─────┼─────────────┼─────────┼───────┼───────┤
│ 1   │ 'B'         │ 2       │ 3     │ 7     │
│ 2   │ 'B'         │ 2       │ 4     │ 8     │

julia> gdf[(GroupLetter = 'A', GroupID = 1)]
2×4 SubDataFrame
│ Row │ GroupLetter │ GroupID │ Col1  │ Col2  │
│     │ Char        │ Int64   │ Int64 │ Int64 │
├─────┼─────────────┼─────────┼───────┼───────┤
│ 1   │ 'A'         │ 1       │ 1     │ 5     │
│ 2   │ 'A'         │ 1       │ 2     │ 6     │

Not quite what you asked for, but maybe close enough to be useful.
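
If you do want the exact nested structure from the question, here is a minimal sketch of one way to build it from the grouped frame. It assumes each GroupLetter goes with exactly one GroupID (as in the example data) and that the letters make valid field names. The names `entries`, `nested`, `path` and the file name `nested.jls` are made up for illustration. The last lines round-trip the result through the Serialization standard library, which generally expects the same Julia and package versions when reading as when writing.

```julia
using DataFrames
using Serialization

data = DataFrame(GroupLetter = ['A', 'A', 'B', 'B'], GroupID = [1, 1, 2, 2],
                 Col1 = [1, 2, 3, 4], Col2 = [5, 6, 7, 8])

gdf = groupby(data, [:GroupLetter, :GroupID])

# One `name => value` pair per group. The group key supplies the letter and
# the ID; the grouping columns are dropped from the per-group DataFrame.
# Assumption: every GroupLetter appears with a single GroupID.
entries = [Symbol(k.GroupLetter) =>
               (GroupID = k.GroupID,
                DF = select(g, Not([:GroupLetter, :GroupID])))
           for (k, g) in zip(keys(gdf), gdf)]

# Splat the pairs into a NamedTuple so fields can be reached with dot syntax.
nested = (; entries...)

nested.A.GroupID   # the ID stored for group A
nested.A.DF        # DataFrame holding only Col1 and Col2 for group A

# Write the nested structure to a file and read it back (hypothetical path).
path = joinpath(mktempdir(), "nested.jls")
serialize(path, nested)
restored = deserialize(path)
```

A NamedTuple is immutable and its field names are part of its type, so this suits a small, fixed set of groups. For many groups, or groups only known at run time, a Dict keyed by the letter is probably the better fit.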

Edit: Remember when doing this kind of thing that piping (conveniently via Pipe.jl or Chain.jl) is always more concise (originally written as "more performant").

I don’t mean to derail this but piping should not be any more or less performant than other ways of writing the same code; it’s just syntax.

Bad wording, sorry.

Or gdf[('A', 1)] if you want to avoid writing out the grouping column names.

Thank you both.
It seems that if there is only one grouping column, you need a trailing comma to make it a one-element tuple:

gdf = groupby(data, :GroupLetter)

gdf[('A',)]

Yeah, tuple(x) looks slightly less clumsy imo
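
For reference, the single-column lookups discussed above can be sketched side by side. This assumes the same example data as in the question, where GroupLetter holds Char values, so the key is 'A' rather than the Symbol :A. The names `gdf1` and `a1` to `a4` are only for illustration.

```julia
using DataFrames

data = DataFrame(GroupLetter = ['A', 'A', 'B', 'B'], GroupID = [1, 1, 2, 2],
                 Col1 = [1, 2, 3, 4], Col2 = [5, 6, 7, 8])

gdf1 = groupby(data, :GroupLetter)

a1 = gdf1[('A',)]                # one-element tuple, trailing comma required
a2 = gdf1[tuple('A')]            # same key built with tuple()
a3 = gdf1[(GroupLetter = 'A',)]  # named tuple key, also needs the comma
a4 = gdf1[1]                     # lookup by group position instead of key
```

Without the comma, ('A') is just the Char 'A' in parentheses, which is why the tuple forms above are needed.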