I'd like to transform this:
DataFrame( GroupLetter = ['A','A','B','B'] , GroupID = [1,1,2,2], Col1 = [1,2,3,4], Col2 = [5,6,7,8] )
│ GroupLetter │ GroupID │ Col1 │ Col2 │
│ 'A' │ 1 │ 1 │ 5 │
│ 'A' │ 1 │ 2 │ 6 │
│ 'B' │ 2 │ 3 │ 7 │
│ 'B' │ 2 │ 4 │ 8 │
Into this:
data = (
A = (GroupID = 1, DF = DataFrame(Col1 = [1,2], Col2 = [5,6]) ),
B = (GroupID = 2, DF = DataFrame(Col1 = [3,4], Col2 = [7,8]) )
)
So I can write
data.A.GroupID
and
data.A.DF
giving
│ Col1 │ Col2 │
│ 1 │ 5 │
│ 2 │ 6 │
Is there an easy way to do this?
And can the nested tuple structure be written to and read from a file?
julia> data = DataFrame( GroupLetter = ['A','A','B','B'] , GroupID = [1,1,2,2], Col1 = [1,2,3,4], Col2 = [5,6,7,8] )
4×4 DataFrame
│ Row │ GroupLetter │ GroupID │ Col1 │ Col2 │
│ │ Char │ Int64 │ Int64 │ Int64 │
├─────┼─────────────┼─────────┼───────┼───────┤
│ 1 │ 'A' │ 1 │ 1 │ 5 │
│ 2 │ 'A' │ 1 │ 2 │ 6 │
│ 3 │ 'B' │ 2 │ 3 │ 7 │
│ 4 │ 'B' │ 2 │ 4 │ 8 │
julia> gdf = groupby(data, ["GroupLetter", "GroupID"])
GroupedDataFrame with 2 groups based on keys: GroupLetter, GroupID
First Group (2 rows): GroupLetter = 'A', GroupID = 1
│ Row │ GroupLetter │ GroupID │ Col1 │ Col2 │
│ │ Char │ Int64 │ Int64 │ Int64 │
├─────┼─────────────┼─────────┼───────┼───────┤
│ 1 │ 'A' │ 1 │ 1 │ 5 │
│ 2 │ 'A' │ 1 │ 2 │ 6 │
⋮
Last Group (2 rows): GroupLetter = 'B', GroupID = 2
│ Row │ GroupLetter │ GroupID │ Col1 │ Col2 │
│ │ Char │ Int64 │ Int64 │ Int64 │
├─────┼─────────────┼─────────┼───────┼───────┤
│ 1 │ 'B' │ 2 │ 3 │ 7 │
│ 2 │ 'B' │ 2 │ 4 │ 8 │
julia> gdf[(GroupLetter = 'A', GroupID = 1)]
2×4 SubDataFrame
│ Row │ GroupLetter │ GroupID │ Col1 │ Col2 │
│ │ Char │ Int64 │ Int64 │ Int64 │
├─────┼─────────────┼─────────┼───────┼───────┤
│ 1 │ 'A' │ 1 │ 1 │ 5 │
│ 2 │ 'A' │ 1 │ 2 │ 6 │
Not quite what you asked for but maybe close enough that you like it.
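If you do want the exact nested structure from the question, here is a minimal sketch of one way to build it from the grouped frame. It assumes a reasonably recent DataFrames.jl, that each group letter appears with exactly one GroupID, and that the letters make valid field names; the name `nested` is just my choice for illustration.

```julia
using DataFrames

df = DataFrame(GroupLetter = ['A','A','B','B'], GroupID = [1,1,2,2],
               Col1 = [1,2,3,4], Col2 = [5,6,7,8])

gdf = groupby(df, [:GroupLetter, :GroupID])

# Build one `name => value` pair per group and splat the pairs into a NamedTuple.
# The field name is the group letter converted to a Symbol ('A' becomes :A).
nested = (; (Symbol(key.GroupLetter) =>
                 (GroupID = key.GroupID,
                  # select on a SubDataFrame returns a fresh DataFrame
                  # without the grouping columns
                  DF = select(gdf[key], Not([:GroupLetter, :GroupID])))
             for key in keys(gdf))...)

nested.A.GroupID   # the ID stored for group A
nested.A.DF        # DataFrame holding only Col1 and Col2 for group A
```

A NamedTuple with many fields gets heavy on the compiler, so for more than a handful of groups a `Dict` keyed by the letter (or just the `GroupedDataFrame` itself) is probably the saner choice.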
Edit: Remember when doing such a thing that piping (conveniently via Pipe.jl or Chain.jl) is always more concise (I originally wrote "more performant").
I don’t mean to derail this but piping should not be any more or less performant than other ways of writing the same code; it’s just syntax.
or gdf[('A', 1)]
if you want to avoid passing the grouping column names.
Thank you both.
It seems that if there is only one grouping column, you need a trailing comma in the tuple:
gdf = groupby(data, :GroupLetter)
gdf[ ( 'A', ) ]
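To spell out why the comma matters, here is a small sketch using the Char-valued GroupLetter column from the example data above. Parentheses alone only group an expression; the trailing comma is what makes a one-element tuple. The variable names `by_tuple` and `by_name` are just for illustration.

```julia
using DataFrames

data = DataFrame(GroupLetter = ['A','A','B','B'], GroupID = [1,1,2,2],
                 Col1 = [1,2,3,4], Col2 = [5,6,7,8])
gdf = groupby(data, :GroupLetter)

('A') isa Char      # true: parentheses alone do not create a tuple
('A',) isa Tuple    # true: the trailing comma does

# The key must match the column's element type, here Char rather than Symbol.
by_tuple = gdf[('A',)]                 # positional key
by_name  = gdf[(GroupLetter = 'A',)]   # named key, also needs the comma
```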
Yeah, tuple(x)
looks slightly less clumsy imo
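On the second part of the original question (writing the nested structure to and from a file), which nobody picked up: one option that I believe works is the `Serialization` standard library. This is only a sketch; the file name is made up, and the serialized format is not guaranteed to be readable across different Julia or package versions, so I would treat it as short-term storage rather than an archive format.

```julia
using DataFrames, Serialization

nested = (A = (GroupID = 1, DF = DataFrame(Col1 = [1,2], Col2 = [5,6])),
          B = (GroupID = 2, DF = DataFrame(Col1 = [3,4], Col2 = [7,8])))

# Hypothetical file name inside a temporary directory
path = joinpath(mktempdir(), "nested.jls")

serialize(path, nested)        # write the whole nested NamedTuple to disk
restored = deserialize(path)   # read it back with the same structure

restored.A.GroupID             # same access pattern as before
restored.A.DF == nested.A.DF   # compare the round-tripped DataFrame
```

For long-term or cross-language storage, writing each group's DataFrame to its own CSV or Arrow file would likely be more robust, at the cost of rebuilding the nesting yourself.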