Iterating over (key, value) with grouped DataFrame

Not sure if I’m being dense here, but is there a simple way to iterate over a GroupedDataFrame with the grouping keys available at each iteration? I.e., something like:

using DataFrames

df = DataFrame(a = [1,1,2,2], b = randn(4))
gdf = groupby(df, :a)

for (keys, subdf) in iterator_im_looking_for(gdf)
    println((keys, subdf))
end
# Desired output: (NamedTuple, SubDataFrame) pairs, or something similar
#
#((a=1,), 2×2 SubDataFrame
#│ Row │ a     │ b        │
#│     │ Int64 │ Float64  │
#├─────┼───────┼──────────┤
#│ 1   │ 1     │ 0.109089 │
#│ 2   │ 1     │ 0.107033 │)
#((a=2,), 2×2 SubDataFrame
#│ Row │ a     │ b        │
#│     │ Int64 │ Float64  │
#├─────┼───────┼──────────┤
#│ 1   │ 2     │ 1.29613  │
#│ 2   │ 2     │ -2.33027 │)

Is this what you are looking for? https://github.com/JuliaData/DataFrames.jl/pull/1908.

Until that is released, you can get the grouping variables as a data frame using e.g. collect(x -> first(x)[groupvars(parent(x))], gdf)

Ha, that’s exactly what I’m looking for. Thanks!

This code doesn’t work, but this does:

[parent(gdf)[i, groupvars(gdf)] for i in gdf.starts]

Sorry, I was writing from my head and mixed up collect with combine. This is what works as an example:

select!(combine(first, gdf), groupvars(gdf))

The solution using gdf.starts works, but it relies on the internal, undocumented starts field, which is not guaranteed to be supported in the future.

Re-opening this thread because this does not appear to work anymore. This does however:

for (k,v) in zip(first.(keys(gdf)), gdf)

Perhaps there is a better / more succinct way of doing this?

@gobs, perhaps like this:

for (k,v) in pairs(gdf)
    println(k, v)
end

To elaborate on @rafael.guerra’s answer, you can also get the key value directly via destructuring:

for ((k,), v) in pairs(gdf)
    println(k => v)
end

or to get the key by name (this property destructuring syntax requires Julia 1.7 or later):

for ((; a), v) in pairs(gdf)
    println(a => v)
end
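
For completeness, here is a small self-contained sketch of the pairs(gdf) approach that does something with each group instead of just printing it. The fixed values in b and the sums vector are my own choices for illustration (the original example used randn), so treat it as a sketch rather than the canonical way to do this:

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 2], b = [1.0, 2.0, 3.0, 4.0])
gdf = groupby(df, :a)

# Collect one (key value => group sum) pair per group.
sums = Pair{Int, Float64}[]
for (key, subdf) in pairs(gdf)
    # key is a GroupKey: its fields can be read by column name or by position
    @assert key.a == key[1]
    # subdf is a SubDataFrame holding the rows of this group
    push!(sums, key.a => sum(subdf.b))
end

println(sums)
```

With several grouping columns the key simply has more fields (key.a, key.c, ...), so the same loop should carry over unchanged.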

The solution of @bkamins to get the list of keys and key values still works except that groupvars was renamed to groupcols:

julia> select!(combine(first, gdf), groupcols(gdf))
2×1 DataFrame
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2

but I think the modern way to do that is simply

julia> keys(gdf)
2-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (a = 1,)
 GroupKey: (a = 2,)

and you can do something like NamedTuple.(keys(gdf)) or DataFrame(keys(gdf)) if you want a more convenient data structure.
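
A quick sketch of those two conversions, plus looking a group back up by its key. The data and the variable names (key_tuples, key_table, sub) are just made up for the example:

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 2], b = [1.0, 2.0, 3.0, 4.0])
gdf = groupby(df, :a)

# One NamedTuple per group, e.g. (a = 1,)
key_tuples = NamedTuple.(keys(gdf))

# One row per group, one column per grouping column
key_table = DataFrame(keys(gdf))

# A group can be looked up again by a NamedTuple of its key values
sub = gdf[(a = 2,)]

println(key_tuples)
println(key_table)
println(sub)
```

Indexing with the GroupKey itself (as obtained from keys(gdf)) should work as well, and as far as I know it is the more efficient option.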

I think eachindex(gdf) also works for a GroupedDataFrame.