Iterating over (key, value) with grouped DataFrame

ElOceanografo · August 9, 2019, 7:37pm

Not sure if I’m being dense here, but is there a simple way to iterate over a GroupedDataFrame, having the grouping levels available at each iteration? i.e. something like

df = DataFrame(a = [1,1,2,2], b = randn(4))
gdf = groupby(df, :a)

for (keys, subdf) in iterator_im_looking_for(gdf)
    println((keys, subdf))
end
# Desired output: (NamedTuple, SubDataFrame) pairs, or something similar
#
#((a=1,), 2×2 SubDataFrame
#│ Row │ a     │ b        │
#│     │ Int64 │ Float64  │
#├─────┼───────┼──────────┤
#│ 1   │ 1     │ 0.109089 │
#│ 2   │ 1     │ 0.107033 │)
#((a=2,), 2×2 SubDataFrame
#│ Row │ a     │ b        │
#│     │ Int64 │ Float64  │
#├─────┼───────┼──────────┤
#│ 1   │ 2     │ 1.29613  │
#│ 2   │ 2     │ -2.33027 │)

bkamins · August 9, 2019, 7:52pm

Is this what you are looking for? https://github.com/JuliaData/DataFrames.jl/pull/1908.

Until that is released, you can get the grouping variables as a data frame using, e.g., collect(x -> first(x)[groupvars(parent(x))], gdf)

ElOceanografo · August 9, 2019, 8:14pm

Ha, that’s exactly what I’m looking for. Thanks!

ElOceanografo · August 9, 2019, 8:29pm

This code doesn’t work, but this does:

[parent(gdf)[i, groupvars(gdf)] for i in gdf.starts]

bkamins · August 9, 2019, 9:08pm

Sorry, I was writing from memory and mixed up collect with combine. This works as an example:

select!(combine(first, gdf), groupvars(gdf))

The solution from the linked PR using gdf.starts works, but it is using an internal, undocumented starts field that is not guaranteed to be supported in the future.

gobs · January 11, 2022, 10:38am

Re-opening this thread because this no longer appears to work. This does, however:

for (k,v) in zip(first.(keys(gdf)), gdf)

Perhaps there is a better / more succinct way of doing this?

rafael.guerra · January 11, 2022, 11:53am

@gobs, perhaps like this:

for (k,v) in pairs(gdf)
    println(k, v)
end

sijo · January 11, 2022, 12:13pm

To elaborate on @rafael.guerra’s answer, you can also get the key value directly via destructuring:

for ((k,), v) in pairs(gdf)
    println(k => v)
end

or to get the key by name (requires Julia 1.7 or later):

for ((; a), v) in pairs(gdf)
    println(a => v)
end

The solution of @bkamins to get the list of keys and key values still works, except that groupvars was renamed to groupcols:

julia> select!(combine(first, gdf), groupcols(gdf))
2×1 DataFrame
 Row │ a     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2

but I think the modern way to do that is simply

julia> keys(gdf)
2-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (a = 1,)
 GroupKey: (a = 2,)

and you can do something like NamedTuple.(keys(gdf)) or DataFrame(keys(gdf)) if you want a more convenient data structure.
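Putting the pieces above together, here is a minimal sketch, assuming a recent DataFrames.jl release (1.x) where pairs(gdf) yields GroupKey => SubDataFrame pairs. The data and the variable names key_tuples and key_table are made up for illustration; b is fixed rather than random so the result is reproducible.

```julia
using DataFrames

# Toy data mirroring the original question.
df = DataFrame(a = [1, 1, 2, 2], b = [0.1, 0.2, 0.3, 0.4])
gdf = groupby(df, :a)

# pairs(gdf) yields (GroupKey, SubDataFrame) pairs;
# NamedTuple(key) converts each GroupKey to a plain NamedTuple.
for (key, subdf) in pairs(gdf)
    nt = NamedTuple(key)
    println(nt, " => ", nrow(subdf), " rows")
end

# Collect all keys at once, as a vector of NamedTuples or as a DataFrame.
key_tuples = NamedTuple.(keys(gdf))
key_table = DataFrame(keys(gdf))
```

This gives the (NamedTuple, SubDataFrame) pairs asked for in the first post without touching internal fields such as gdf.starts.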

phantom · October 18, 2022, 9:36pm

I think for a GroupedDataFrame eachindex(gdf) also works.
