DataFrame: how to `groupby` then index with unspecified keys (merging them)

gdf = groupby(df, [:gender, :nationality, :hair_color]);

we know we can do gdf[(; gender = :X, nationality = :US, hair_color = :blue)] to get a specific subdataframe, but this doesn't work:

gdf[(; gender = :X, nationality = :US)]

the desired behavior is to merge all hair_color variations
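Concretely, the hoped-for partial lookup would behave like concatenating every matching group by hand. A minimal sketch of that semantics, using the column names from the question but with invented data (the partial `gdf[(; gender, nationality)]` lookup itself does not exist; the last expression shows what it would ideally be equivalent to):

```julia
using DataFrames

df = DataFrame(gender = [:X, :X, :Y],
               nationality = [:US, :US, :US],
               hair_color = [:blue, :red, :blue],
               val = 1:3)
gdf = groupby(df, [:gender, :nationality, :hair_color])

# what gdf[(; gender = :X, nationality = :US)] would ideally return:
# every hair_color group for that gender/nationality, concatenated
merged = reduce(vcat, (DataFrame(g) for g in gdf
                       if g.gender[1] == :X && g.nationality[1] == :US))
```

Here `merged` contains the rows for both the `:blue` and `:red` hair-color groups.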

I’m not sure I understand what you’re looking for, but maybe if you do a looser groupby you’ll have the “merge” you want

gdf_hair_color = groupby(df, [:gender, :nationality])

Yeah, but I SOMETIMES want the hair_color; I want to avoid making a new gdf for every single combination of columns I need.

I think this has been asked before, and it is not possible. You will need to re-group.

Yes I recall @bkamins answering this before as well, unfortunately the discourse search is hopeless so I can’t find the thread (or maybe it was just on slack)

then something like this

using DataFrames

df = DataFrame(g = rand(["m", "f"], 20), n = rand(1:5, 20),
               hc = rand(['b', 'w', 'B', 'g'], 20), val = rand(20))
gdf = groupby(df, [:g, :n, :hc])

df[df.g .== "f" .&& df.n .== 5, :]


or one of these

filter(x->x.g[1]=="f" && x.n[1]==5, gdf)
filter(x->x.g[1]=="f" && x.n[1]==5, gdf, ungroup=true)


This will be most efficient in general. Why do you want to avoid it?

Because I'm doing exploratory work and looking at many potential feature combos?

Then if you find a case when doing grouping several times on different columns is inconvenient please let me know and we will think if we can add what you ask for (note though that it will not be fast as it will have O(number of groups) cost, as opposed to O(1) cost of current lookup).

I mean it is inconvenient simply because I have to give my gdf a different name every single time, nothing else. I would have:

gdf1= groupby(df, [:gender])
gdf2 = groupby(df, [:gender, :nationality])
gdf3 = groupby(df, [:gender, :nationality, :hair_color])
gdf4 = groupby(df, [:gender, :hair_color])
gdf5 = groupby(df,[:nationality, :hair_color])
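If the naming is the only pain point, one workaround is to cache the groupings in a `Dict` keyed by the (sorted) column list, so each combination is computed once and never needs its own variable. `getgdf` below is a hypothetical helper sketched for this thread, not part of DataFrames.jl:

```julia
using DataFrames

df = DataFrame(gender = rand([:X, :Y], 20),
               nationality = rand([:US, :FR], 20),
               hair_color = rand([:blue, :red], 20))

# hypothetical helper: one cached groupby per column combination
# (sorting the columns normalizes the cache key; it also fixes the
# grouping-column order, which is fine if you index by NamedTuple)
cache = Dict{Vector{Symbol}, GroupedDataFrame}()
getgdf(df, cols) = get!(() -> groupby(df, sort(cols)), cache, sort(cols))

g1 = getgdf(df, [:gender, :nationality])
g2 = getgdf(df, [:nationality, :gender])  # reuses the cached grouping
```

With this, `g1` and `g2` are the same object and the underlying `groupby` ran only once.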

I don't think people group and then look up groups in a loop. Isn't it far more likely that people:

- do split-apply-combine in a loop (so one groupby per iteration)?
This should be useful.

julia> function regroup(gd; kwargs...)
           # keep every group whose key matches all of the passed
           # column => value pairs; the omitted grouping columns
           # are effectively merged in the concatenated result
           matching = [k for k in keys(gd)
                       if all(k[c] == v for (c, v) in pairs(kwargs))]
           return reduce(vcat, (DataFrame(gd[k]) for k in matching))
       end

Overriding getindex should be straightforward from here.


Thanks, but I really just want to know whether something already exists; I don't have a good place to put this utility function.

For general Julia collections and for many table types, see group + addmargins from FlexiGroups.jl:

using FlexiGroups

tbl = ...
gm = group(x -> (;x.gender, x.nationality, x.hair_color), tbl) |> addmargins
# these should work - use `total` to select
# the group containing all values of the corresponding parameter:
gm[(; gender = :X, nationality = :US, hair_color = :blue)]
gm[(; gender = :X, nationality = :US, hair_color = total)]
gm[(; gender = total, nationality = :US, hair_color = total)]

For dense multidimensional grouping (as your case seems to be), keyed arrays can be more convenient than dictionaries (the default). See the docs; FlexiGroups supports this as well.

I don't think DataFrames and FlexiGroups work together directly, though, so the above can only be applied indirectly in your specific case, since you start from a DataFrame.
