DataFrames: how to `groupby` then index with unspecified keys (merging them)

gdf = groupby(df, [:gender, :nationality, :hair_color]);

We know we can do `gdf[(; gender = :X, nationality = :US, hair_color = :blue)]` to get a specific sub-dataframe, but this doesn’t work:

gdf[(; gender = :X, nationality = :US)]

The desired behavior is to merge all `hair_color` variations.
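
In other words, something equivalent to manually concatenating the matching groups, e.g. (just a sketch of the result I’m after):

reduce(vcat, [gdf[k] for k in keys(gdf) if k.gender == :X && k.nationality == :US])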

I’m not sure I understand what you’re looking for, but maybe if you do a looser groupby you’ll have the “merge” you want

gdf_hair_color = groupby(df, [:gender, :nationality])
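
and then the two-column lookup should work directly:

gdf_hair_color[(; gender = :X, nationality = :US)]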

Yeah, but I *sometimes* want the `hair_color`; I want to avoid making a new gdf for every single combination of columns I need.

I think this has been asked before and it is not possible. You will need to re-group, I think.


Yes, I recall @bkamins answering this before as well; unfortunately the Discourse search is hopeless, so I can’t find the thread (or maybe it was just on Slack).


then something like this

using DataFrames

df = DataFrame(g=rand(["m","f"], 20), n=rand(1:5, 20), hc=rand(['b','w','B','g'], 20), val=rand(20))
gdf = groupby(df, [:g, :n, :hc])

df[df.g .== "f" .&& df.n .== 5, :]

?

or one of these


filter(x -> x.g[1] == "f" && x.n[1] == 5, gdf)
filter(x -> x.g[1] == "f" && x.n[1] == 5, gdf, ungroup=true)

?

This will be most efficient in general. Why do you want to avoid it?


Because I’m doing exploratory work and looking at many potential feature combos?

Then, if you find a case where grouping several times on different columns is inconvenient, please let me know and we will think about whether we can add what you ask for (note, though, that it will not be fast, as it will have O(number of groups) cost, as opposed to the O(1) cost of the current lookup).

I mean it is inconvenient simply because I have to type out and give my gdf a different name every time, nothing else. I would have:

gdf1 = groupby(df, [:gender])
gdf2 = groupby(df, [:gender, :nationality])
gdf3 = groupby(df, [:gender, :nationality, :hair_color])
gdf4 = groupby(df, [:gender, :hair_color])
gdf5 = groupby(df, [:nationality, :hair_color])
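
I guess I could key them by the column list instead of naming each one (just a sketch, the `gdfs` Dict is made up), but that is still one groupby per combination up front:

gdfs = Dict(cols => groupby(df, cols) for cols in
            ([:gender], [:gender, :nationality], [:gender, :nationality, :hair_color],
             [:gender, :hair_color], [:nationality, :hair_color]))
gdfs[[:gender, :nationality]][(; gender = :X, nationality = :US)]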

I don’t think people group and look up groups in a loop? Isn’t it far more likely that people:

make split-apply-combine in a loop (so one groupby per iteration)
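
For example (a sketch; it assumes `df` also has a numeric `val` column, like the toy example above, and uses Combinatorics.jl only to enumerate the column subsets):

using Combinatorics, Statistics

for cols in combinations([:gender, :nationality, :hair_color])
    # one groupby per iteration, immediately followed by the aggregation
    @show combine(groupby(df, collect(cols)), :val => mean)
end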

This should be useful.

julia> function regroup(gd; kwargs...)
           # keep only the groups whose key matches every supplied keyword;
           # grouping columns not mentioned in kwargs are left unconstrained
           selected = [k for k in keys(gd) if all(k[c] == v for (c, v) in pairs(kwargs))]
           return gd[selected]
       end
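
For example, with the `gdf` from the top of the thread this should return all the groups with `gender == :X` and `nationality == :US`, whatever their `hair_color`:

julia> regroup(gdf, gender = :X, nationality = :US)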

Overloading `getindex` should be trivial from here.


Thanks, but I really just want to know if something already exists; I don’t have a good place to put this utility function.

For general Julia collections and for many table types, see group + addmargins from FlexiGroups.jl:

using FlexiGroups

tbl = ...
gm = group(x -> (;x.gender, x.nationality, x.hair_color), tbl) |> addmargins
# these should work - use `total` to select
# the group containing all values of the corresponding parameter:
gm[(; gender = :X, nationality = :US, hair_color = :blue)]
gm[(; gender = :X, nationality = :US, hair_color = total)]
gm[(; gender = total, nationality = :US, hair_color = total)]

For dense multidimensional grouping (as your case seems to be), keyed arrays can be convenient instead of dictionaries (the default). See the docs; FlexiGroups supports this as well.

I don’t think DataFrames and FlexiGroups work together, though, so the above can only be applied indirectly in your specific case, as you start from a DataFrame.
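
One possible (untested) workaround is to go through a row table first, since a vector of NamedTuples is a plain Julia collection that `group` handles:

using Tables

rows = Tables.rowtable(df)  # Vector of NamedTuples, one per row
gm = group(r -> (; r.gender, r.nationality, r.hair_color), rows) |> addmargins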
