gdf = groupby(df, [:gender, :nationality, :hair_color]);
We know we can do
gdf[(; gender = :X, nationality = :US, hair_color = :blue)]
to get a specific sub-dataframe, but this doesn’t work:
gdf[(; gender = :X, nationality = :US)]
The desired behavior is to merge all subgroups that match the partial key (here, every hair_color group for that gender/nationality).
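For concreteness, a minimal sketch with a toy df (columns match the thread; the values are made up):

```julia
using DataFrames

df = DataFrame(gender = [:X, :X, :Y], nationality = [:US, :US, :FR],
               hair_color = [:blue, :red, :blue], val = 1:3)
gdf = groupby(df, [:gender, :nationality, :hair_color])

gdf[(; gender = :X, nationality = :US, hair_color = :blue)]  # one group: works
# gdf[(; gender = :X, nationality = :US)]  # errors: the key must name all grouping columns
```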
I’m not sure I understand what you’re looking for, but maybe if you do a looser groupby you’ll have the “merge” you want
gdf_hair_color = groupby(df, [:gender, :nationality])
Yeah, but I SOMETIMES want the hair_color; I want to avoid making a new gdf for every single combination of columns I need.
I think this has been asked before and this is not possible. You will need to re-group I think.
Yes I recall
@bkamins answering this before as well, unfortunately the discourse search is hopeless so I can’t find the thread (or maybe it was just on slack)
then something like this
df=DataFrame(g=rand(["m","f"], 20), n=rand(1:5, 20), hc=rand(['b','w','B','g'],20), val=rand(20))
gdf = groupby(df, [:g, :n, :hc])
df[df.g.=="f" .&& df.n.==5,:]
or one of these
filter(x -> first(x.g) == "f" && first(x.n) == 5, gdf)
filter(x -> first(x.g) == "f" && first(x.n) == 5, gdf, ungroup=true)
(note that when filtering a GroupedDataFrame the predicate receives a whole group, so compare against the first element of each key column)
This will be most efficient in general. Why do you want to avoid it?
Because I’m doing exploratory work and looking at many potential feature combos?
Then if you find a case when doing grouping several times on different columns is inconvenient please let me know and we will think if we can add what you ask for (note though that it will not be fast as it will have O(number of groups) cost, as opposed to O(1) cost of current lookup).
I mean, it is inconvenient simply because I have to type out and name my gdf differently every time, nothing else. For example, I would have:
gdf1 = groupby(df, [:gender])
gdf2 = groupby(df, [:gender, :nationality])
gdf3 = groupby(df, [:gender, :nationality, :hair_color])
gdf4 = groupby(df, [:gender, :hair_color])
gdf5 = groupby(df, [:nationality, :hair_color])
I don’t think people group and look up groups in a loop? Isn’t it far more likely that people:
make split apply combine in a loop (so one groupby per iteration)
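That looping pattern might look like this (a hedged sketch; the mean over a made-up val column is just a stand-in aggregation):

```julia
using DataFrames, Statistics

df = DataFrame(gender = rand([:X, :Y], 20), nationality = rand([:US, :FR], 20),
               hair_color = rand([:blue, :red], 20), val = rand(20))

# one groupby per iteration: group, aggregate, move on -- no lookup into old groupings
for cols in ([:gender], [:gender, :nationality], [:nationality, :hair_color])
    res = combine(groupby(df, cols), :val => mean => :mean_val)
    @show cols nrow(res)
end
```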
This should be useful.
function regroup(gd; kwargs...)
    # keep every group whose key matches all of the passed column => value pairs
    matching = [k for k in keys(gd) if all(k[c] == v for (c, v) in pairs(kwargs))]
    return gd[matching]
end
getindex should be trivial from here.
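Hypothetical usage, with the sketch above condensed to one line so the snippet is self-contained (toy data, assumed column names):

```julia
using DataFrames

# regroup as sketched above: keep the groups whose key matches the given key subset
regroup(gd; kwargs...) =
    gd[[k for k in keys(gd) if all(k[c] == v for (c, v) in pairs(kwargs))]]

df = DataFrame(gender = rand([:X, :Y], 30), nationality = rand([:US, :FR], 30),
               hair_color = rand([:blue, :red, :black], 30), val = rand(30))
gdf = groupby(df, [:gender, :nationality, :hair_color])

sub = regroup(gdf; gender = :X, nationality = :US)  # every hair_color group for X/US
```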
Thanks, but I really just want to know if something like this already exists; I don’t have a good place to put this utility function.
For general Julia collections and for many table types, FlexiGroups supports this:
tbl = ...
gm = group(x -> (;x.gender, x.nationality, x.hair_color), tbl) |> addmargins
# these should work - use `total` to select
# the group containing all values of the corresponding parameter:
gm[(; gender = :X, nationality = :US, hair_color = :blue)]
gm[(; gender = :X, nationality = :US, hair_color = total)]
gm[(; gender = total, nationality = :US, hair_color = total)]
For dense multidimensional grouping (as your case seems to be), keyed arrays can be more convenient than dictionaries (the default); see the docs. FlexiGroups supports this as well.
I don’t think DataFrames and FlexiGroups work together though, so the above can only be applied indirectly in your specific case, since you start from a DataFrame.