I have the impression that this is not possible… not with Julia, anyway…
In Python, it’s quite easy: you can group the lines of a dataframe based on a condition (or even a random vector as long as it has the same number of rows as the dataframe) and then perform any operation…
Let’s say an event occurs every week or so… I want to divide the events into two buckets: the ones where the next event happened 6, 7, or 8 days later, and the rest… And then use the nrow function on each group.
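For reference, here is a minimal sketch of what this looks like on the pandas side. The data and the column name `gap_days` are made up for illustration; the only point is that `groupby` accepts a boolean Series aligned with the frame, so no helper column is ever stored.

```python
import pandas as pd

# Hypothetical data: days elapsed between consecutive events.
df = pd.DataFrame({"gap_days": [7, 6, 9, 8, 5, 7, 12]})

# Boolean Series aligned with df's index; it is never added to df.
about_a_week = df["gap_days"].isin([6, 7, 8])

# Group by the Series directly and count the rows in each bucket.
counts = df.groupby(about_a_week).size()
print(counts)
```

With this made-up data, the `True` bucket holds four rows and the `False` bucket three, and `df` still has only its original column.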
I can still add another column where I put the information and do the groupby but I find it corny…
I can’t believe it’s not possible without adding a column…
My question was specifically if it is possible to perform a Groupby based on a condition, though…
To obtain two GroupedDataFrames, if the result is true or false…
julia> using TidierData
julia> @chain df begin
           @group_by(gb = contains(first, 'a'))
           @mutate(
               first = lowercase(first),
               last = uppercase(last)
           )
       end
GroupedDataFrame with 2 groups based on key: gb
First Group (2 rows): gb = false
 Row │ first   last      gb
     │ String  String    Bool
─────┼────────────────────────
   1 │ chris   ZEND      false
   2 │ jeff    BEZANSON  false
⋮
Last Group (2 rows): gb = true
 Row │ first   last          gb
     │ String  String        Bool
─────┼───────────────────────────
   1 │ mark    KITTISOPIKUL  true
   2 │ stefan  KARPINSKI     true
Sure, feel free to continue this thread, ideally with a minimal working example of what you’re getting in pandas and trying to recreate.
OP, you are right that “group a data frame by something that is not a persistent column in the data frame” is something that is not possible in DataFrames.jl and is unlikely to be added in the future.
The implementation of GroupedDataFrame relies on the grouping column being an existing, named column in the data frame. As mentioned above, DataFramesMeta.jl should probably add this feature: something that groups and transforms in a single step, dropping the grouping column after the transformation. But it’s a low priority because it’s not that hard to just make a column.
Well, in pandas, you can do such things: df.groupby(df['Sales Rep'].str.split(' ').str[0]).size()
which counts the number of people with the same first name
or use the function pd.Grouper, which makes it easy to resample a DataFrame by a column of dates: df.groupby(pd.Grouper(key = 'Date', freq = 'Q')).size()
I admit I’m interested in Julia for its performance, but it’s less user-friendly than Python-Pandas…
Hm, I don’t know pandas that well (anymore; back when I was using it they still took their name seriously and had a panel data type), but it doesn’t seem like a groupby is necessary at all here. I think this is just
using StatsBase, DataFrames
countmap(first.(split.(string.(df."Sales Rep"))))
Pandas is different; I would not say “less user-friendly” though. Imho, Pandas is one of the less well-designed Python libraries, and it always confuses me.
Overall, for data prep I prefer both R’s data.table and the tidyverse packages to Pandas.
DataFrames is particularly nice if you want to program, i.e., write code without hard-coded column names, aggregation functions etc., as its design is very transparent and well integrated with base Julia, i.e., you just pass functions operating on vectors. R also has its issues in this respect due to non-standard evaluation.
In the end, all of these libraries have their strengths and weaknesses and it matters a lot what you are used to. I would not say that one is unambiguously better than the other though. The Julia ecosystem has improved a lot over the years and is well on par – at least in my opinion – with a solid and clean foundation of DataFrames (or even more general Tables) and several packages such as DataFramesMeta, DataFrameMacros, Query, Tidier etc for conveniently querying data.
I think this limitation (group by a column only) is DataFrames-specific. In Julia, you can use all kinds of arrays as tables, and that way it’s easy to group by a function of a column:
using FlexiGroups  # assumed source of group_vg and key
map(group_vg(r -> length(r.name) > 4, tbl)) do gr
    (longname=key(gr), count=length(gr))
end
This kind of table processing may be somewhat less uniformly documented, simply because you can use powerful generic functions from many packages instead of buying into a specific ecosystem. But it is very flexible and composable in Julia.