Groupby on an expression or a vector?

Chriz_Zend · June 10, 2024, 4:33am

Hello,

I have the impression that this is not possible… not with Julia, anyway…

In Python, it’s quite easy: you can group the lines of a dataframe based on a condition (or even a random vector as long as it has the same number of rows as the dataframe) and then perform any operation…

Let’s say an event occurs every week or so… I want to divide the events into two buckets: those where the next event happened 6, 7 or 8 days later, and the rest… And then apply the nrow function to each group.

I can still add another column where I put the information and do the groupby but I find it corny…

I can’t believe it’s not possible without adding a column…

mkitti · June 10, 2024, 4:57am

Seems pretty easy to me…

julia> using DataFrames

julia> df = DataFrame(
          first=["Chris", "Mark", "Jeff", "Stefan"],
          last=["Zend", "Kittisopikul", "Bezanson", "Karpinski"]
       )
4×2 DataFrame
 Row │ first   last
     │ String  String
─────┼──────────────────────
   1 │ Chris   Zend
   2 │ Mark    Kittisopikul
   3 │ Jeff    Bezanson
   4 │ Stefan  Karpinski

julia> df[Bool[1,0,1,1], :]
3×2 DataFrame
 Row │ first   last
     │ String  String
─────┼───────────────────
   1 │ Chris   Zend
   2 │ Jeff    Bezanson
   3 │ Stefan  Karpinski

julia> df[contains.(df.first, 'e'), :]
2×2 DataFrame
 Row │ first   last
     │ String  String
─────┼───────────────────
   1 │ Jeff    Bezanson
   2 │ Stefan  Karpinski

julia> df[contains.(df.first, 'a'), :]
2×2 DataFrame
 Row │ first   last
     │ String  String
─────┼──────────────────────
   1 │ Mark    Kittisopikul
   2 │ Stefan  Karpinski

julia> df[contains.(df.first, 'a'), :] |>
           df->df.first .* " " .* df.last
2-element Vector{String}:
 "Mark Kittisopikul"
 "Stefan Karpinski"

For more advanced operations see these packages:

Chriz_Zend · June 10, 2024, 5:53am

Thank you.

My question was specifically if it is possible to perform a Groupby based on a condition, though…
To obtain two GroupedDataFrames, if the result is true or false…

mkitti · June 10, 2024, 6:22am

Like this?

julia> using TidierData

julia> @chain df begin
           @group_by(gb=contains(first,'a'))
           @mutate(
               first=lowercase(first),
               last=uppercase(last)
           )
       end
GroupedDataFrame with 2 groups based on key: gb
First Group (2 rows): gb = false
 Row │ first   last      gb
     │ String  String    Bool
─────┼─────────────────────────
   1 │ chris   ZEND      false
   2 │ jeff    BEZANSON  false
⋮
Last Group (2 rows): gb = true
 Row │ first   last          gb
     │ String  String        Bool
─────┼────────────────────────────
   1 │ mark    KITTISOPIKUL  true
   2 │ stefan  KARPINSKI     true

nilshg · June 10, 2024, 8:10am

So I think what you’re asking for is basically groupby(df, :column => function) which doesn’t exist and probably won’t exist in DataFrames, but maybe in a convenience package, see my issue here: `groupby` derived columns · Issue #392 · JuliaData/DataFramesMeta.jl · GitHub

DataFrameMacros has similar functionality here:

https://jkrumbiegel.com/DataFrameMacros.jl/stable/#@groupby

see the @groupby(df, :evenheight = iseven(:height)) example, which creates an :evenheight column on the fly.

In base Julia/DataFrames you can do something like:

julia> using DataFrames, Dates

julia> df = DataFrame(event_time = rand(Date(2024):Day(1):Date(2025), 100), value = rand(100));

julia> (df[(op).(dayofweek.(df.event_time), 4), :] for op ∈ (<, >)) .|> (x -> combine(groupby(x, :event_time), nrow, :value => sum))
2-element Vector{DataFrame}:
(...)

which I think is what you want?

Chriz_Zend · June 10, 2024, 8:25am

Yes, but without an added column…

Chriz_Zend · June 10, 2024, 8:53am

Not quite… You don’t have two GroupedDataFrames…

But I guess it’s not possible with Julia’s DataFrames… Thank you, anyway

nilshg · June 10, 2024, 8:57am

But that’s just because I called combine?

julia> (df[(op).(dayofweek.(df.event_time), 4), :] for op ∈ (<, >)) .|> (x -> groupby(x, :event_time))
2-element Vector{GroupedDataFrame{DataFrame}}:

Chriz_Zend · June 10, 2024, 10:08am

Maybe… I indeed get several GroupedDataFrames…

When I am not “New to Julia” anymore and understand your solution better, I will come back to your answer…

Thanks.

nilshg · June 10, 2024, 12:39pm

Sure, feel free to continue this thread, and when you do ideally with a minimal working example of what you’re getting in pandas that you’re trying to recreate.

pdeffebach · June 10, 2024, 2:14pm

OP, you are right that “group a data frame by something that is not a persistent column in the data frame” is something that is not possible in DataFrames.jl and is unlikely to be added in the future.

The implementation of GroupedDataFrame relies on the grouping column being an existing, named column in the data frame. As mentioned above, DataFramesMeta.jl should probably add this feature, something that groups and transforms in a single step, dropping the grouping column after the transformation. But it’s a low priority because it’s not that hard to just make a column.
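
As a rough sketch of the "just make a column" route for the event-gap case from the first post: the dates below are made up, and the names `events`, `gaps`, `gap_days` and `weekly` are hypothetical. Since `transform` returns a new data frame, the helper column only lives on the copy that is handed to `groupby`.

```julia
using DataFrames, Dates

# Made-up event dates, roughly one per week (illustrative data only).
events = DataFrame(event_time = Date(2024, 1, 1) .+ Day.([0, 7, 13, 23, 30, 38]))

# Gap in days between each event and the next one; the last event has no successor.
gaps = DataFrame(
    event_time = events.event_time[1:end-1],
    gap_days = Dates.value.(diff(events.event_time)),
)

# transform returns a new DataFrame, so `gaps` itself keeps its two columns.
gdf = groupby(
    transform(gaps, :gap_days => ByRow(d -> d in 6:8) => :weekly),
    :weekly;
    sort = true,
)

# One row per bucket, with the row count in a column named `nrow`.
counts = combine(gdf, nrow)
```

Here `gdf` is a single GroupedDataFrame with two groups (`weekly = false` and `weekly = true`), and `gaps` is left without the `weekly` column.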

Chriz_Zend · June 10, 2024, 3:49pm

Well, in pandas, you can do such things:
df.groupby(df['Sales Rep'].str.split(' ').str[0]).size()
which counts the number of people with the same first name
or use pd.Grouper, which makes it easy to resample a DataFrame on a column of dates:
df.groupby(pd.Grouper(key = 'Date', freq = 'Q')).size()

I admit I’m interested in Julia for its performance, but it’s less user-friendly than Python-Pandas…

pdeffebach · June 10, 2024, 4:01pm

I would highly encourage you to check out DataFramesMeta.jl as a way to use nicer julia syntax for data cleaning operations.

Perhaps we don’t have the exact feature you want… but on the whole I think the syntax for data cleaning is a lot nicer in Julia than in pandas.

This seems pretty nice, to be honest (aside from the duplicate :g, which is the feature you want).

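using DataFramesMeta, Statistics  # Statistics provides mean

@chain df begin
    # :g is the number of space-separated words in "Sales Rep"
    @rtransform :g = length(split($"Sales Rep", " "))
    # mean of :sales within each value of :g
    @by :g :mean_sales = mean(:sales)
end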

nilshg · June 10, 2024, 4:09pm

Hm, I don’t know pandas that well (anymore - back when I was using it they still took their name seriously and had a panel data type) but it doesn’t seem like a groupby is necessary at all here, I think this is just

using StatsBase, DataFrames

countmap(first.(split.(string.(df."Sales Rep"))))
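
The quarterly `pd.Grouper` count from the pandas post can be approached in the same spirit, without a groupby. A minimal sketch using only the Dates standard library; the `dates` vector is made-up data standing in for the 'Date' column:

```julia
using Dates

# Made-up dates standing in for the 'Date' column (illustrative only).
dates = [Date(2024, 1, 15), Date(2024, 2, 3), Date(2024, 5, 20), Date(2024, 11, 30)]

# Count rows per quarter, keyed by the first day of each quarter.
counts = Dict{Date,Int}()
for d in dates
    q = firstdayofquarter(d)
    counts[q] = get(counts, q, 0) + 1
end
```

With StatsBase loaded, `countmap(firstdayofquarter.(dates))` should give the same counts in one line.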

Chriz_Zend · June 10, 2024, 4:46pm

I will take a look at this DataFramesMeta.jl…
Thank you

Chriz_Zend · June 10, 2024, 4:49pm

Interesting, this countmap…

bertschi · June 10, 2024, 4:58pm

Pandas is different, I would not say “less user-friendly” though. Imho, Pandas is one of the less well designed Python libraries which always confuses me.
Overall, for data prep I prefer both R’s data.table and the tidyverse packages to Pandas.
DataFrames is particularly nice if you want to program, i.e., write code without hard-coded column names, aggregation functions etc., as its design is very transparent and well integrated with base Julia, i.e., just pass functions operating on vectors. R also has its issues in this respect due to non-standard evaluation.
In the end, all of these libraries have their strengths and weaknesses and it matters a lot what you are used to. I would not say that one is unambiguously better than the other though. The Julia ecosystem has improved a lot over the years and is well on par – at least in my opinion – with a solid and clean foundation of DataFrames (or even more general Tables) and several packages such as DataFramesMeta, DataFrameMacros, Query, Tidier etc for conveniently querying data.

davidanthoff · June 10, 2024, 5:04pm

If I understand the question correctly, you can quite easily do this with Query.jl:

using DataFrames, Query

df = DataFrame(
  name=["John", "Sally", "Somethinglonger"],
  age=[23.,42.,92],
  children=[2,3,5]
)

df |>
  @groupby(length(_.name)>4) |>
  @map({longname=key(_), count=length(_)}) |>
  DataFrame

More docs at Standalone Query Commands · Query.jl.

Chriz_Zend · June 10, 2024, 5:23pm

Thank you, I will take a look…

aplavin · June 10, 2024, 6:10pm

I think this limitation (group by a column only) is DataFrames-specific. In Julia, you can use all kinds of arrays as tables, and that way it’s easy to group by a function of a column:

julia> using StructArrays, DataManipulation

julia> tbl = StructArray(
         name=["John", "Sally", "Somethinglonger"],
         age=[23.,42.,92],
         children=[2,3,5]
       )

julia> @p tbl |>
           group_vg(length(_.name) > 4) |>
           map((longname=key(_), count=length(_)))
2-element Vector{@NamedTuple{longname::Bool, count::Int64}}:
 (longname = 0, count = 1)
 (longname = 1, count = 2)

Even without any macros it reads reasonably nice:

map(group_vg(r -> length(r.name) > 4, tbl)) do gr
    (longname=key(gr), count=length(gr))
end

This kind of table processing may be somewhat less uniformly documented, simply because you can use powerful generic functions from many packages instead of buying into a specific ecosystem. But it is very flexible and composable in Julia.
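
To illustrate the "any array can be a table" idea without extra packages, here is a minimal base-Julia sketch that groups a vector of NamedTuples by a function of a field. `group_rows` is a hypothetical helper written for this example, not a function from any of the packages mentioned in this thread.

```julia
# A plain vector of NamedTuples acting as a table.
tbl = [
    (name = "John", age = 23.0, children = 2),
    (name = "Sally", age = 42.0, children = 3),
    (name = "Somethinglonger", age = 92.0, children = 5),
]

# Hypothetical helper: map each key f(row) to the vector of rows that produce it.
function group_rows(f, rows)
    T = eltype(rows)
    groups = Dict{Any,Vector{T}}()
    for r in rows
        # get! creates the empty bucket the first time a key is seen.
        push!(get!(() -> T[], groups, f(r)), r)
    end
    return groups
end

groups = group_rows(r -> length(r.name) > 4, tbl)

# One summary row per group; Dict iteration order is not guaranteed.
summary = [(longname = k, count = length(v)) for (k, v) in groups]
```

Nothing is added to `tbl` here; the grouping key only exists as a key of `groups`.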
