Groupby on an expression or a vector?

Hello,

I have the impression that this is not possible… not with Julia, anyway…

In Python, it’s quite easy: you can group the rows of a dataframe based on a condition (or even an arbitrary vector, as long as it has the same number of rows as the dataframe) and then perform any operation…

Let’s say an event occurs every week or so… I want to divide the events into two buckets: the ones where the event happened 6, 7, or 8 days later, and the rest… and then apply the nrow function to each group.

I can still add another column holding that information and do the groupby on it, but I find that clunky…

I can’t believe it’s not possible without adding a column…

Seems pretty easy to me…

julia> using DataFrames

julia> df = DataFrame(
          first=["Chris", "Mark", "Jeff", "Stefan"],
          last=["Zend", "Kittisopikul", "Bezanson", "Karpinski"]
       )
4×2 DataFrame
 Row │ first   last
     │ String  String
─────┼──────────────────────
   1 │ Chris   Zend
   2 │ Mark    Kittisopikul
   3 │ Jeff    Bezanson
   4 │ Stefan  Karpinski

julia> df[Bool[1,0,1,1], :]
3×2 DataFrame
 Row │ first   last
     │ String  String
─────┼───────────────────
   1 │ Chris   Zend
   2 │ Jeff    Bezanson
   3 │ Stefan  Karpinski

julia> df[contains.(df.first, 'e'), :]
2×2 DataFrame
 Row │ first   last
     │ String  String
─────┼───────────────────
   1 │ Jeff    Bezanson
   2 │ Stefan  Karpinski

julia> df[contains.(df.first, 'a'), :]
2×2 DataFrame
 Row │ first   last
     │ String  String
─────┼──────────────────────
   1 │ Mark    Kittisopikul
   2 │ Stefan  Karpinski

julia> df[contains.(df.first, 'a'), :] |>
           df->df.first .* " " .* df.last
2-element Vector{String}:
 "Mark Kittisopikul"
 "Stefan Karpinski"

For more advanced operations see these packages:

Thank you. :slightly_smiling_face:

My question was specifically whether it is possible to perform a groupby based on a condition, though…
To obtain two GroupedDataFrames, depending on whether the result is true or false…

Like this?

julia> using TidierData

julia> @chain df begin
           @group_by(gb=contains(first,'a'))
           @mutate(
               first=lowercase(first),
               last=uppercase(last)
           )
       end
GroupedDataFrame with 2 groups based on key: gb
First Group (2 rows): gb = false
 Row │ first   last      gb
     │ String  String    Bool
─────┼────────────────────────
   1 │ chris   ZEND      false
   2 │ jeff    BEZANSON  false
⋮
Last Group (2 rows): gb = true
 Row │ first   last          gb
     │ String  String        Bool
─────┼───────────────────────────
   1 │ mark    KITTISOPIKUL  true
   2 │ stefan  KARPINSKI     true

So I think what you’re asking for is basically groupby(df, :column => function), which doesn’t exist and probably won’t exist in DataFrames, but might in a convenience package. See my issue here: `groupby` derived columns, Issue #392, JuliaData/DataFramesMeta.jl on GitHub

DataFrameMacros has similar functionality:

https://jkrumbiegel.com/DataFrameMacros.jl/stable/#@groupby

See the @groupby(df, :evenheight = iseven(:height)) example, which creates an :evenheight column on the fly.

In base Julia/DataFrames you can do something like:

julia> using DataFrames, Dates

julia> df = DataFrame(event_time = rand(Date(2024):Day(1):Date(2025), 100), value = rand(100));

julia> (df[(op).(dayofweek.(df.event_time), 4), :] for op ∈ (<, >)) .|> (x -> combine(groupby(x, :event_time), nrow, :value => sum))
2-element Vector{DataFrame}:
(...)

which I think is what you want?


Yes, but without an added column… :slightly_smiling_face:

Not quite… You don’t have two GroupedDataFrames… :neutral_face:

But I guess it’s not possible with Julia’s DataFrames… Thank you, anyway :slightly_smiling_face:

But that’s just because I called combine?

julia> (df[(op).(dayofweek.(df.event_time), 4), :] for op ∈ (<, >)) .|> (x -> groupby(x, :event_time))
2-element Vector{GroupedDataFrame{DataFrame}}:

Maybe… I indeed get several GroupedDataFrames…

When I am not “New to Julia” anymore and understand your solution better, I will come back to your answer…

Thanks.

Sure, feel free to continue this thread. When you do, ideally include a minimal working example of what you’re getting in pandas and trying to recreate.

OP, you are right that “group a data frame by something that is not a persistent column in the data frame” is something that is not possible in DataFrames.jl and is unlikely to be added in the future.

The implementation of GroupedDataFrame relies on the grouping column being an existing, named column in the data frame. As mentioned above, DataFramesMeta.jl should probably add this feature: something that groups and transforms in a single step, dropping the grouping column after the transformation. But it’s a low priority because it’s not that hard to just make a column.
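
For reference, the “just make a column” route can be sketched as below. This is only a sketch under assumptions: the data, the column names `id` and `gap` (days since the previous event), and the key names `weekly` and `label` are all made up for illustration. The key column is built on a copy, so the original data frame is never modified.

```julia
using DataFrames

# Hypothetical data: `gap` is the number of days since the previous event.
df = DataFrame(id = 1:6, gap = [7, 3, 6, 10, 8, 7])

# `transform` returns a new data frame, so the Bool key `weekly`
# (true when the gap is 6, 7 or 8 days) never touches `df` itself.
gd = groupby(transform(df, :gap => ByRow(in(6:8)) => :weekly), :weekly)

# One GroupedDataFrame with two groups; count the rows in each.
counts = combine(gd, nrow)

# A single group can be pulled out by its key value.
weekly_rows = gd[(weekly = true,)]

# Grouping by an arbitrary vector of matching length works the same way,
# here with the non-mutating `insertcols`.
labels = ["a", "b", "a", "b", "a", "a"]
by_label = combine(groupby(insertcols(df, :label => labels), :label), nrow)
```

Passing `keepkeys = false` to `combine` drops the key column from the result when it is not wanted there.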


Well, in pandas, you can do things like this:
df.groupby(df['Sales Rep'].str.split(' ').str[0]).size()
This counts the number of people with the same first name.
You can also use pd.Grouper, which makes it easy to resample a DataFrame on a column of dates:
df.groupby(pd.Grouper(key = 'Date', freq = 'Q')).size()
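
A rough DataFrames.jl counterpart of these two pandas calls, as a sketch only: the data is invented, the column names "Sales Rep" and "Date" are taken from the pandas snippets, and the key names `first_name` and `quarter` are made up here. Each key is computed on a copy with `transform` and then used for grouping.

```julia
using DataFrames, Dates

# Hypothetical data using the column names from the pandas snippets.
df = DataFrame(
    "Sales Rep" => ["Ann Lee", "Bob Ray", "Ann Kim", "Cy Poe"],
    "Date" => Date.(["2024-01-15", "2024-02-20", "2024-05-03", "2024-11-30"]),
)

# Count rows per first name (the first whitespace-separated word).
with_name = transform(df, "Sales Rep" => ByRow(s -> first(split(s))) => :first_name)
by_name = combine(groupby(with_name, :first_name), nrow)

# Count rows per calendar quarter, keyed by the first day of the quarter.
with_quarter = transform(df, "Date" => ByRow(firstdayofquarter) => :quarter)
by_quarter = combine(groupby(with_quarter, :quarter), nrow)
```

In this version, quarters that contain no rows simply do not appear in `by_quarter`.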

I admit I’m interested in Julia for its performance, but it’s less user-friendly than Python with pandas…

I would highly encourage you to check out DataFramesMeta.jl as a way to use nicer Julia syntax for data cleaning operations.

Perhaps we don’t have the exact feature you want… but on the whole I think the syntax for data cleaning is a lot nicer in Julia than in pandas.

This seems pretty nice, to be honest (aside from the extra :g column; avoiding it is the feature you want).

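using DataFramesMeta, Statistics  # Statistics provides mean

@chain df begin
   # :g is the number of words in each "Sales Rep" name (row-wise)
   @rtransform :g = length(split($"Sales Rep", " "))
   # group by :g and average the (assumed) :sales column per group
   @by :g :mean_sales = mean(:sales)
end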

Hm, I don’t know pandas that well (anymore; back when I was using it they still took their name seriously and had a panel data type :smiley: ), but it doesn’t seem like a groupby is necessary at all here. I think this is just

using StatsBase, DataFrames

countmap(first.(split.(string.(df."Sales Rep"))))

I will take a look at DataFramesMeta.jl…
Thank you :slightly_smiling_face:

Interesting, this countmap… :slightly_smiling_face:

Pandas is different; I would not say “less user-friendly”, though. IMHO, pandas is one of the less well-designed Python libraries, and it always confuses me.
Overall, for data prep I prefer both R’s data.table and the tidyverse packages to Pandas.
DataFrames is particularly nice if you want to program, i.e., write code without hard-coded column names, aggregation functions, etc., as its design is very transparent and well integrated with base Julia, i.e., you just pass functions operating on vectors. R also has its issues in this respect due to non-standard evaluation.
In the end, all of these libraries have their strengths and weaknesses, and it matters a lot what you are used to. I would not say that one is unambiguously better than the other, though. The Julia ecosystem has improved a lot over the years and is, at least in my opinion, well on par, with a solid and clean foundation in DataFrames (or, even more generally, Tables) and several packages such as DataFramesMeta, DataFrameMacros, Query, Tidier, etc. for conveniently querying data.


If I understand the question correctly, you can quite easily do this with Query.jl:

using DataFrames, Query

df = DataFrame(
  name=["John", "Sally", "Somethinglonger"],
  age=[23.,42.,92],
  children=[2,3,5]
)

df |>
  @groupby(length(_.name)>4) |>
  @map({longname=key(_), count=length(_)}) |>
  DataFrame

More docs are in the “Standalone Query Commands” section of the Query.jl documentation.

Thank you, I will take a look… :slightly_smiling_face:

I think this limitation (group by a column only) is DataFrames-specific. In Julia, you can use all kinds of arrays as tables, and that way it’s easy to group by a function of a column:

julia> using StructArrays, DataManipulation

julia> tbl = StructArray(
         name=["John", "Sally", "Somethinglonger"],
         age=[23.,42.,92],
         children=[2,3,5]
       )

julia> @p tbl |>
           group_vg(length(_.name) > 4) |>
           map((longname=key(_), count=length(_)))
2-element Vector{@NamedTuple{longname::Bool, count::Int64}}:
 (longname = 0, count = 1)
 (longname = 1, count = 2)

Even without any macros it reads reasonably nicely:

map(group_vg(r -> length(r.name) > 4, tbl)) do gr
    (longname=key(gr), count=length(gr))
end

This kind of table processing may be somewhat less uniformly documented, simply because you can use powerful generic functions from many packages instead of buying into a specific ecosystem. But it is very flexible and composable in Julia.
