DataFramesMeta custom filter: by groups of A, apply filter on B

floswald · May 24, 2019, 8:13am

I need to subset a dataframe by groups. that is, in each group there is a different filtering condition to apply. here is a MWE

using DataFrames, DataFramesMeta

julia> df = DataFrame(A = repeat(1:3,3), B = rand(1:9,9),C = rand(9))
9×3 DataFrame
│ Row │ A     │ B     │ C        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.676259 │
│ 2   │ 2     │ 5     │ 0.675797 │
│ 3   │ 3     │ 9     │ 0.597738 │
│ 4   │ 1     │ 6     │ 0.593364 │
│ 5   │ 2     │ 2     │ 0.156473 │
│ 6   │ 3     │ 7     │ 0.859696 │
│ 7   │ 1     │ 9     │ 0.951905 │
│ 8   │ 2     │ 6     │ 0.196368 │
│ 9   │ 3     │ 1     │ 0.292369 │

Suppose for each group of A I want to discard the row where B is largest within that group:


julia> @linq df |>
           groupby(:A) |>
               where( :B != maximum(:B) )
GroupedDataFrame with 3 groups based on key: A
First Group (3 rows): A = 1
│ Row │ A     │ B     │ C        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.676259 │
│ 2   │ 1     │ 6     │ 0.593364 │
│ 3   │ 1     │ 9     │ 0.951905 │
⋮
Last Group (3 rows): A = 3
│ Row │ A     │ B     │ C        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 3     │ 9     │ 0.597738 │
│ 2   │ 3     │ 7     │ 0.859696 │
│ 3   │ 3     │ 1     │ 0.292369 │

that’s not it. let me try creating a grouped dataframe and work on that:



julia> g = @linq df |>
           groupby(:A)

julia> filter(x -> (x[:B] < 9),g[1])
2×3 DataFrame
│ Row │ A     │ B     │ C        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.676259 │
│ 2   │ 1     │ 6     │ 0.593364 │

that works! of course I cannot hardcode 9, so i must find out what the largest value is:


julia> filter(x -> (x[:B] < maximum(x[:B])),g[1])
0×3 DataFrame

julia> filter(x -> (x[:B] .< maximum(x[:B])),g[1])
0×3 DataFrame

what’s the mistake I’m making? thanks

nalimilan · May 25, 2019, 3:02pm

where filters groups, not rows. So it can’t used for what you’re trying to do. That should probably change, please file an issue in DataFramesMeta.

For now you can do this:

map(gd) do d
    d[d.B .< maximum(d.B), :]
end

(filter inside the do will also work, but you need to store maximum(d.B) to avoid recomputing it for each row.)

floswald · May 27, 2019, 5:48pm

thanks! thats pretty cool with the map actually (and makes a lot of sense, now that i think about it).

affans · May 27, 2019, 6:26pm

Does maximum(d.B inside the map run the function multiple times? Is the compiler smart enough to figure out not to re-run the function over and over again? At a deeper level, will this get cached in the CPU so that multiple calls are negligible?

nalimilan · May 27, 2019, 6:44pm

maximum is called once for each group, so that should be reasonably efficient. It could be made a bit faster by returning a view (using view) instead of a new data frame.

Topic		Replies	Views
Filter doesn't work on grouped dataframe General Usage dataframes	5	1558	February 4, 2022
Group Selection by Condition in DataFramesMeta.jl Data dataframes , dataframesmeta	1	359	April 21, 2022
Filter rows after a perticular value in column New to Julia dataframes	5	1121	April 26, 2022
How DataFramesMeta @by works if you want to groupby two columns? New to Julia dataframes , dataframesmeta	10	772	October 31, 2022
How to efficiently apply user-defined functions against grouped dataframes in `DataFrames.jl`? New to Julia dataframes	6	355	February 10, 2024

DataFramesMeta custom filter: by groups of A, apply filter on B

Related topics