DataFramesMeta custom filter: by groups of A, apply filter on B

I need to subset a dataframe by groups. that is, in each group there is a different filtering condition to apply. here is a MWE

using DataFrames, DataFramesMeta

julia> df = DataFrame(A = repeat(1:3,3), B = rand(1:9,9),C = rand(9))
9×3 DataFrame
│ Row │ A     │ B     │ C        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.676259 │
│ 2   │ 2     │ 5     │ 0.675797 │
│ 3   │ 3     │ 9     │ 0.597738 │
│ 4   │ 1     │ 6     │ 0.593364 │
│ 5   │ 2     │ 2     │ 0.156473 │
│ 6   │ 3     │ 7     │ 0.859696 │
│ 7   │ 1     │ 9     │ 0.951905 │
│ 8   │ 2     │ 6     │ 0.196368 │
│ 9   │ 3     │ 1     │ 0.292369 │

Suppose for each group of A I want to discard the row where B is largest within that group:


julia> @linq df |>
           groupby(:A) |>
               where( :B != maximum(:B) )
GroupedDataFrame with 3 groups based on key: A
First Group (3 rows): A = 1
│ Row │ A     │ B     │ C        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.676259 │
│ 2   │ 1     │ 6     │ 0.593364 │
│ 3   │ 1     │ 9     │ 0.951905 │
⋮
Last Group (3 rows): A = 3
│ Row │ A     │ B     │ C        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 3     │ 9     │ 0.597738 │
│ 2   │ 3     │ 7     │ 0.859696 │
│ 3   │ 3     │ 1     │ 0.292369 │

that’s not it. let me try creating a grouped dataframe and work on that:



julia> g = @linq df |>
           groupby(:A)

julia> filter(x -> (x[:B] < 9),g[1])
2×3 DataFrame
│ Row │ A     │ B     │ C        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.676259 │
│ 2   │ 1     │ 6     │ 0.593364 │

that works! of course I cannot hardcode 9, so i must find out what the largest value is:


julia> filter(x -> (x[:B] < maximum(x[:B])),g[1])
0×3 DataFrame

julia> filter(x -> (x[:B] .< maximum(x[:B])),g[1])
0×3 DataFrame

what’s the mistake I’m making? thanks

1 Like

where filters groups, not rows. So it can’t used for what you’re trying to do. That should probably change, please file an issue in DataFramesMeta.

For now you can do this:

map(gd) do d
    d[d.B .< maximum(d.B), :]
end

(filter inside the do will also work, but you need to store maximum(d.B) to avoid recomputing it for each row.)

1 Like

thanks! thats pretty cool with the map actually (and makes a lot of sense, now that i think about it).

Does maximum(d.B inside the map run the function multiple times? Is the compiler smart enough to figure out not to re-run the function over and over again? At a deeper level, will this get cached in the CPU so that multiple calls are negligible?

maximum is called once for each group, so that should be reasonably efficient. It could be made a bit faster by returning a view (using view) instead of a new data frame.