I’m trying to filter the SubDataFrames with fewer than a given number of rows (nrow) out of a GroupedDataFrame, but filter!() doesn’t seem to accept the anonymous function I’m passing to it and throws a MethodError:
ERROR: LoadError: MethodError: no method matching filter!(::var"#2#3", ::GroupedDataFrame{DataFrame})
The function `filter!` exists, but no method is defined for this combination of argument types.
Here’s a MWE:
using DataFrames;
sections = [
1, 1, 1, 1,
2, 2, 2, 2, 2, 2,
3, 3, 3, 3, 3,
4, 4]
df = DataFrame(A=sections, B=0)
grouped_df = groupby(df, :A)
filter!(sub_df->(nrow(sub_df) <= 5), grouped_df)
filter is disfavored in recent DataFrames versions. Use subset instead.
julia> subset!(grouped_df, :A => ByRow(x -> x <= 5))
17×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 1 0
3 │ 1 0
4 │ 1 0
5 │ 2 0
6 │ 2 0
⋮ │ ⋮ ⋮
13 │ 3 0
14 │ 3 0
15 │ 3 0
16 │ 4 0
17 │ 4 0
6 rows omitted
This code seems to filter by row, but I’m trying to filter out entire SubDataFrames.
How could I do that?
Sorry, I misread the MWE.
julia> # Filter to keep only groups with 5 or fewer rows
filtered_gdf = filter(subdf -> nrow(subdf) <= 5, grouped_df)
GroupedDataFrame with 3 groups based on key: A
First Group (4 rows): A = 1
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 1 0
3 │ 1 0
4 │ 1 0
⋮
Last Group (2 rows): A = 4
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 4 0
2 │ 4 0
julia> # Or if you want a regular DataFrame back
filtered_df = filter(subdf -> nrow(subdf) <= 5, grouped_df, ungroup=true)
11×2 DataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 1 0
3 │ 1 0
4 │ 1 0
5 │ 3 0
6 │ 3 0
7 │ 3 0
8 │ 3 0
9 │ 3 0
10 │ 4 0
11 │ 4 0
Wait, that’s literally the same thing I tried at first, except it doesn’t use the mutating version…
I just tested it and yeah, filter!() throws an error while filter() works perfectly fine.
Do you know why that could be?
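I guess that’s down to the nature of the object being mutated: it is a GroupedDataFrame, not its constituent SubDataFrames, and as the MethodError says, no filter! method is defined for it. So we’ll have to assign the result back, for example with a comprehension (which yields a plain Vector of SubDataFrames rather than a GroupedDataFrame):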
julia> grouped_df = [g for g in grouped_df if nrow(g) <= 5]
3-element Vector{SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}}:
4×2 SubDataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 1 0
2 │ 1 0
3 │ 1 0
4 │ 1 0
5×2 SubDataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 3 0
2 │ 3 0
3 │ 3 0
4 │ 3 0
5 │ 3 0
2×2 SubDataFrame
Row │ A B
│ Int64 Int64
─────┼──────────────
1 │ 4 0
2 │ 4 0
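If you would rather keep a GroupedDataFrame after assigning back, a minimal sketch (assuming DataFrames.jl 1.x and the same data as the MWE) is to rebind the name to the result of the non-mutating filter. The name max_rows is introduced here only for illustration. The last line is an alternative using subset with a group-level condition instead of ByRow; it returns one Bool per row of each group, which should work regardless of whether your version also accepts a scalar there.

```julia
using DataFrames

sections = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4]
df = DataFrame(A=sections, B=0)
grouped_df = groupby(df, :A)

# Hypothetical threshold name, not from the original post
max_rows = 5

# Non-mutating filter returns a new GroupedDataFrame;
# rebinding the name is the "assign back" step
grouped_df = filter(sub_df -> nrow(sub_df) <= max_rows, grouped_df)

# Row count of each remaining group, as a regular DataFrame
group_sizes = combine(grouped_df, nrow)

# Alternative with subset: the function receives the whole :A column
# of one group and returns the same Bool for every row of that group
kept_df = subset(groupby(df, :A),
                 :A => a -> fill(length(a) <= max_rows, length(a)))
```

Here grouped_df stays a GroupedDataFrame, so it can still be passed to combine and friends, while kept_df is an ordinary DataFrame containing only the rows of the surviving groups.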