Regarding filtering/creating dataframe

Hello,

I have a dataframe where I want to filter and grab a subset of rows. In this example, I want to grab the rows for each of :B where :C is the smallest for :B:

df.A = rand(10)
df.B = repeat(["A","B","C","D","E"],2)
df.C = [1,2,3,1,2,3,1,2,3,1]
df
A B C
Float64 String Int64
1 0.376752 A 1
2 0.416886 B 2
3 0.371288 C 3
4 0.645706 D 1
5 0.925196 E 2
6 0.708123 A 3
7 0.118586 B 1
8 0.723375 C 2
9 0.243351 D 3
10 0.775513 E 1

And I want:

A B C
Float64 String Int64
1 0.376752 A 1
2 0.118586 B 1
3 0.723375 C 2
4 0.645706 D 1
5 0.775513 E 1

I tried creating an empty DataFrame to push!() to, but you can’t do that without specifying the column types first. My MVE is simple here, but I didn’t find it documented how I could create a new DataFrame based on the structure of an existing one without copying and deleting all the rows.

What would be a recommended technique here?

Here’s one way using the basic split-apply-combine methods for DataFrames:

df2 = by(df, :B) do subdf
    i = argmin(subdf.C)
    (A = subdf.A[i], C = subdf.C[i])
end

This rearranges the column order, but you can set it back by doing df2 = df2[[:A, :B, :C]] (if it matters to you).

3 Likes

Here is the Query.jl way of doing this:

df |> @groupby(_.B) |> @map(_[argmin(_.C)]) |> DataFrame

Note that the _ in the @map stands for each group. Each group here is a table, so first we extract the C column from that table with _.C, find the index of the smallest element in that column with argmin, and then index into the table to extract that full row with _[row_number]. So for each group @map returns one row, and then in turn constitutes a new table, which we materialize into a DataFrame.

If we had this the @map part could be written as @map(minimum(_, i->i.C)), which would be a bit more elegant.

1 Like

You can use similar(df, 0) to create a data frame with the same columns as df. With this PR it will even be possible to push rows to an empty data frame and have new columns be created automatically.

But in general the solutions proposed in other comments are probably better.

1 Like