How to create a new DF with only the rows that contain specific values from two different columns

cocarroll1 · September 15, 2024, 1:29pm

I want to create a new DF (e.g. df_2) based on an existing one (e.g. df_1) which has 9 columns and multiple rows of data. In this new df_2 I want to have all the data from each row, only when the values from 1 column (e.g. :AMT) equal 10, and the values from a second column equal “NORMAL”

drizk1 · September 15, 2024, 1:40pm

with TidierData.jl you could do the following

using TidierData

julia> df_1 = DataFrame(
           ID = 1:7,
           Name = ["Alice", "Bob", "Charlie", "David", "Eva", "Frank", "Grace"],
           AMT = [10, 20, 10, 15, 10, 10, 25],
           Status = ["NORMAL", "ABNORMAL", "NORMAL", "NORMAL", "NORMAL", "ABNORMAL", "NORMAL"],
           Score = [95, 85, 92, 88, 76, 89, 91],
           Date = ["2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01", "2024-05-01", "2024-06-01", "2024-07-01"],
           Category = ["A", "B", "A", "C", "B", "A", "C"],
           Amount = [1000, 1500, 1000, 2000, 1000, 1500, 2000],
           Type = ["X", "Y", "X", "Y", "X", "Y", "X"]
       );

julia> df_2 = @filter(df_1, AMT == 10 && Status == "NORMAL")
3×9 DataFrame
 Row │ ID     Name     AMT    Status  Score  Date        Category  Amount  Type   
     │ Int64  String   Int64  String  Int64  String      String    Int64   String 
─────┼────────────────────────────────────────────────────────────────────────────
   1 │     1  Alice       10  NORMAL     95  2024-01-01  A           1000  X
   2 │     3  Charlie     10  NORMAL     92  2024-03-01  A           1000  X
   3 │     5  Eva         10  NORMAL     76  2024-05-01  B           1000  X

Not shown here is TidierData a @chain with multiple sequential manipulations

@chain df_1 begin
     @filter( AMT == 10 && Status == "NORMAL")
# other data manipulations
end

cocarroll1 · September 15, 2024, 2:08pm

When I tried this

julia> df_2 = @filter(df_1, AMT == 10 && Status == "NORMAL")

I get an error that @filter is undefined. As an aside, why no colon before AMT i.e. :AMT not AMT and the same for Status

TimG · September 15, 2024, 2:09pm

Or just use subset()

using DataFrames
df_1 = DataFrame(
           ID = 1:7,
           Name = ["Alice", "Bob", "Charlie", "David", "Eva", "Frank", "Grace"],
           AMT = [10, 20, 10, 15, 10, 10, 25],
           Status = ["NORMAL", "ABNORMAL", "NORMAL", "NORMAL", "NORMAL", "ABNORMAL", "NORMAL"],
           Score = [95, 85, 92, 88, 76, 89, 91],
           Date = ["2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01", "2024-05-01", "2024-06-01", "2024-07-01"],
           Category = ["A", "B", "A", "C", "B", "A", "C"],
           Amount = [1000, 1500, 1000, 2000, 1000, 1500, 2000],
           Type = ["X", "Y", "X", "Y", "X", "Y", "X"]
       )
df_2 = subset(df_1, [:AMT, :Status] => ByRow((x, y) -> x==10 && y == "NORMAL"))

3×9 DataFrame
 Row │ ID     Name     AMT    Status  Score  Date        Category  Amount  Type   
     │ Int64  String   Int64  String  Int64  String      String    Int64   String 
─────┼────────────────────────────────────────────────────────────────────────────
   1 │     1  Alice       10  NORMAL     95  2024-01-01  A           1000  X
   2 │     3  Charlie     10  NORMAL     92  2024-03-01  A           1000  X
   3 │     5  Eva         10  NORMAL     76  2024-05-01  B           1000  X

drizk1 · September 15, 2024, 2:25pm

I am not sure why you might be getting that error. If you have previously loaded another package that has @filter you may need to preface it as TidierData.@filter if there is a namespace conflict

but otherwise

using TidierData

should enable it.

As for the lack of :, TidierData enables the use of bare column names throughout all of its macros allowing for a clean interface.

This brief explanation talks a little bit about some of the design choices in TidierData.jl

cocarroll1 · September 15, 2024, 2:28pm

Ok, that explains it I hadn’t loaded that package. works now thanks

Topic		Replies	Views
Filtering DataFrame based on two conditions General Usage dataframes	1	328	October 25, 2022
Is it possible to use filter(:col => myFilter, df) with additional constant input variables? New to Julia question , dataframes	12	795	January 5, 2022
DataFrames: obtaining the subset of rows by a set of values New to Julia dataframes	45	24043	April 27, 2024
Dataframe Filter New to Julia question , dataframes	6	6372	March 26, 2022
DataFrame Filtering New to Julia question , dataframes	10	1121	September 20, 2023

How to create a new DF with only the rows that contain specific values from two different columns

Related topics