How to create a new DF with only the rows that contain specific values from two different columns

I want to create a new DF (e.g. df_2) based on an existing one (e.g. df_1) which has 9 columns and multiple rows of data. In this new df_2 I want to have all the data from each row, only when the values from 1 column (e.g. :AMT) equal 10, and the values from a second column equal β€œNORMAL”

with TidierData.jl you could do the following

using TidierData

julia> df_1 = DataFrame(
           ID = 1:7,
           Name = ["Alice", "Bob", "Charlie", "David", "Eva", "Frank", "Grace"],
           AMT = [10, 20, 10, 15, 10, 10, 25],
           Status = ["NORMAL", "ABNORMAL", "NORMAL", "NORMAL", "NORMAL", "ABNORMAL", "NORMAL"],
           Score = [95, 85, 92, 88, 76, 89, 91],
           Date = ["2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01", "2024-05-01", "2024-06-01", "2024-07-01"],
           Category = ["A", "B", "A", "C", "B", "A", "C"],
           Amount = [1000, 1500, 1000, 2000, 1000, 1500, 2000],
           Type = ["X", "Y", "X", "Y", "X", "Y", "X"]
       );

julia> df_2 = @filter(df_1, AMT == 10 && Status == "NORMAL")
3Γ—9 DataFrame
 Row β”‚ ID     Name     AMT    Status  Score  Date        Category  Amount  Type   
     β”‚ Int64  String   Int64  String  Int64  String      String    Int64   String 
─────┼────────────────────────────────────────────────────────────────────────────
   1 β”‚     1  Alice       10  NORMAL     95  2024-01-01  A           1000  X
   2 β”‚     3  Charlie     10  NORMAL     92  2024-03-01  A           1000  X
   3 β”‚     5  Eva         10  NORMAL     76  2024-05-01  B           1000  X

Not shown here is TidierData a @chain with multiple sequential manipulations

@chain df_1 begin
     @filter( AMT == 10 && Status == "NORMAL")
# other data manipulations
end
1 Like

When I tried this

julia> df_2 = @filter(df_1, AMT == 10 && Status == "NORMAL")

I get an error that @filter is undefined. As an aside, why no colon before AMT i.e. :AMT not AMT and the same for Status

Or just use subset()

using DataFrames
df_1 = DataFrame(
           ID = 1:7,
           Name = ["Alice", "Bob", "Charlie", "David", "Eva", "Frank", "Grace"],
           AMT = [10, 20, 10, 15, 10, 10, 25],
           Status = ["NORMAL", "ABNORMAL", "NORMAL", "NORMAL", "NORMAL", "ABNORMAL", "NORMAL"],
           Score = [95, 85, 92, 88, 76, 89, 91],
           Date = ["2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01", "2024-05-01", "2024-06-01", "2024-07-01"],
           Category = ["A", "B", "A", "C", "B", "A", "C"],
           Amount = [1000, 1500, 1000, 2000, 1000, 1500, 2000],
           Type = ["X", "Y", "X", "Y", "X", "Y", "X"]
       )
df_2 = subset(df_1, [:AMT, :Status] => ByRow((x, y) -> x==10 && y == "NORMAL"))
3Γ—9 DataFrame
 Row β”‚ ID     Name     AMT    Status  Score  Date        Category  Amount  Type   
     β”‚ Int64  String   Int64  String  Int64  String      String    Int64   String 
─────┼────────────────────────────────────────────────────────────────────────────
   1 β”‚     1  Alice       10  NORMAL     95  2024-01-01  A           1000  X
   2 β”‚     3  Charlie     10  NORMAL     92  2024-03-01  A           1000  X
   3 β”‚     5  Eva         10  NORMAL     76  2024-05-01  B           1000  X

I am not sure why you might be getting that error. If you have previously loaded another package that has @filter you may need to preface it as TidierData.@filter if there is a namespace conflict

but otherwise

using TidierData

should enable it.

As for the lack of :, TidierData enables the use of bare column names throughout all of its macros allowing for a clean interface.

This brief explanation talks a little bit about some of the design choices in TidierData.jl

2 Likes

Ok, that explains it I hadn’t loaded that package. works now thanks

2 Likes