I want to create a new DF (e.g. df_2) based on an existing one (e.g. df_1) which has 9 columns and multiple rows of data. In this new df_2 I want to have all the data from each row, only when the values from 1 column (e.g. :AMT) equal 10, and the values from a second column equal βNORMALβ
with TidierData.jl you could do the following
using TidierData
julia> df_1 = DataFrame(
ID = 1:7,
Name = ["Alice", "Bob", "Charlie", "David", "Eva", "Frank", "Grace"],
AMT = [10, 20, 10, 15, 10, 10, 25],
Status = ["NORMAL", "ABNORMAL", "NORMAL", "NORMAL", "NORMAL", "ABNORMAL", "NORMAL"],
Score = [95, 85, 92, 88, 76, 89, 91],
Date = ["2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01", "2024-05-01", "2024-06-01", "2024-07-01"],
Category = ["A", "B", "A", "C", "B", "A", "C"],
Amount = [1000, 1500, 1000, 2000, 1000, 1500, 2000],
Type = ["X", "Y", "X", "Y", "X", "Y", "X"]
);
julia> df_2 = @filter(df_1, AMT == 10 && Status == "NORMAL")
3Γ9 DataFrame
Row β ID Name AMT Status Score Date Category Amount Type
β Int64 String Int64 String Int64 String String Int64 String
ββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
1 β 1 Alice 10 NORMAL 95 2024-01-01 A 1000 X
2 β 3 Charlie 10 NORMAL 92 2024-03-01 A 1000 X
3 β 5 Eva 10 NORMAL 76 2024-05-01 B 1000 X
Not shown here is TidierData a @chain
with multiple sequential manipulations
@chain df_1 begin
@filter( AMT == 10 && Status == "NORMAL")
# other data manipulations
end
When I tried this
julia> df_2 = @filter(df_1, AMT == 10 && Status == "NORMAL")
I get an error that @filter is undefined. As an aside, why no colon before AMT i.e. :AMT not AMT and the same for Status
Or just use subset()
using DataFrames
df_1 = DataFrame(
ID = 1:7,
Name = ["Alice", "Bob", "Charlie", "David", "Eva", "Frank", "Grace"],
AMT = [10, 20, 10, 15, 10, 10, 25],
Status = ["NORMAL", "ABNORMAL", "NORMAL", "NORMAL", "NORMAL", "ABNORMAL", "NORMAL"],
Score = [95, 85, 92, 88, 76, 89, 91],
Date = ["2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01", "2024-05-01", "2024-06-01", "2024-07-01"],
Category = ["A", "B", "A", "C", "B", "A", "C"],
Amount = [1000, 1500, 1000, 2000, 1000, 1500, 2000],
Type = ["X", "Y", "X", "Y", "X", "Y", "X"]
)
df_2 = subset(df_1, [:AMT, :Status] => ByRow((x, y) -> x==10 && y == "NORMAL"))
3Γ9 DataFrame
Row β ID Name AMT Status Score Date Category Amount Type
β Int64 String Int64 String Int64 String String Int64 String
ββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
1 β 1 Alice 10 NORMAL 95 2024-01-01 A 1000 X
2 β 3 Charlie 10 NORMAL 92 2024-03-01 A 1000 X
3 β 5 Eva 10 NORMAL 76 2024-05-01 B 1000 X
I am not sure why you might be getting that error. If you have previously loaded another package that has @filter
you may need to preface it as TidierData.@filter
if there is a namespace conflict
but otherwise
using TidierData
should enable it.
As for the lack of :
, TidierData enables the use of bare column names throughout all of its macros allowing for a clean interface.
This brief explanation talks a little bit about some of the design choices in TidierData.jl
Ok, that explains it I hadnβt loaded that package. works now thanks