ab2z
March 31, 2022, 9:38am
1
Hi all,
I am new in Julia so I would like to start with a simple question.
In testing the migration activities from old to a new system/setting/etc, we mostly need to compare 2 output data sets and extract the deltas.
Question: what is the best command for comparing 2 datasets - something like proc compare in SAS ?
Thanks+regards
nilshg
March 31, 2022, 10:37am
2
It would be helpful if you could provide a bit more detail on what you want to happen - Iβm not familiar with SAS. Something like:
julia> using DataFrames
julia> df1 = DataFrame(rand(1:5, 10, 5), :auto); df2 = DataFrame(rand(1:5, 10, 5), :auto)
10Γ5 DataFrame
Row β x1 x2 x3 x4 x5
β Int64 Int64 Int64 Int64 Int64
ββββββΌβββββββββββββββββββββββββββββββββββ
1 β 4 5 1 3 3
2 β 1 4 1 3 1
3 β 5 2 3 4 3
4 β 2 2 3 3 5
5 β 1 4 4 2 1
6 β 4 3 1 2 1
7 β 5 3 5 1 1
8 β 4 3 3 4 3
9 β 4 2 4 3 2
10 β 1 4 4 2 2
julia> df1 .== df2
10Γ5 DataFrame
Row β x1 x2 x3 x4 x5
β Bool Bool Bool Bool Bool
ββββββΌβββββββββββββββββββββββββββββββββββ
1 β false true true true false
2 β false false true true true
3 β false true false false false
4 β false true false false false
5 β false true false false true
6 β false false true true false
7 β false false false false false
8 β true false false true false
9 β false false false false false
10 β false true false false false
julia> df1 .- df2
10Γ5 DataFrame
Row β x1 x2 x3 x4 x5
β Int64 Int64 Int64 Int64 Int64
ββββββΌβββββββββββββββββββββββββββββββββββ
1 β -2 0 0 0 -2
2 β 3 -2 0 0 0
3 β -3 0 -2 1 1
4 β 3 0 1 -2 -2
5 β 2 0 -2 -1 0
6 β -2 2 0 0 2
7 β -2 1 -4 4 4
8 β 0 2 -1 0 -1
9 β -1 -1 -1 -1 2
10 β 1 0 -2 2 1
1 Like
Assuming you are talking about arrays or simple tables, a solution is to use DeepDiffs.jl
:
julia> tbl_x = [(a=1, b=2), (a=3, b=4)]
julia> tbl_y = [(a=1, b=3), (a=3, b=4)]
julia> using Tables, DeepDiffs
julia> map(deepdiff, tbl_x |> columntable, tbl_y |> columntable)
gives the following colored output:
2 Likes
ab2z
March 31, 2022, 9:06pm
4
I have the following example
old=Dataset(Insurance_Id=[1,2,3],Business_Id=[10,20,30],Amount=[100,200,300])
new=Dataset(Ins_Id=[1,3,2,4,3,2],B_Id=[10,40,30,40,30,20],AMT=[100,200,300,-500,350,700])
The combination of insurance_id*Business_Id gives a unique id.
The delta is on the amount absolute difference greater than 50.
The only matches here are (insurance_id,Business_Id) in {(1,10),(3,30)} so I expect to have all other combinations.
My initial thought is to use innerjoin on [:Insurance_Id => Ins_Id, :Business_Id=>:B_Id]
(using DataFrames.jl) and then compare the two amounts using filter function that are implemented in TableOperations.jl
or other table packages.
(I think you would like to post this in data
topic, as this is something related to table operation, but one function to compare two tables with different shapes seems no exist right now)
Please include code inside of backticks (```) so that itβs easier to read. Hereβs how to do what Frankiewaang suggested:
using DataFrames
old = DataFrame(Insurance_Id = [1,2,3], Business_Id = [10,20,30], Amount = [100,200,300])
new = DataFrame(Ins_Id = [1,3,2,4,3,2], B_Id = [10,40,30,40,30,20], AMT = [100,200,300,-500,350,700])
joined = innerjoin(old, new, on=[:Insurance_Id => :Ins_Id, :Business_Id => :B_Id])
diffs = filter(row -> abs(row.Amount - row.AMT) > 50, joined)
1Γ4 DataFrame
Row β Insurance_Id Business_Id Amount AMT
β Int64 Int64 Int64 Int64
ββββββΌββββββββββββββββββββββββββββββββββββββββββ
1 β 2 20 200 700
1 Like
ab2z
April 5, 2022, 8:38am
7
I can not edit my post and put the code in backticks. maybe next time.