Compare 2 data sets - similar to SAS proc compare

Question: what is the best command for comparing 2 datasets - something like proc compare in SAS ?
I could not edit this Comparison 2 data sets , So I decide to open this new topic.

I have the following example

``````old=Dataset(Insurance_Id=[1,2,3,5],Business_Id=[10,20,30,50],Amount=[100,200,300,missing],Account_Id=["x1","x10","x5","x5"])

new=Dataset(Ins_Id=[1,3,2,4,3,2],B_Id=[10,40,30,40,30,20],AMT=[100,200,missing,-500,350,700],Ac_Id=["x1","x1","x10","x10","x7","x5"])

``````

The combination of insurance_id*Business_Id gives a unique id.

The delta is on the amount (absolute difference greater than 50) and the Account_Id.

The only matches here are (insurance_id,Business_Id) in {(1,10)} so I expect to have all other combinations.

1 Like

is this the result you want?

``````using DataFrames

new=DataFrame(Ins_Id=[1,3,2,4,3,2],B_Id=[10,40,30,40,30,20],AMT=[100,200,missing,-500,350,700],Ac_Id=["x1","x1","x10","x10","x7","x5"])

aon=antijoin(old, new, on= [:Insurance_Id => :Ins_Id,:Business_Id=>:B_Id])
ano=antijoin(new, old, on= [:Ins_Id =>:Insurance_Id,:B_Id => :Business_Id])

ij=innerjoin(new, old, on= [:Ins_Id =>:Insurance_Id,:B_Id => :Business_Id])

fij=filter(r->r.AMT-r.Amount>50,ij)

vcat(fij,aon,ano, cols= :union)
5Γ6 DataFrame
Row β Ins_Id  B_Id   Amount   Account_Id  AMT      Ac_Id
β Int64   Int64  Int64?   String?     Int64?   String?
ββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββββ
1 β      2     20      200  x10             700  x5
2 β      5     50  missing  x5          missing  missing
3 β      3     40  missing  missing         200  x1
4 β      2     30  missing  missing     missing  x10
5 β      4     40  missing  missing        -500  x10

``````

a sketch for a generic function that does this set of operations

``````
function compare(old, new, oldnames, newnames)
rename!(old,Pair.(oldnames, newnames))
aon=antijoin(old, new, on= newnames)
ano=antijoin(new, old, on= newnames)
ij=innerjoin(new, old, on= newnames)
vcat(ij, aon, ano, cols = :union)
end

``````

There isnβt a single function for comparison in `InMemoryDatasets`. You need a combination of join functions, filter, and custom code for solving your problem. Check also this.

thank you for your answers. I opened a request to InMemoryDatasets look

.
The comparison can now be done with the compare function in this package, too.

``````julia> eq_fun(x::Number, y::Number) = abs(x - y) <= 50

julia> eq_fun(x::AbstractString, y::AbstractString) = isequal(x,y)

julia> eq_fun(x,y) = missing

julia> res = compare(old, new,
on = [:Insurance_Id => :Ins_Id, :Business_Id => :B_Id],
cols = [:Amount => :AMT, :Account_Id => :Ac_Id],
eq = eq_fun)
7Γ6 Dataset
Row β Insurance_Id  Business_Id  obs_id_left  obs_id_right  Amount=>AMT  Account_Id=>Ac_Id
β identity      identity     identity     identity      identity     identity
β Int64?        Int64?       Int32?       Int32?        Bool?        Bool?
ββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
1 β            1           10            1             1         true               true
2 β            2           20            2             6        false              false
3 β            3           30            3             5         true              false
4 β            5           50            4       missing      missing            missing
5 β            2           30      missing             3      missing            missing
6 β            3           40      missing             2      missing            missing
7 β            4           40      missing             4      missing            missing
``````
4 Likes

The speed of development of the new features is truly remarkable.
The function seems quite complex. I canβt understand the role of the third method `eq_fun (x, y) = missing`.
Even without it or defining it in a different way (for instance, `eq_fun (x, y) = false`), I get the same result.