Compare 2 data sets - similar to SAS proc compare

Question: what is the best command for comparing 2 datasets - something like proc compare in SAS ?
I could not edit this Comparison 2 data sets , So I decide to open this new topic.

I have the following example

old=Dataset(Insurance_Id=[1,2,3,5],Business_Id=[10,20,30,50],Amount=[100,200,300,missing],Account_Id=["x1","x10","x5","x5"])

new=Dataset(Ins_Id=[1,3,2,4,3,2],B_Id=[10,40,30,40,30,20],AMT=[100,200,missing,-500,350,700],Ac_Id=["x1","x1","x10","x10","x7","x5"])

The combination of insurance_id*Business_Id gives a unique id.

The delta is on the amount (absolute difference greater than 50) and the Account_Id.

The only matches here are (insurance_id,Business_Id) in {(1,10)} so I expect to have all other combinations.

1 Like

is this the result you want?

using DataFrames
old=DataFrame(Insurance_Id=[1,2,3,5],Business_Id=[10,20,30,50],Amount=[100,200,300,missing],Account_Id=["x1","x10","x5","x5"])

new=DataFrame(Ins_Id=[1,3,2,4,3,2],B_Id=[10,40,30,40,30,20],AMT=[100,200,missing,-500,350,700],Ac_Id=["x1","x1","x10","x10","x7","x5"])

aon=antijoin(old, new, on= [:Insurance_Id => :Ins_Id,:Business_Id=>:B_Id])
ano=antijoin(new, old, on= [:Ins_Id =>:Insurance_Id,:B_Id => :Business_Id])
rename!(aon,:Insurance_Id => :Ins_Id ,:Business_Id=>:B_Id)

ij=innerjoin(new, old, on= [:Ins_Id =>:Insurance_Id,:B_Id => :Business_Id])

fij=filter(r->r.AMT-r.Amount>50,ij)

vcat(fij,aon,ano, cols= :union)
5Γ—6 DataFrame
 Row β”‚ Ins_Id  B_Id   Amount   Account_Id  AMT      Ac_Id   
     β”‚ Int64   Int64  Int64?   String?     Int64?   String?
─────┼──────────────────────────────────────────────────────
   1 β”‚      2     20      200  x10             700  x5
   2 β”‚      5     50  missing  x5          missing  missing
   3 β”‚      3     40  missing  missing         200  x1
   4 β”‚      2     30  missing  missing     missing  x10
   5 β”‚      4     40  missing  missing        -500  x10

a sketch for a generic function that does this set of operations


function compare(old, new, oldnames, newnames)
    rename!(old,Pair.(oldnames, newnames))
    aon=antijoin(old, new, on= newnames)
    ano=antijoin(new, old, on= newnames)
    ij=innerjoin(new, old, on= newnames)
    vcat(ij, aon, ano, cols = :union)
end

1 Like

There isn’t a single function for comparison in InMemoryDatasets. You need a combination of join functions, filter, and custom code for solving your problem. Check also this.

thank you for your answers. I opened a request to InMemoryDatasets look
https://github.com/sl-solution/InMemoryDatasets.jl/issues/53
.
The comparison can now be done with the compare function in this package, too.

julia> eq_fun(x::Number, y::Number) = abs(x - y) <= 50

julia> eq_fun(x::AbstractString, y::AbstractString) = isequal(x,y)

julia> eq_fun(x,y) = missing

julia> res = compare(old, new, 
                     on = [:Insurance_Id => :Ins_Id, :Business_Id => :B_Id],
                     cols = [:Amount => :AMT, :Account_Id => :Ac_Id],
                     eq = eq_fun)
7Γ—6 Dataset
 Row β”‚ Insurance_Id  Business_Id  obs_id_left  obs_id_right  Amount=>AMT  Account_Id=>Ac_Id 
     β”‚ identity      identity     identity     identity      identity     identity          
     β”‚ Int64?        Int64?       Int32?       Int32?        Bool?        Bool?             
─────┼──────────────────────────────────────────────────────────────────────────────────────
   1 β”‚            1           10            1             1         true               true
   2 β”‚            2           20            2             6        false              false
   3 β”‚            3           30            3             5         true              false
   4 β”‚            5           50            4       missing      missing            missing 
   5 β”‚            2           30      missing             3      missing            missing 
   6 β”‚            3           40      missing             2      missing            missing 
   7 β”‚            4           40      missing             4      missing            missing 
5 Likes

The speed of development of the new features is truly remarkable. :clap:
The function seems quite complex. I can’t understand the role of the third method eq_fun (x, y) = missing.
Even without it or defining it in a different way (for instance, eq_fun (x, y) = false), I get the same result.