Compare 2 data sets - similar to SAS proc compare

Question: what is the best command for comparing 2 datasets - something like proc compare in SAS ?
I could not edit this Comparison 2 data sets , So I decide to open this new topic.

I have the following example



The combination of insurance_id*Business_Id gives a unique id.

The delta is on the amount (absolute difference greater than 50) and the Account_Id.

The only matches here are (insurance_id,Business_Id) in {(1,10)} so I expect to have all other combinations.

1 Like

is this the result you want?

using DataFrames


aon=antijoin(old, new, on= [:Insurance_Id => :Ins_Id,:Business_Id=>:B_Id])
ano=antijoin(new, old, on= [:Ins_Id =>:Insurance_Id,:B_Id => :Business_Id])
rename!(aon,:Insurance_Id => :Ins_Id ,:Business_Id=>:B_Id)

ij=innerjoin(new, old, on= [:Ins_Id =>:Insurance_Id,:B_Id => :Business_Id])


vcat(fij,aon,ano, cols= :union)
5Γ—6 DataFrame
 Row β”‚ Ins_Id  B_Id   Amount   Account_Id  AMT      Ac_Id   
     β”‚ Int64   Int64  Int64?   String?     Int64?   String?
   1 β”‚      2     20      200  x10             700  x5
   2 β”‚      5     50  missing  x5          missing  missing
   3 β”‚      3     40  missing  missing         200  x1
   4 β”‚      2     30  missing  missing     missing  x10
   5 β”‚      4     40  missing  missing        -500  x10

a sketch for a generic function that does this set of operations

function compare(old, new, oldnames, newnames)
    rename!(old,Pair.(oldnames, newnames))
    aon=antijoin(old, new, on= newnames)
    ano=antijoin(new, old, on= newnames)
    ij=innerjoin(new, old, on= newnames)
    vcat(ij, aon, ano, cols = :union)

There isn’t a single function for comparison in InMemoryDatasets. You need a combination of join functions, filter, and custom code for solving your problem. Check also this.

thank you for your answers. I opened a request to InMemoryDatasets look

The comparison can now be done with the compare function in this package, too.

julia> eq_fun(x::Number, y::Number) = abs(x - y) <= 50

julia> eq_fun(x::AbstractString, y::AbstractString) = isequal(x,y)

julia> eq_fun(x,y) = missing

julia> res = compare(old, new, 
                     on = [:Insurance_Id => :Ins_Id, :Business_Id => :B_Id],
                     cols = [:Amount => :AMT, :Account_Id => :Ac_Id],
                     eq = eq_fun)
7Γ—6 Dataset
 Row β”‚ Insurance_Id  Business_Id  obs_id_left  obs_id_right  Amount=>AMT  Account_Id=>Ac_Id 
     β”‚ identity      identity     identity     identity      identity     identity          
     β”‚ Int64?        Int64?       Int32?       Int32?        Bool?        Bool?             
   1 β”‚            1           10            1             1         true               true
   2 β”‚            2           20            2             6        false              false
   3 β”‚            3           30            3             5         true              false
   4 β”‚            5           50            4       missing      missing            missing 
   5 β”‚            2           30      missing             3      missing            missing 
   6 β”‚            3           40      missing             2      missing            missing 
   7 β”‚            4           40      missing             4      missing            missing 

The speed of development of the new features is truly remarkable. :clap:
The function seems quite complex. I can’t understand the role of the third method eq_fun (x, y) = missing.
Even without it or defining it in a different way (for instance, eq_fun (x, y) = false), I get the same result.