Compare 2 data sets - similar to SAS proc compare

ab2z · April 8, 2022, 9:07am

Question: what is the best command for comparing 2 datasets - something like proc compare in SAS ?
I could not edit this Comparison 2 data sets , So I decide to open this new topic.

I have the following example

old=Dataset(Insurance_Id=[1,2,3,5],Business_Id=[10,20,30,50],Amount=[100,200,300,missing],Account_Id=["x1","x10","x5","x5"])

new=Dataset(Ins_Id=[1,3,2,4,3,2],B_Id=[10,40,30,40,30,20],AMT=[100,200,missing,-500,350,700],Ac_Id=["x1","x1","x10","x10","x7","x5"])

The combination of insurance_id*Business_Id gives a unique id.

The delta is on the amount (absolute difference greater than 50) and the Account_Id.

The only matches here are (insurance_id,Business_Id) in {(1,10)} so I expect to have all other combinations.

rocco_sprmnt21 · April 9, 2022, 2:20pm

is this the result you want?

using DataFrames
old=DataFrame(Insurance_Id=[1,2,3,5],Business_Id=[10,20,30,50],Amount=[100,200,300,missing],Account_Id=["x1","x10","x5","x5"])

new=DataFrame(Ins_Id=[1,3,2,4,3,2],B_Id=[10,40,30,40,30,20],AMT=[100,200,missing,-500,350,700],Ac_Id=["x1","x1","x10","x10","x7","x5"])

aon=antijoin(old, new, on= [:Insurance_Id => :Ins_Id,:Business_Id=>:B_Id])
ano=antijoin(new, old, on= [:Ins_Id =>:Insurance_Id,:B_Id => :Business_Id])
rename!(aon,:Insurance_Id => :Ins_Id ,:Business_Id=>:B_Id)

ij=innerjoin(new, old, on= [:Ins_Id =>:Insurance_Id,:B_Id => :Business_Id])

fij=filter(r->r.AMT-r.Amount>50,ij)

vcat(fij,aon,ano, cols= :union)
5×6 DataFrame
 Row │ Ins_Id  B_Id   Amount   Account_Id  AMT      Ac_Id   
     │ Int64   Int64  Int64?   String?     Int64?   String?
─────┼──────────────────────────────────────────────────────
   1 │      2     20      200  x10             700  x5
   2 │      5     50  missing  x5          missing  missing
   3 │      3     40  missing  missing         200  x1
   4 │      2     30  missing  missing     missing  x10
   5 │      4     40  missing  missing        -500  x10

a sketch for a generic function that does this set of operations


function compare(old, new, oldnames, newnames)
    rename!(old,Pair.(oldnames, newnames))
    aon=antijoin(old, new, on= newnames)
    ano=antijoin(new, old, on= newnames)
    ij=innerjoin(new, old, on= newnames)
    vcat(ij, aon, ano, cols = :union)
end

DataFrames · April 10, 2022, 2:10am

There isn’t a single function for comparison in InMemoryDatasets. You need a combination of join functions, filter, and custom code for solving your problem. Check also this.

ab2z · April 11, 2022, 7:17am

thank you for your answers. I opened a request to InMemoryDatasets look
https://github.com/sl-solution/InMemoryDatasets.jl/issues/53
.
The comparison can now be done with the compare function in this package, too.

julia> eq_fun(x::Number, y::Number) = abs(x - y) <= 50

julia> eq_fun(x::AbstractString, y::AbstractString) = isequal(x,y)

julia> eq_fun(x,y) = missing

julia> res = compare(old, new, 
                     on = [:Insurance_Id => :Ins_Id, :Business_Id => :B_Id],
                     cols = [:Amount => :AMT, :Account_Id => :Ac_Id],
                     eq = eq_fun)
7×6 Dataset
 Row │ Insurance_Id  Business_Id  obs_id_left  obs_id_right  Amount=>AMT  Account_Id=>Ac_Id 
     │ identity      identity     identity     identity      identity     identity          
     │ Int64?        Int64?       Int32?       Int32?        Bool?        Bool?             
─────┼──────────────────────────────────────────────────────────────────────────────────────
   1 │            1           10            1             1         true               true
   2 │            2           20            2             6        false              false
   3 │            3           30            3             5         true              false
   4 │            5           50            4       missing      missing            missing 
   5 │            2           30      missing             3      missing            missing 
   6 │            3           40      missing             2      missing            missing 
   7 │            4           40      missing             4      missing            missing

rocco_sprmnt21 · April 11, 2022, 6:02pm

The speed of development of the new features is truly remarkable.
The function seems quite complex. I can’t understand the role of the third method eq_fun (x, y) = missing.
Even without it or defining it in a different way (for instance, eq_fun (x, y) = false), I get the same result.

Topic		Replies	Views
Comparison 2 data sets Finance and Economics testing , data	6	798	April 5, 2022
Is there a package to do DataFrames comparisons? General Usage dataframes	4	1305	August 3, 2021
How to supplement a function for equality check when joining the DataFrames Data dataframes	2	86	July 5, 2024
Set-like operations on DataFrames? Data question , dataframes	13	2560	February 1, 2023
Differences in Two DataFrames General Usage dataframes	5	896	November 11, 2022

Compare 2 data sets - similar to SAS proc compare

Related topics