Comparison 2 data sets

Hi all,
I am new in Julia so I would like to start with a simple question.
In testing the migration activities from old to a new system/setting/etc, we mostly need to compare 2 output data sets and extract the deltas.
Question: what is the best command for comparing 2 datasets - something like proc compare in SAS ?

Thanks+regards

It would be helpful if you could provide a bit more detail on what you want to happen - I’m not familiar with SAS. Something like:

julia> using DataFrames

julia> df1 = DataFrame(rand(1:5, 10, 5), :auto); df2 = DataFrame(rand(1:5, 10, 5), :auto)
10Γ—5 DataFrame
 Row β”‚ x1     x2     x3     x4     x5
     β”‚ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 β”‚     4      5      1      3      3
   2 β”‚     1      4      1      3      1
   3 β”‚     5      2      3      4      3
   4 β”‚     2      2      3      3      5
   5 β”‚     1      4      4      2      1
   6 β”‚     4      3      1      2      1
   7 β”‚     5      3      5      1      1
   8 β”‚     4      3      3      4      3
   9 β”‚     4      2      4      3      2
  10 β”‚     1      4      4      2      2

julia> df1 .== df2
10Γ—5 DataFrame
 Row β”‚ x1     x2     x3     x4     x5
     β”‚ Bool   Bool   Bool   Bool   Bool
─────┼───────────────────────────────────
   1 β”‚ false   true   true   true  false
   2 β”‚ false  false   true   true   true
   3 β”‚ false   true  false  false  false
   4 β”‚ false   true  false  false  false
   5 β”‚ false   true  false  false   true
   6 β”‚ false  false   true   true  false
   7 β”‚ false  false  false  false  false
   8 β”‚  true  false  false   true  false
   9 β”‚ false  false  false  false  false
  10 β”‚ false   true  false  false  false

julia> df1 .- df2
10Γ—5 DataFrame
 Row β”‚ x1     x2     x3     x4     x5
     β”‚ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 β”‚    -2      0      0      0     -2
   2 β”‚     3     -2      0      0      0
   3 β”‚    -3      0     -2      1      1
   4 β”‚     3      0      1     -2     -2
   5 β”‚     2      0     -2     -1      0
   6 β”‚    -2      2      0      0      2
   7 β”‚    -2      1     -4      4      4
   8 β”‚     0      2     -1      0     -1
   9 β”‚    -1     -1     -1     -1      2
  10 β”‚     1      0     -2      2      1
1 Like

Assuming you are talking about arrays or simple tables, a solution is to use DeepDiffs.jl:

julia> tbl_x = [(a=1, b=2), (a=3, b=4)]
julia> tbl_y = [(a=1, b=3), (a=3, b=4)]
julia> using Tables, DeepDiffs
julia> map(deepdiff, tbl_x |> columntable, tbl_y |> columntable)

gives the following colored output:
image

2 Likes

I have the following example

old=Dataset(Insurance_Id=[1,2,3],Business_Id=[10,20,30],Amount=[100,200,300])

new=Dataset(Ins_Id=[1,3,2,4,3,2],B_Id=[10,40,30,40,30,20],AMT=[100,200,300,-500,350,700])

The combination of insurance_id*Business_Id gives a unique id.

The delta is on the amount absolute difference greater than 50.

The only matches here are (insurance_id,Business_Id) in {(1,10),(3,30)} so I expect to have all other combinations.

My initial thought is to use innerjoin on [:Insurance_Id => Ins_Id, :Business_Id=>:B_Id](using DataFrames.jl) and then compare the two amounts using filter function that are implemented in TableOperations.jl or other table packages.
(I think you would like to post this in data topic, as this is something related to table operation, but one function to compare two tables with different shapes seems no exist right now)

Please include code inside of backticks (```) so that it’s easier to read. Here’s how to do what Frankiewaang suggested:

using DataFrames

old = DataFrame(Insurance_Id = [1,2,3], Business_Id = [10,20,30], Amount = [100,200,300])

new = DataFrame(Ins_Id = [1,3,2,4,3,2], B_Id = [10,40,30,40,30,20], AMT = [100,200,300,-500,350,700])

joined = innerjoin(old, new, on=[:Insurance_Id => :Ins_Id, :Business_Id => :B_Id])

diffs = filter(row -> abs(row.Amount - row.AMT) > 50, joined)

1Γ—4 DataFrame
 Row β”‚ Insurance_Id  Business_Id  Amount  AMT   
     β”‚ Int64         Int64        Int64   Int64 
─────┼──────────────────────────────────────────
   1 β”‚            2           20     200    700
1 Like

I can not edit my post and put the code in backticks. maybe next time.