Comparing two lists

mjanun · May 12, 2021, 10:59pm

I have two lists that I need to compare. The second list contains names of groups for the first list:

using DataFrames
list1 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324P:CDCL","2324R:CDCL","5666TB:GHLA","789789TSB:ASLT","79067P:OPRA","79067R:OPRA"])
list2 = DataFrame(ID =["0001P:ABCD","0001R:ABCD","2324R:CDCL","5666TB:GHLA"], Pair = ["PairOne","PairOne","Two","Two"])
list_compare = leftjoin(list1,list2, on = :ID)

I use leftjoin to produce the group / pair name in the first list.

julia> list_compare
8×2 DataFrame
 Row │ ID              Pair    
     │ String          String? 
─────┼─────────────────────────
   1 │ 0001P:ABCD      PairOne
   2 │ 0001R:ABCD      PairOne
   3 │ 2324R:CDCL      Two
   4 │ 5666TB:GHLA     Two
   5 │ 2324P:CDCL      missing 
   6 │ 789789TSB:ASLT  missing 
   7 │ 79067P:OPRA     missing 
   8 │ 79067R:OPRA     missing

From here, I want to produce list_comapre2.

There is one complexity. There are some IDs (in the first list) which are related-they have same characters except that one of them contains P: and other R: in the ID. Examples are rows 1 & 2, 3 & 5 and 6 & 7. list2 may specify group name for one of these but not necessarily for other e.g., ID 2324R:CDCL is part of Pair Two but it’s related ID 2323P:CDCL is not listed as under this group. I want to add another column Pair_1 to the combined dataframe that adds group name for the ID missing from list 2, so that I end up with list2_compare2

list_compare2 = list_compare
list_compare2.Pair_1 = ["PairOne","PairOne","Two","Two","Two","789789TSB:ASLT","79067P:OPRA&79067R:OPRA","79067P:OPRA&79067R:OPRA"]
list_compare2.Pair_original = ["True","True","True","True","True","False","False","False"]

julia> list_compare2
8×4 DataFrame
 Row │ ID              Pair     Pair_1                   Pair_original 
     │ String          String?  String                   String        
─────┼─────────────────────────────────────────────────────────────────
   1 │ 0001P:ABCD      PairOne  PairOne                  True
   2 │ 0001R:ABCD      PairOne  PairOne                  True
   3 │ 2324R:CDCL      Two      Two                      True
   4 │ 5666TB:GHLA     Two      Two                      True
   5 │ 2324P:CDCL      missing  Two                      True
   6 │ 789789TSB:ASLT  missing  789789TSB:ASLT           False
   7 │ 79067P:OPRA     missing  79067P:OPRA&79067R:OPRA  False
   8 │ 79067R:OPRA     missing  79067P:OPRA&79067R:OPRA  False

Summary:

for all IDs, I want to add an entry in Pair_1. if an ID containing P: or R: has a reported Pair, but not for both of them then I would like to add create one for which it is missing as explained above
If an ID contains P: or R: and both of these are reported as missing in Pair column then I would like to create an entry for them by concatenating their IDs (e.g. rows 7 and 8 above).
if an ID does not contain either P: or R: and Pair is reported as missing for it then its ID is the entry under Pair_1.
Final column set outs whether the pair existed in the original list.

I tried the following to get the relevant IDs, but I am confused on next steps.

p_list = filter(s->occursin(r"P:",s),list1.ID)
r_list = filter(s->occursin(r"R:",s),list1.ID)

Any views on how I can produce list_compare2 from list1 and list2?

pdeffebach · May 12, 2021, 11:04pm

One note, this line isn’t doing what you probably think it’s doing. In Julia, assignment does not automatically copy, so you are still modifying list_compare when you add columns to list_compare2.

I’m not totally following what you want, unfortunately. I think I would do separate joins, one for a “strict” comparison and one where you have gotten rid of the P: and R:s that are causing problems.

mjanun · May 12, 2021, 11:13pm

I don’t want to get rid of P: and R:. If the above sound confusing, I can give on example on what I am trying to do. ID 2324R:CDCL has Pair listed Two under the Pair column. It has a related ID 2324P:CDCL which is not listed as Pair Two. I want to create an entry of Two in Pair_1 column for 2324P:CDCL . So that I end up with Two in Pair column for both of them. Does that make sense?

pdeffebach · May 12, 2021, 11:16pm

Yes. I’m not proposing deleting that information entirely. But if two things are a pseudo-match, then you should be able to, at least temporarily, transform the pseudo-matches so they are exact matches.

rocco_sprmnt21 · May 17, 2021, 6:44am

The rules to be applied are not precisely linear (perhaps setting the previous steps differently, it would not be necessary to do these stunts), but, waiting for a “clean” solution, maybe you can settle for a little patched one.

function pairing(s1,s2)
    if length(s1)==length(s2)
        idx=findall(x->x!=0,cmp.(collect(s1),collect(s2)))
        return Set([s1[idx],s2[idx]])
    else
        return Set([])
    end
end

list.twin=[[i, something(findfirst(x->x==1,[Set(["P","R"])==pairing(list1.ID[i],list1.ID[j]) for j in 1:8]),i)] for i in 1:8]
list.pair1=[t[2]  !=  nothing ? coalesce(list.Pair[t[1]],list.Pair[t[2]])  : coalesce(list.Pair[t[1]],list.ID[t[1]])   for t in list.twin ]
list.pair_1=[coalesce(list.pair1[r],join(Set(list.ID[list.twin[r]])," & ")) for r in 1:8 ]

julia> list
8×5 DataFrame
 Row │ ID              Pair     twin    pair1    pair_1
     │ String          String?  Array…  String?  String
─────┼─────────────────────────────────────────────────────────────────────
   1 │ 0001P:ABCD      PairOne  [1, 2]  PairOne  PairOne
   2 │ 0001R:ABCD      PairOne  [2, 1]  PairOne  PairOne
   3 │ 2324P:CDCL      missing  [3, 4]  Two      Two
   4 │ 2324R:CDCL      Two      [4, 3]  Two      Two
   5 │ 5666TB:GHLA     Two      [5, 5]  Two      Two
   6 │ 789789TSB:ASLT  missing  [6, 6]  missing  789789TSB:ASLT
   7 │ 79067P:OPRA     missing  [7, 8]  missing  79067P:OPRA & 79067R:OPRA
   8 │ 79067R:OPRA     missing  [8, 7]  missing  79067P:OPRA & 79067R:OPRA

mjanun · May 17, 2021, 11:00am

Thank you. Unless I am missing something, but I don’t get the same results as the one you have shown. For 2324P:CDCL, pair_1 is showing up as 2324P:CDCL

Here is the complete code:

using DataFrames
list1 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324P:CDCL","2324R:CDCL","5666TB:GHLA","789789TSB:ASLT","79067P:OPRA","79067R:OPRA"])
list2 = DataFrame(ID =["0001P:ABCD","0001R:ABCD","2324R:CDCL","5666TB:GHLA"], Pair = ["PairOne","PairOne","Two","Two"])

function pairing(s1,s2)
    if length(s1)==length(s2)
        idx=findall(x->x!=0,cmp.(collect(s1),collect(s2)))
        return Set([s1[idx],s2[idx]])
    else
        return Set([])
    end
end
list = leftjoin(list1,list2, on = :ID)
list.twin=[[i, something(findfirst(x->x==1,[Set(["P","R"])==pairing(list1.ID[i],list1.ID[j]) for j in 1:8]),i)] for i in 1:8]
list.pair1=[t[2]  !=  nothing ? coalesce(list.Pair[t[1]],list.Pair[t[2]])  : coalesce(list.Pair[t[1]],list.ID[t[1]])   for t in list.twin ]
list.pair_1=[coalesce(list.pair1[r],join(Set(list.ID[list.twin[r]])," & ")) for r in 1:8 ]

julia> list
8×5 DataFrame
 Row │ ID              Pair     twin    pair1    pair_1
     │ String          String?  Array…  String?  String
─────┼─────────────────────────────────────────────────────────────────────
   1 │ 0001P:ABCD      PairOne  [1, 2]  PairOne  PairOne
   2 │ 0001R:ABCD      PairOne  [2, 1]  PairOne  PairOne
   3 │ 2324R:CDCL      Two      [3, 4]  Two      Two
   4 │ 5666TB:GHLA     Two      [4, 3]  Two      Two
   5 │ 2324P:CDCL      missing  [5, 5]  missing  2324P:CDCL
   6 │ 789789TSB:ASLT  missing  [6, 6]  missing  789789TSB:ASLT
   7 │ 79067P:OPRA     missing  [7, 8]  missing  79067P:OPRA & 79067R:OPRA
   8 │ 79067R:OPRA     missing  [8, 7]  missing  79067P:OPRA & 79067R:OPRA

sijo · May 17, 2021, 11:01am

Nice challenge Here’s another solution.

using DataFrames

list1 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324P:CDCL","2324R:CDCL","5666TB:GHLA","789789TSB:ASLT","79067P:OPRA","79067R:OPRA"])
list2 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324R:CDCL","5666TB:GHLA"], Pair=["PairOne","PairOne","Two","Two"])

l = leftjoin(list1, list2, on=:ID, source=:Pair_original)

We define one helper function for parsing the IDs:

function parse_id(id)
    i = findfirst(':', id)
    return (type=id[[i-1]], untyped=id[1:i-2] * '_' * id[i:end])
end

julia> l2 = transform(l, :ID => ByRow(parse_id) => AsTable,
                         :Pair_original => c->c.=="both", renamecols=false)
8×5 DataFrame
 Row │ ID              Pair     Pair_original  type    untyped        
     │ String          String?  Bool           String  String         
─────┼────────────────────────────────────────────────────────────────
   1 │ 0001P:ABCD      PairOne           true  P       0001_:ABCD
   2 │ 0001R:ABCD      PairOne           true  R       0001_:ABCD
   3 │ 2324R:CDCL      Two               true  R       2324_:CDCL
   4 │ 5666TB:GHLA     Two               true  B       5666T_:GHLA
   5 │ 2324P:CDCL      missing          false  P       2324_:CDCL
   6 │ 789789TSB:ASLT  missing          false  B       789789TS_:ASLT
   7 │ 79067P:OPRA     missing          false  P       79067_:OPRA
   8 │ 79067R:OPRA     missing          false  R       79067_:OPRA

and one for choosing the pair value using data from all relevant rows:

make_pair(ids, pairs) = coalesce(pairs..., join(ids, " & "));

julia> select(groupby(l2, :untyped), :ID, :Pair_original,
              [:ID, :Pair] => make_pair => :Pair, keepkeys=false)
8×3 DataFrame
 Row │ ID              Pair_original  Pair                      
     │ String          Bool           String                    
─────┼──────────────────────────────────────────────────────────
   1 │ 0001P:ABCD               true  PairOne
   2 │ 0001R:ABCD               true  PairOne
   3 │ 2324R:CDCL               true  Two
   4 │ 5666TB:GHLA              true  Two
   5 │ 2324P:CDCL              false  Two
   6 │ 789789TSB:ASLT          false  789789TSB:ASLT
   7 │ 79067P:OPRA             false  79067P:OPRA & 79067R:OPRA
   8 │ 79067R:OPRA             false  79067P:OPRA & 79067R:OPRA

The parse_id function stores the “type” character (R/P/B) in a column but it’s not used actually, so you could simplify the function to just return (untyped=id[1:i-2] * '_' * id[i:end],).

Edit: the solution above assigns the same “pair” to all related IDs. If it is possible that related IDs have pairs that are different (not missing), you can adapt make_pairs like this:

make_pair(ids, pairs) = coalesce.(pairs, pairs..., join(ids, " & "))

rocco_sprmnt21 · May 17, 2021, 12:42pm

I’m sorry but I can’t reproduce your result.
This is what I get by copying and pasting the code used by you.
Try starting from a clean situation, restarting Julia.

Even if list1 is not sorted, the result appears to be as expected. Or not?

mjanun · May 17, 2021, 4:23pm

I have checked on a fresh Julia session, and I still get different results. My version of DataFrames may be different from yours which may result in different results

mjanun · May 17, 2021, 4:24pm

Thank you

I do not understand what this part of the code is doing. Could you please explain?

sijo · May 17, 2021, 4:44pm

It’s creating a named tuple. You can execute each part independently to see what it’s doing:

julia> id = "0001P:ABCD";

julia> i = findfirst(':', id)
6

julia> id[[i-1]]
"P"

julia> id[1:i-2]
"0001"

julia> id[i:end]
":ABCD"

julia> (type=id[[i-1]], untyped=id[1:i-2] * '_' * id[i:end])
(type = "P", untyped = "0001_:ABCD")

Some remarks:

I used id[[i-1]] instead of id[i-1] to get a string "P" rather than a char 'P' (I needed a string when experimenting with another solution, but in the present solution this type value is not used so it doesn’t matter).

* is the concatenation operator for strings:

 julia> id[1:i-2] * '_' * id[i:end]
 "0001_:ABCD"

The last line is making the named tuple, same syntax as (a=1, b=2).

rocco_sprmnt21 · May 17, 2021, 5:00pm

try this

list1 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324P:CDCL","5666TB:GHLA","789789TSB:ASLT","2324R:CDCL","79067P:OPRA","79067R:OPRA"])
list2 = DataFrame(ID =["0001P:ABCD","0001R:ABCD","2324R:CDCL","5666TB:GHLA"], Pair = ["PairOne","PairOne","Two","Two"])

function pairing(s1,s2)
    if length(s1)==length(s2)
        idx=findall(x->x!=0,cmp.(collect(s1),collect(s2)))
        return Set([s1[idx],s2[idx]])
    else
        return Set([])
    end
end
list = leftjoin(list1,list2, on = :ID)
list.twin=[[i, something(findfirst(x->x==1,[Set(["P","R"])==pairing(list.ID[i],list.ID[j]) for j in 1:8]),i)] for i in 1:8]
list.pair1=[coalesce(list.Pair[t[1]],list.Pair[t[2]])   for t in list.twin ]
list.pair_1=[coalesce(list.pair1[r],join(Set(list.ID[list.twin[r]])," & ")) for r in 1:8 ]

mjanun · May 17, 2021, 5:14pm

rocco_sprmnt21:

list1 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324P:CDCL","5666TB:GHLA","789789TSB:ASLT","2324R:CDCL","79067P:OPRA","79067R:OPRA"])
list2 = DataFrame(ID =["0001P:ABCD","0001R:ABCD","2324R:CDCL","5666TB:GHLA"], Pair = ["PairOne","PairOne","Two","Two"])

function pairing(s1,s2)
    if length(s1)==length(s2)
        idx=findall(x->x!=0,cmp.(collect(s1),collect(s2)))
        return Set([s1[idx],s2[idx]])
    else
        return Set([])
    end
end
list = leftjoin(list1,list2, on = :ID)
list.twin=[[i, something(findfirst(x->x==1,[Set(["P","R"])==pairing(list.ID[i],list.ID[j]) for j in 1:8]),i)] for i in 1:8]
list.pair1=[coalesce(list.Pair[t[1]],list.Pair[t[2]])   for t in list.twin ]
list.pair_1=[coalesce(list.pair1[r],join(Set(list.ID[list.twin[r]])," & ")) for r in 1:8 ]

Thanks. This works

mjanun · May 17, 2021, 10:42pm

Thanks for the explanation. I just noticed that it is concatenating some IDs which do not contain P: or R: on another dataset. If you have two IDs that have most characters same except one e.g. 222367_1:B421_GHL and 222367_2:B421_GHL then it creates a pair
222367_1:B421_GHL & 222367_2:B421_GHL

Is there anything I am doing incorrectly?

DataFrames · May 18, 2021, 12:47am

This is another solution: (your last column is not clear since for example 2324P:CDCL was not in original data but it is marked as true)

# list2 ID and Pair
ref = Dict(list2[!, 1] .=> list2[!, 2])

# rules:
contain_PR(x) = occursin("P:", x) || occursin("R:", x)
change_PR(x) = occursin("P:", x) ? replace(x, "P:"=>"R:") : replace(x, "R:"=>"P:")
concat_them(x) = occursin("P:", x) ? x*change_PR(x) : change_PR(x)*x
function compute_pair_1(x, y; ref = ref)
    if ismissing(y) && contain_PR(x)
        return get(ref, change_PR(x), concat_them(x))
    elseif ismissing(y)
        return x
    else
        return y
    end
end


# apply the function
transform(list_compare, [:ID, :Pair] => ((x,y) -> compute_pair_1.(x, y, ref = ref)) => :Pair_1)

sijo · May 18, 2021, 6:43am

Yes, I was not sure what was desired in this case. Here’s a version that gives special treatment only for P and R:

using DataFrames
list1 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324P:CDCL","2324R:CDCL","5666TB:GHLA","789789TSB:ASLT","79067P:OPRA","79067R:OPRA"])
list2 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324R:CDCL","5666TB:GHLA"], Pair=["PairOne","PairOne","Two","Two"])

l = leftjoin(list1, list2, on=:ID, source=:source)
l.Pair_original = l.source .== "both"

function id_family(id)
    i = findfirst(':', id)
    if id[i-1] in "PR"
        return id[1:i-2] * '_' * id[i:end]
    else
        return id
    end
end

l.family = id_family.(l.ID)

make_pair(ids, pairs) = coalesce.(pairs, pairs..., join(ids, " & "))

select(groupby(l, :family), :ID, :Pair_original,
       [:ID, :Pair] => make_pair => :Pair, keepkeys=false)

mjanun · May 18, 2021, 4:49pm

Thank you

2324P:CDCL is missing as under Pair, but 2324R:CDCL has Two under Pair. If only one of the related ID has a reported Pair then for the other that one automatically applies. All IDs which have same characters except P: or R: are related.

Your code is not creating all the pairs of P: and R: where both of them are reported as missing under Pair

DataFrames · May 19, 2021, 2:23am

Make sure that you didn’t miss any code (“.” are important)

using DataFrames
list1 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324P:CDCL","2324R:CDCL","5666TB:GHLA","789789TSB:ASLT","79067P:OPRA","79067R:OPRA"])
list2 = DataFrame(ID =["0001P:ABCD","0001R:ABCD","2324R:CDCL","5666TB:GHLA"], Pair = ["PairOne","PairOne","Two","Two"])
list_compare = leftjoin(list1,list2, on = :ID)

# list2 ID and Pair
ref = Dict(list2[!, 1] .=> list2[!, 2])

# rules:
contain_PR(x) = occursin("P:", x) || occursin("R:", x)
change_PR(x) = occursin("P:", x) ? replace(x, "P:"=>"R:") : replace(x, "R:"=>"P:")
concat_them(x) = occursin("P:", x) ? x*change_PR(x) : change_PR(x)*x
function compute_pair_1(x, y; ref = ref)
    if ismissing(y) && contain_PR(x)
        return get(ref, change_PR(x), concat_them(x))
    elseif ismissing(y)
        return x
    else
        return y
    end
end
function compute_original(x, y; ref = ref)
       if ismissing(y)
            contain_PR(x) && haskey(ref, change_PR(x)) ? true : false
       else
            return true
        end
end

# apply the function
transform(list_compare, 
                 [:ID, :Pair] => ((x,y) -> compute_pair_1.(x, y, ref = ref)) => :Pair_1, 
                 [:ID, :Pair] => ((x,y) -> compute_original.(x, y, ref = ref)) => :original)

8×4 DataFrame
 Row │ ID              Pair     Pair_1                  original 
     │ String          String?  String                  Bool     
─────┼───────────────────────────────────────────────────────────
   1 │ 0001P:ABCD      PairOne  PairOne                     true
   2 │ 0001R:ABCD      PairOne  PairOne                     true
   3 │ 2324R:CDCL      Two      Two                         true
   4 │ 5666TB:GHLA     Two      Two                         true
   5 │ 2324P:CDCL      missing  Two                         true
   6 │ 789789TSB:ASLT  missing  789789TSB:ASLT             false
   7 │ 79067P:OPRA     missing  79067P:OPRA79067R:OPRA     false
   8 │ 79067R:OPRA     missing  79067P:OPRA79067R:OPRA     false

Topic		Replies	Views
Differences in Two DataFrames General Usage dataframes	5	887	November 11, 2022
How to compare non-missing elements of two DataFrames New to Julia	3	566	July 1, 2020
Is there a package to do DataFrames comparisons? General Usage dataframes	4	1295	August 3, 2021
Leftjoin with replacement Data dataframes	3	331	July 23, 2022
Leftjoin order of dataframe New to Julia dataframes	5	796	May 17, 2021

Comparing two lists

Related topics