Comparing two lists

I have two lists that I need to compare. The second list contains names of groups for the first list:

using DataFrames
list1 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324P:CDCL","2324R:CDCL","5666TB:GHLA","789789TSB:ASLT","79067P:OPRA","79067R:OPRA"])
list2 = DataFrame(ID =["0001P:ABCD","0001R:ABCD","2324R:CDCL","5666TB:GHLA"], Pair = ["PairOne","PairOne","Two","Two"])
list_compare = leftjoin(list1,list2, on = :ID)

I use leftjoin to produce the group / pair name in the first list.

julia> list_compare
8Γ—2 DataFrame
 Row β”‚ ID              Pair    
     β”‚ String          String? 
─────┼─────────────────────────
   1 β”‚ 0001P:ABCD      PairOne
   2 β”‚ 0001R:ABCD      PairOne
   3 β”‚ 2324R:CDCL      Two
   4 β”‚ 5666TB:GHLA     Two
   5 β”‚ 2324P:CDCL      missing 
   6 β”‚ 789789TSB:ASLT  missing 
   7 β”‚ 79067P:OPRA     missing 
   8 β”‚ 79067R:OPRA     missing 

From here, I want to produce list_comapre2.

There is one complexity. There are some IDs (in the first list) which are related-they have same characters except that one of them contains P: and other R: in the ID. Examples are rows 1 & 2, 3 & 5 and 6 & 7. list2 may specify group name for one of these but not necessarily for other e.g., ID 2324R:CDCL is part of Pair Two but it’s related ID 2323P:CDCL is not listed as under this group. I want to add another column Pair_1 to the combined dataframe that adds group name for the ID missing from list 2, so that I end up with list2_compare2

list_compare2 = list_compare
list_compare2.Pair_1 = ["PairOne","PairOne","Two","Two","Two","789789TSB:ASLT","79067P:OPRA&79067R:OPRA","79067P:OPRA&79067R:OPRA"]
list_compare2.Pair_original = ["True","True","True","True","True","False","False","False"]
julia> list_compare2
8Γ—4 DataFrame
 Row β”‚ ID              Pair     Pair_1                   Pair_original 
     β”‚ String          String?  String                   String        
─────┼─────────────────────────────────────────────────────────────────
   1 β”‚ 0001P:ABCD      PairOne  PairOne                  True
   2 β”‚ 0001R:ABCD      PairOne  PairOne                  True
   3 β”‚ 2324R:CDCL      Two      Two                      True
   4 β”‚ 5666TB:GHLA     Two      Two                      True
   5 β”‚ 2324P:CDCL      missing  Two                      True
   6 β”‚ 789789TSB:ASLT  missing  789789TSB:ASLT           False
   7 β”‚ 79067P:OPRA     missing  79067P:OPRA&79067R:OPRA  False
   8 β”‚ 79067R:OPRA     missing  79067P:OPRA&79067R:OPRA  False

Summary:

  • for all IDs, I want to add an entry in Pair_1. if an ID containing P: or R: has a reported Pair, but not for both of them then I would like to add create one for which it is missing as explained above
  • If an ID contains P: or R: and both of these are reported as missing in Pair column then I would like to create an entry for them by concatenating their IDs (e.g. rows 7 and 8 above).
  • if an ID does not contain either P: or R: and Pair is reported as missing for it then its ID is the entry under Pair_1.
  • Final column set outs whether the pair existed in the original list.

I tried the following to get the relevant IDs, but I am confused on next steps.

p_list = filter(s->occursin(r"P:",s),list1.ID)
r_list = filter(s->occursin(r"R:",s),list1.ID)

Any views on how I can produce list_compare2 from list1 and list2?

One note, this line isn’t doing what you probably think it’s doing. In Julia, assignment does not automatically copy, so you are still modifying list_compare when you add columns to list_compare2.

I’m not totally following what you want, unfortunately. I think I would do separate joins, one for a β€œstrict” comparison and one where you have gotten rid of the P: and R:s that are causing problems.

I don’t want to get rid of P: and R:. If the above sound confusing, I can give on example on what I am trying to do. ID 2324R:CDCL has Pair listed Two under the Pair column. It has a related ID 2324P:CDCL which is not listed as Pair Two. I want to create an entry of Two in Pair_1 column for 2324P:CDCL . So that I end up with Two in Pair column for both of them. Does that make sense?

Yes. I’m not proposing deleting that information entirely. But if two things are a pseudo-match, then you should be able to, at least temporarily, transform the pseudo-matches so they are exact matches.

The rules to be applied are not precisely linear (perhaps setting the previous steps differently, it would not be necessary to do these stunts), but, waiting for a β€œclean” solution, maybe you can settle for a little patched one.

function pairing(s1,s2)
    if length(s1)==length(s2)
        idx=findall(x->x!=0,cmp.(collect(s1),collect(s2)))
        return Set([s1[idx],s2[idx]])
    else
        return Set([])
    end
end

list.twin=[[i, something(findfirst(x->x==1,[Set(["P","R"])==pairing(list1.ID[i],list1.ID[j]) for j in 1:8]),i)] for i in 1:8]
list.pair1=[t[2]  !=  nothing ? coalesce(list.Pair[t[1]],list.Pair[t[2]])  : coalesce(list.Pair[t[1]],list.ID[t[1]])   for t in list.twin ]
list.pair_1=[coalesce(list.pair1[r],join(Set(list.ID[list.twin[r]])," & ")) for r in 1:8 ]
julia> list
8Γ—5 DataFrame
 Row β”‚ ID              Pair     twin    pair1    pair_1
     β”‚ String          String?  Array…  String?  String
─────┼─────────────────────────────────────────────────────────────────────
   1 β”‚ 0001P:ABCD      PairOne  [1, 2]  PairOne  PairOne
   2 β”‚ 0001R:ABCD      PairOne  [2, 1]  PairOne  PairOne
   3 β”‚ 2324P:CDCL      missing  [3, 4]  Two      Two
   4 β”‚ 2324R:CDCL      Two      [4, 3]  Two      Two
   5 β”‚ 5666TB:GHLA     Two      [5, 5]  Two      Two
   6 β”‚ 789789TSB:ASLT  missing  [6, 6]  missing  789789TSB:ASLT
   7 β”‚ 79067P:OPRA     missing  [7, 8]  missing  79067P:OPRA & 79067R:OPRA
   8 β”‚ 79067R:OPRA     missing  [8, 7]  missing  79067P:OPRA & 79067R:OPRA
1 Like

Thank you. Unless I am missing something, but I don’t get the same results as the one you have shown. For 2324P:CDCL, pair_1 is showing up as 2324P:CDCL

Here is the complete code:

using DataFrames
list1 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324P:CDCL","2324R:CDCL","5666TB:GHLA","789789TSB:ASLT","79067P:OPRA","79067R:OPRA"])
list2 = DataFrame(ID =["0001P:ABCD","0001R:ABCD","2324R:CDCL","5666TB:GHLA"], Pair = ["PairOne","PairOne","Two","Two"])

function pairing(s1,s2)
    if length(s1)==length(s2)
        idx=findall(x->x!=0,cmp.(collect(s1),collect(s2)))
        return Set([s1[idx],s2[idx]])
    else
        return Set([])
    end
end
list = leftjoin(list1,list2, on = :ID)
list.twin=[[i, something(findfirst(x->x==1,[Set(["P","R"])==pairing(list1.ID[i],list1.ID[j]) for j in 1:8]),i)] for i in 1:8]
list.pair1=[t[2]  !=  nothing ? coalesce(list.Pair[t[1]],list.Pair[t[2]])  : coalesce(list.Pair[t[1]],list.ID[t[1]])   for t in list.twin ]
list.pair_1=[coalesce(list.pair1[r],join(Set(list.ID[list.twin[r]])," & ")) for r in 1:8 ]
julia> list
8Γ—5 DataFrame
 Row β”‚ ID              Pair     twin    pair1    pair_1
     β”‚ String          String?  Array…  String?  String
─────┼─────────────────────────────────────────────────────────────────────
   1 β”‚ 0001P:ABCD      PairOne  [1, 2]  PairOne  PairOne
   2 β”‚ 0001R:ABCD      PairOne  [2, 1]  PairOne  PairOne
   3 β”‚ 2324R:CDCL      Two      [3, 4]  Two      Two
   4 β”‚ 5666TB:GHLA     Two      [4, 3]  Two      Two
   5 β”‚ 2324P:CDCL      missing  [5, 5]  missing  2324P:CDCL
   6 β”‚ 789789TSB:ASLT  missing  [6, 6]  missing  789789TSB:ASLT
   7 β”‚ 79067P:OPRA     missing  [7, 8]  missing  79067P:OPRA & 79067R:OPRA
   8 β”‚ 79067R:OPRA     missing  [8, 7]  missing  79067P:OPRA & 79067R:OPRA

Nice challenge :slight_smile: Here’s another solution.

using DataFrames

list1 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324P:CDCL","2324R:CDCL","5666TB:GHLA","789789TSB:ASLT","79067P:OPRA","79067R:OPRA"])
list2 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324R:CDCL","5666TB:GHLA"], Pair=["PairOne","PairOne","Two","Two"])

l = leftjoin(list1, list2, on=:ID, source=:Pair_original)

We define one helper function for parsing the IDs:

function parse_id(id)
    i = findfirst(':', id)
    return (type=id[[i-1]], untyped=id[1:i-2] * '_' * id[i:end])
end

julia> l2 = transform(l, :ID => ByRow(parse_id) => AsTable,
                         :Pair_original => c->c.=="both", renamecols=false)
8Γ—5 DataFrame
 Row β”‚ ID              Pair     Pair_original  type    untyped        
     β”‚ String          String?  Bool           String  String         
─────┼────────────────────────────────────────────────────────────────
   1 β”‚ 0001P:ABCD      PairOne           true  P       0001_:ABCD
   2 β”‚ 0001R:ABCD      PairOne           true  R       0001_:ABCD
   3 β”‚ 2324R:CDCL      Two               true  R       2324_:CDCL
   4 β”‚ 5666TB:GHLA     Two               true  B       5666T_:GHLA
   5 β”‚ 2324P:CDCL      missing          false  P       2324_:CDCL
   6 β”‚ 789789TSB:ASLT  missing          false  B       789789TS_:ASLT
   7 β”‚ 79067P:OPRA     missing          false  P       79067_:OPRA
   8 β”‚ 79067R:OPRA     missing          false  R       79067_:OPRA

and one for choosing the pair value using data from all relevant rows:

make_pair(ids, pairs) = coalesce(pairs..., join(ids, " & "));

julia> select(groupby(l2, :untyped), :ID, :Pair_original,
              [:ID, :Pair] => make_pair => :Pair, keepkeys=false)
8Γ—3 DataFrame
 Row β”‚ ID              Pair_original  Pair                      
     β”‚ String          Bool           String                    
─────┼──────────────────────────────────────────────────────────
   1 β”‚ 0001P:ABCD               true  PairOne
   2 β”‚ 0001R:ABCD               true  PairOne
   3 β”‚ 2324R:CDCL               true  Two
   4 β”‚ 5666TB:GHLA              true  Two
   5 β”‚ 2324P:CDCL              false  Two
   6 β”‚ 789789TSB:ASLT          false  789789TSB:ASLT
   7 β”‚ 79067P:OPRA             false  79067P:OPRA & 79067R:OPRA
   8 β”‚ 79067R:OPRA             false  79067P:OPRA & 79067R:OPRA

The parse_id function stores the β€œtype” character (R/P/B) in a column but it’s not used actually, so you could simplify the function to just return (untyped=id[1:i-2] * '_' * id[i:end],).

Edit: the solution above assigns the same β€œpair” to all related IDs. If it is possible that related IDs have pairs that are different (not missing), you can adapt make_pairs like this:

make_pair(ids, pairs) = coalesce.(pairs, pairs..., join(ids, " & "))
2 Likes

I’m sorry but I can’t reproduce your result.
This is what I get by copying and pasting the code used by you.
Try starting from a clean situation, restarting Julia.

Even if list1 is not sorted, the result appears to be as expected. Or not?

I have checked on a fresh Julia session, and I still get different results. My version of DataFrames may be different from yours which may result in different results

Thank you

I do not understand what this part of the code is doing. Could you please explain?

It’s creating a named tuple. You can execute each part independently to see what it’s doing:

julia> id = "0001P:ABCD";

julia> i = findfirst(':', id)
6

julia> id[[i-1]]
"P"

julia> id[1:i-2]
"0001"

julia> id[i:end]
":ABCD"

julia> (type=id[[i-1]], untyped=id[1:i-2] * '_' * id[i:end])
(type = "P", untyped = "0001_:ABCD")

Some remarks:

  • I used id[[i-1]] instead of id[i-1] to get a string "P" rather than a char 'P' (I needed a string when experimenting with another solution, but in the present solution this type value is not used so it doesn’t matter).

  • * is the concatenation operator for strings:

     julia> id[1:i-2] * '_' * id[i:end]
     "0001_:ABCD"
    
  • The last line is making the named tuple, same syntax as (a=1, b=2).

2 Likes

try this

list1 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324P:CDCL","5666TB:GHLA","789789TSB:ASLT","2324R:CDCL","79067P:OPRA","79067R:OPRA"])
list2 = DataFrame(ID =["0001P:ABCD","0001R:ABCD","2324R:CDCL","5666TB:GHLA"], Pair = ["PairOne","PairOne","Two","Two"])

function pairing(s1,s2)
    if length(s1)==length(s2)
        idx=findall(x->x!=0,cmp.(collect(s1),collect(s2)))
        return Set([s1[idx],s2[idx]])
    else
        return Set([])
    end
end
list = leftjoin(list1,list2, on = :ID)
list.twin=[[i, something(findfirst(x->x==1,[Set(["P","R"])==pairing(list.ID[i],list.ID[j]) for j in 1:8]),i)] for i in 1:8]
list.pair1=[coalesce(list.Pair[t[1]],list.Pair[t[2]])   for t in list.twin ]
list.pair_1=[coalesce(list.pair1[r],join(Set(list.ID[list.twin[r]])," & ")) for r in 1:8 ]
1 Like

Thanks. This works

Thanks for the explanation. I just noticed that it is concatenating some IDs which do not contain P: or R: on another dataset. If you have two IDs that have most characters same except one e.g. 222367_1:B421_GHL and 222367_2:B421_GHL then it creates a pair
222367_1:B421_GHL & 222367_2:B421_GHL

Is there anything I am doing incorrectly?

This is another solution: (your last column is not clear since for example 2324P:CDCL was not in original data but it is marked as true)

# list2 ID and Pair
ref = Dict(list2[!, 1] .=> list2[!, 2])

# rules:
contain_PR(x) = occursin("P:", x) || occursin("R:", x)
change_PR(x) = occursin("P:", x) ? replace(x, "P:"=>"R:") : replace(x, "R:"=>"P:")
concat_them(x) = occursin("P:", x) ? x*change_PR(x) : change_PR(x)*x
function compute_pair_1(x, y; ref = ref)
    if ismissing(y) && contain_PR(x)
        return get(ref, change_PR(x), concat_them(x))
    elseif ismissing(y)
        return x
    else
        return y
    end
end


# apply the function
transform(list_compare, [:ID, :Pair] => ((x,y) -> compute_pair_1.(x, y, ref = ref)) => :Pair_1)
2 Likes

Yes, I was not sure what was desired in this case. Here’s a version that gives special treatment only for P and R:

using DataFrames
list1 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324P:CDCL","2324R:CDCL","5666TB:GHLA","789789TSB:ASLT","79067P:OPRA","79067R:OPRA"])
list2 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324R:CDCL","5666TB:GHLA"], Pair=["PairOne","PairOne","Two","Two"])

l = leftjoin(list1, list2, on=:ID, source=:source)
l.Pair_original = l.source .== "both"

function id_family(id)
    i = findfirst(':', id)
    if id[i-1] in "PR"
        return id[1:i-2] * '_' * id[i:end]
    else
        return id
    end
end

l.family = id_family.(l.ID)

make_pair(ids, pairs) = coalesce.(pairs, pairs..., join(ids, " & "))

select(groupby(l, :family), :ID, :Pair_original,
       [:ID, :Pair] => make_pair => :Pair, keepkeys=false)
2 Likes

Thank you

2324P:CDCL is missing as under Pair, but 2324R:CDCL has Two under Pair. If only one of the related ID has a reported Pair then for the other that one automatically applies. All IDs which have same characters except P: or R: are related.

Your code is not creating all the pairs of P: and R: where both of them are reported as missing under Pair

Make sure that you didn’t miss any code (β€œ.” are important)

using DataFrames
list1 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324P:CDCL","2324R:CDCL","5666TB:GHLA","789789TSB:ASLT","79067P:OPRA","79067R:OPRA"])
list2 = DataFrame(ID =["0001P:ABCD","0001R:ABCD","2324R:CDCL","5666TB:GHLA"], Pair = ["PairOne","PairOne","Two","Two"])
list_compare = leftjoin(list1,list2, on = :ID)
# list2 ID and Pair
ref = Dict(list2[!, 1] .=> list2[!, 2])

# rules:
contain_PR(x) = occursin("P:", x) || occursin("R:", x)
change_PR(x) = occursin("P:", x) ? replace(x, "P:"=>"R:") : replace(x, "R:"=>"P:")
concat_them(x) = occursin("P:", x) ? x*change_PR(x) : change_PR(x)*x
function compute_pair_1(x, y; ref = ref)
    if ismissing(y) && contain_PR(x)
        return get(ref, change_PR(x), concat_them(x))
    elseif ismissing(y)
        return x
    else
        return y
    end
end
function compute_original(x, y; ref = ref)
       if ismissing(y)
            contain_PR(x) && haskey(ref, change_PR(x)) ? true : false
       else
            return true
        end
end

# apply the function
transform(list_compare, 
                 [:ID, :Pair] => ((x,y) -> compute_pair_1.(x, y, ref = ref)) => :Pair_1, 
                 [:ID, :Pair] => ((x,y) -> compute_original.(x, y, ref = ref)) => :original)
8Γ—4 DataFrame
 Row β”‚ ID              Pair     Pair_1                  original 
     β”‚ String          String?  String                  Bool     
─────┼───────────────────────────────────────────────────────────
   1 β”‚ 0001P:ABCD      PairOne  PairOne                     true
   2 β”‚ 0001R:ABCD      PairOne  PairOne                     true
   3 β”‚ 2324R:CDCL      Two      Two                         true
   4 β”‚ 5666TB:GHLA     Two      Two                         true
   5 β”‚ 2324P:CDCL      missing  Two                         true
   6 β”‚ 789789TSB:ASLT  missing  789789TSB:ASLT             false
   7 β”‚ 79067P:OPRA     missing  79067P:OPRA79067R:OPRA     false
   8 β”‚ 79067R:OPRA     missing  79067P:OPRA79067R:OPRA     false