I have two lists that I need to compare. The second list contains names of groups for the first list:
using DataFrames
list1 = DataFrame(ID=["0001P:ABCD","0001R:ABCD","2324P:CDCL","2324R:CDCL","5666TB:GHLA","789789TSB:ASLT","79067P:OPRA","79067R:OPRA"])
list2 = DataFrame(ID =["0001P:ABCD","0001R:ABCD","2324R:CDCL","5666TB:GHLA"], Pair = ["PairOne","PairOne","Two","Two"])
list_compare = leftjoin(list1,list2, on = :ID)
I use leftjoin
to produce the group / pair name in the first list.
julia> list_compare
8Γ2 DataFrame
Row β ID Pair
β String String?
ββββββΌβββββββββββββββββββββββββ
1 β 0001P:ABCD PairOne
2 β 0001R:ABCD PairOne
3 β 2324R:CDCL Two
4 β 5666TB:GHLA Two
5 β 2324P:CDCL missing
6 β 789789TSB:ASLT missing
7 β 79067P:OPRA missing
8 β 79067R:OPRA missing
From here, I want to produce list_comapre2.
There is one complexity. There are some IDs (in the first list) which are related-they have same characters except that one of them contains P:
and other R:
in the ID. Examples are rows 1 & 2, 3 & 5 and 6 & 7. list2
may specify group name for one of these but not necessarily for other e.g., ID 2324R:CDCL is part of Pair
Two
but itβs related ID 2323P:CDCL is not listed as under this group. I want to add another column Pair_1
to the combined dataframe that adds group name for the ID missing from list 2, so that I end up with list2_compare2
list_compare2 = list_compare
list_compare2.Pair_1 = ["PairOne","PairOne","Two","Two","Two","789789TSB:ASLT","79067P:OPRA&79067R:OPRA","79067P:OPRA&79067R:OPRA"]
list_compare2.Pair_original = ["True","True","True","True","True","False","False","False"]
julia> list_compare2
8Γ4 DataFrame
Row β ID Pair Pair_1 Pair_original
β String String? String String
ββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
1 β 0001P:ABCD PairOne PairOne True
2 β 0001R:ABCD PairOne PairOne True
3 β 2324R:CDCL Two Two True
4 β 5666TB:GHLA Two Two True
5 β 2324P:CDCL missing Two True
6 β 789789TSB:ASLT missing 789789TSB:ASLT False
7 β 79067P:OPRA missing 79067P:OPRA&79067R:OPRA False
8 β 79067R:OPRA missing 79067P:OPRA&79067R:OPRA False
Summary:
- for all IDs, I want to add an entry in Pair_1. if an ID containing P: or R: has a reported Pair, but not for both of them then I would like to add create one for which it is missing as explained above
- If an ID contains P: or R: and both of these are reported as missing in
Pair
column then I would like to create an entry for them by concatenating their IDs (e.g. rows 7 and 8 above). - if an ID does not contain either P: or R: and Pair is reported as missing for it then its ID is the entry under Pair_1.
- Final column set outs whether the pair existed in the original list.
I tried the following to get the relevant IDs, but I am confused on next steps.
p_list = filter(s->occursin(r"P:",s),list1.ID)
r_list = filter(s->occursin(r"R:",s),list1.ID)
Any views on how I can produce list_compare2 from list1 and list2?