Hi,
I wonder if there is an efficient way of doing the following in Julia DataFrames.
Given a DataFrame
df = DataFrame(col1 = ["a", "a", "b", "b", "c", "d"],
col2 = ["p", "q", "q", "r", "s", "t"])
I want to create a new column col3 as follows:
df_new = DataFrame(col1 = ["a", "a", "b", "b", "c", "d"],
col2 = ["p", "q", "q", "r", "s", "t"],
col3 = ["br", "br", "br", "br", "cs", "dt"])
The values of col3 are computed as follows:
- Rows are grouped so that two rows end up in the same group if they share a value in col1 or in col2, either directly or through a chain of other rows.
- Within each group, every row gets the value :col1 * :col2 of the group's last row.
In the above example, if we group the rows by col1 only and apply the second rule, col3 is given by
col3 = ["aq", "aq", "br", "br", "cs", "dt"]
whereas grouping by col2 only leads to
col3 = ["ap", "bq", "bq", "br", "cs", "dt"]
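For reference, the two single-column versions above can be written with groupby and transform. This is only a sketch of what I mean by "group and take the last row"; the names by_col1 and by_col2 are just illustrative.

```julia
using DataFrames

df = DataFrame(col1 = ["a", "a", "b", "b", "c", "d"],
               col2 = ["p", "q", "q", "r", "s", "t"])

# Concatenate col1 and col2 of the last row of each group.
# The scalar result is repeated for every row of the group.
last_concat(a, b) = last(a) * last(b)

# Grouping by col1 only
by_col1 = transform(groupby(df, :col1), [:col1, :col2] => last_concat => :col3)

# Grouping by col2 only
by_col2 = transform(groupby(df, :col2), [:col1, :col2] => last_concat => :col3)
```

Neither of these merges the a-rows with the b-rows, which is the part I am stuck on.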
I want to take the union of the two relations above (i.e. rows related through col1 or col2), including transitivity. Schematically, we get
(1st ~ 2nd and 3rd ~ 4th) + (2nd ~ 3rd) = (1st ~ 2nd ~ 3rd ~ 4th),
so that col3 becomes
col3 = ["br", "br", "br", "br", "cs", "dt"]
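To make the intended result concrete, here is a naive sketch that treats the problem as finding connected components with a hand-written union-find. The helpers find_root! and add_col3 are hypothetical names I made up, and I am assuming the column names col1 and col2 and string-valued columns; I do not know whether this is the efficient or idiomatic way, which is really my question.

```julia
using DataFrames

# Hypothetical helper: root of row i in the union-find forest (with path halving).
function find_root!(parent::Vector{Int}, i::Int)
    while parent[i] != i
        parent[i] = parent[parent[i]]
        i = parent[i]
    end
    return i
end

# Hypothetical helper: rows are connected if they share a value in col1 or col2;
# every row gets col1 * col2 of the last row of its connected component.
function add_col3(df::AbstractDataFrame)
    n = nrow(df)
    parent = collect(1:n)
    for col in (:col1, :col2)
        first_seen = Dict{eltype(df[!, col]), Int}()
        for (i, v) in enumerate(df[!, col])
            j = get!(first_seen, v, i)   # first row that had this value
            ri, rj = find_root!(parent, i), find_root!(parent, j)
            ri != rj && (parent[ri] = rj)  # merge the two components
        end
    end
    roots = [find_root!(parent, i) for i in 1:n]
    last_row = Dict{Int, Int}()
    for (i, r) in enumerate(roots)
        last_row[r] = i                  # later rows overwrite earlier ones
    end
    out = copy(df)
    out.col3 = [df.col1[last_row[r]] * df.col2[last_row[r]] for r in roots]
    return out
end

df = DataFrame(col1 = ["a", "a", "b", "b", "c", "d"],
               col2 = ["p", "q", "q", "r", "s", "t"])
df_new = add_col3(df)
```

On the example above this gives the col3 I am after, but it bypasses groupby entirely, so I would be glad to hear of a cleaner approach.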