Hi, I’m working on a project where sometimes I get data coming in with multiple values in one cell. Part of my feature engineering is to calculate the counts per user, and I’m looking for an effective way to explode out cells with multiple values into rows of their own, without double counting on the other rows.
An example will make things much clearer:
initial = DataFrame(:day=>["monday","monday","monday"],:name=>["bob","mike","phillis"],:temp=>["medium","high","low, medium"],:gender=>["male","male","female"])
transformed = DataFrame(:day=>["monday","monday","monday","duplicate"],:name=>["bob","mike","phillis","phillis"],:temp=>["medium","high","low","medium"],:gender=>["male","male","female","duplicate"])
Notice how the OTHER row values, the ones without multiple entries, are replaced with "“duplicate” in the newly exploded row. The reason I need this is because I’m calculating counts per user (pivot tables), and don’t want to double count for everything else.
Thanks!