How to explode a DataFrame cell containing multiple values into separate rows?

Hi, I’m working on a project where sometimes I get data coming in with multiple values in one cell. Part of my feature engineering is to calculate the counts per user, and I’m looking for an effective way to explode out cells with multiple values into rows of their own, without double counting on the other rows.

An example will make things much clearer:

initial = DataFrame(:day=>["monday","monday","monday"],:name=>["bob","mike","phillis"],:temp=>["medium","high","low, medium"],:gender=>["male","male","female"])

Screenshot from 2020-07-07 09-23-57

transformed = DataFrame(:day=>["monday","monday","monday","duplicate"],:name=>["bob","mike","phillis","phillis"],:temp=>["medium","high","low","medium"],:gender=>["male","male","female","duplicate"])

Screenshot from 2020-07-07 09-24-55

Notice how the OTHER row values, the ones without multiple entries, are replaced with "“duplicate” in the newly exploded row. The reason I need this is because I’m calculating counts per user (pivot tables), and don’t want to double count for everything else.


You want to use flatten

julia> function make_longer(initial)
       initial[!, "temp2"] = split.(initial.temp)
       flatten(initial, "temp2")

Thank you, what would be the quickest way to scan the dataframe and check if any columns indeed have a cell with more than one value? I need to generalize this so that I can create a preprocessing function that scans all columns of the dataframe and splits/flattens appropriately if needed, otherwise leaves it alone.

Probably checking for commas or some other delimiter. That might be hard to generalize, of course.

If you want more performance you can also do a filter for rows where the value can be split, then do flatten on that subset and append! the new data frame onto the old one.

Ok thank you. And then the last part is, the way the implementation is now, I’ll inevitably double count every other column that didn’t need to be exploded out. Is there anyway that when calling the flatten function, I can replace all the values of the newly created row that AREN’T the exploded column with a string value of “duplicate”. Like this:

Screenshot from 2020-07-07 09-24-55

I don’t think so. You would have to do that replace manually.

If the goal is really just to have counts per user and not get the data frame in the shape you requested, you can do

julia> initial.temp2 = split.(initial.temp)
3-element Array{Array{SubString{String},1},1}:
 ["low,", "medium"]

julia> initial
3×5 DataFrame
│ Row │ day    │ name    │ temp        │ gender │ temp2              │
│     │ String │ String  │ String      │ String │ Array…             │
│ 1   │ monday │ bob     │ medium      │ male   │ ["medium"]         │
│ 2   │ monday │ mike    │ high        │ male   │ ["high"]           │
│ 3   │ monday │ phillis │ low, medium │ female │ ["low,", "medium"] │

julia> initial.count = length.(initial.temp2)
3-element Array{Int64,1}:

julia> count = sum(initial.count)

ok thank you.

whats the quickest way to strip the commas off of the split method applied on the dataframe?

Screenshot from 2020-07-07 12-22-37

I dont want male and male, lol

strip, replace using regular expressions`. There are many options, see the manual here.