How to explode a DataFrame cell containing multiple values into separate rows?

Julia1 · July 7, 2020, 1:26pm

Hi, I’m working on a project where sometimes I get data coming in with multiple values in one cell. Part of my feature engineering is to calculate the counts per user, and I’m looking for an effective way to explode out cells with multiple values into rows of their own, without double counting on the other rows.

An example will make things much clearer:

initial = DataFrame(:day=>["monday","monday","monday"],:name=>["bob","mike","phillis"],:temp=>["medium","high","low, medium"],:gender=>["male","male","female"])

Screenshot from 2020-07-07 09-23-57

transformed = DataFrame(:day=>["monday","monday","monday","duplicate"],:name=>["bob","mike","phillis","phillis"],:temp=>["medium","high","low","medium"],:gender=>["male","male","female","duplicate"])

Screenshot from 2020-07-07 09-24-55

Notice how the OTHER row values, the ones without multiple entries, are replaced with "“duplicate” in the newly exploded row. The reason I need this is because I’m calculating counts per user (pivot tables), and don’t want to double count for everything else.

Thanks!

pdeffebach · July 7, 2020, 1:35pm

You want to use flatten

julia> function make_longer(initial)
       initial[!, "temp2"] = split.(initial.temp)
       flatten(initial, "temp2")
       end

Julia1 · July 7, 2020, 1:53pm

Thank you, what would be the quickest way to scan the dataframe and check if any columns indeed have a cell with more than one value? I need to generalize this so that I can create a preprocessing function that scans all columns of the dataframe and splits/flattens appropriately if needed, otherwise leaves it alone.

pdeffebach · July 7, 2020, 2:00pm

Probably checking for commas or some other delimiter. That might be hard to generalize, of course.

If you want more performance you can also do a filter for rows where the value can be split, then do flatten on that subset and append! the new data frame onto the old one.

Julia1 · July 7, 2020, 2:48pm

Ok thank you. And then the last part is, the way the implementation is now, I’ll inevitably double count every other column that didn’t need to be exploded out. Is there anyway that when calling the flatten function, I can replace all the values of the newly created row that AREN’T the exploded column with a string value of “duplicate”. Like this:

Screenshot from 2020-07-07 09-24-55

pdeffebach · July 7, 2020, 3:16pm

I don’t think so. You would have to do that replace manually.

If the goal is really just to have counts per user and not get the data frame in the shape you requested, you can do

julia> initial.temp2 = split.(initial.temp)
3-element Array{Array{SubString{String},1},1}:
 ["medium"]
 ["high"]
 ["low,", "medium"]

julia> initial
3×5 DataFrame
│ Row │ day    │ name    │ temp        │ gender │ temp2              │
│     │ String │ String  │ String      │ String │ Array…             │
├─────┼────────┼─────────┼─────────────┼────────┼────────────────────┤
│ 1   │ monday │ bob     │ medium      │ male   │ ["medium"]         │
│ 2   │ monday │ mike    │ high        │ male   │ ["high"]           │
│ 3   │ monday │ phillis │ low, medium │ female │ ["low,", "medium"] │

julia> initial.count = length.(initial.temp2)
3-element Array{Int64,1}:
 1
 1
 2

julia> count = sum(initial.count)
4

Julia1 · July 7, 2020, 4:23pm

ok thank you.

whats the quickest way to strip the commas off of the split method applied on the dataframe?

Screenshot from 2020-07-07 12-22-37

I dont want male and male, lol

pdeffebach · July 7, 2020, 4:25pm

strip, replace using regular expressions`. There are many options, see the manual here.

Topic		Replies	Views
`explode`-like function in DataFrames Data dataframes	3	189	May 10, 2024
Split dataframe row into multiple rows Data dataframes	8	1787	May 1, 2022
Is there any function in julia equivalent to pandas explode function Data dataframes	5	827	January 24, 2022
Expand row to multiple rows Data	11	1284	December 15, 2020
Separating a column into a variable number of possible columns Data dataframes	22	1385	October 3, 2021

How to explode a DataFrame cell containing multiple values into separate rows?

Related topics