Manipulation of dataframe rows upon repeated values in a given column

mocalvao · April 15, 2021, 6:49pm

I have the following dataframe (in fact, part of a much larger dataframe, where the Keys repeat arbitrarily, sometimes even more than twice):

using DataFrames

df_ini = DataFrame(
    Key = [170, 447, 447, 699, 963, 963, 963, 756],
    Type = ["No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"],
    Situation = ["Closed", "Pending", "Surpassed", "Surpassed", "Pending", "Surpassed", "Faulty", "Surpassed"]
)

I would like to manipulate df_ini so as to obtain the transformed dataframe:

df_fin = DataFrame(
    Key = [170, 447, 699, 963, 756],
    Type = ["No", "No", "Yes", "Yes", "No"],
    Situation = ["Closed", "Pending, Surpassed", "Surpassed", "Pending, Surpassed, Faulty", "Surpassed"]
)

That is, rows that share the same Key but differ in Situation must be merged into a single row: all the other columns stay the same, and Situation becomes the join of the String values from the original “unmerged” rows (separated by commas).

Thanks in advance.

nilshg · April 15, 2021, 7:23pm

I think you want

julia> combine(groupby(df_ini, :Key), :Type => first, :Situation => (x -> join(x, ", ")))
5×3 DataFrame
 Row │ Key    Type_first  Situation_function         
     │ Int64  String      String                     
─────┼───────────────────────────────────────────────
   1 │   170  No          Closed
   2 │   447  No          Pending, Surpassed
   3 │   699  Yes         Surpassed
   4 │   963  Yes         Pending, Surpassed, Faulty
   5 │   756  No          Surpassed

If you want to keep the column names, you can set them like :Type => first => :Type.
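As a minimal sketch of that renaming (the same call as above, with a target name added to each pair; df_fin is just the name used in the question), something like this should give back the original column names:

```julia
using DataFrames

df_ini = DataFrame(
    Key = [170, 447, 447, 699, 963, 963, 963, 756],
    Type = ["No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"],
    Situation = ["Closed", "Pending", "Surpassed", "Surpassed",
                 "Pending", "Surpassed", "Faulty", "Surpassed"],
)

# source column => function => target column name
df_fin = combine(
    groupby(df_ini, :Key),
    :Type => first => :Type,
    :Situation => (x -> join(x, ", ")) => :Situation,
)
```

The anonymous function is wrapped in parentheses so that the trailing => :Situation is not parsed as part of the function body.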

As an aside, Type is a name that is already defined in every Julia session:

julia> Float64 isa Type
true

so I wouldn’t recommend using it as a variable name.

mocalvao · April 15, 2021, 7:29pm

@nilshg Thank you for your prompt reply and solution, which, of course, worked for me as well.

Could you perhaps, however briefly, explain the logic of the command: first the groupby, then the combine operations? I will surely read about them at any rate.

nilshg · April 15, 2021, 7:37pm

Glad it helped!

combine (and its friend transform) together with groupby are probably two of the most useful functionalities of the DataFrames package. When you groupby a DataFrame, you can think of this as splitting it into separate sub-DataFrames, one per distinct value of the grouping column; you can then apply functions to their columns using combine or transform.

So in the example above, :Type => first means “go through each group in my DataFrame, take the Type column for that group, and apply the first function (which just returns the first value)”.

Similarly, for :Situation, we want to take all the values within a group and join them together. For this we need the join function, but as that takes two arguments, we apply it as an anonymous function, x -> join(x, ", "), where x is the vector of values in the group.
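To make that concrete, here is a rough sketch that does by hand, for a single group, what combine does for every group. The variable names (gdf, g, first_type, joined) are only for illustration:

```julia
using DataFrames

df_ini = DataFrame(
    Key = [170, 447, 447, 699, 963, 963, 963, 756],
    Type = ["No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"],
    Situation = ["Closed", "Pending", "Surpassed", "Surpassed",
                 "Pending", "Surpassed", "Faulty", "Surpassed"],
)

# Split: one group per distinct Key
gdf = groupby(df_ini, :Key)

# Pick out the group whose Key is 963 (a view onto three rows of df_ini)
g = gdf[(Key = 963,)]

# Apply: this is what :Type => first does for this group
first_type = first(g.Type)

# Apply: this is what :Situation => (x -> join(x, ", ")) does for this group
joined = join(g.Situation, ", ")
```

combine then repeats these two steps for each group and stacks the results into one row per Key.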

This only scratches the surface; I highly recommend you read the full explanation by one of the main contributors to DataFrames here:

mocalvao · April 15, 2021, 7:43pm

Thanks again so much @nilshg. I am really excited about my journey into Julia and how friendly the community is as a whole!

Cheers

Jeff_Emanuel · April 15, 2021, 11:23pm

Here’s another link, probably more useful as a reference since it has fewer examples: Split-apply-combine · DataFrames.jl
