Manipulation of dataframe rows upon repeated values in a given column

I have the following dataframe (in fact, part of a much larger dataframe, where the Key’s repeat arbitrarily, sometimes even more than twice):

df_ini = DataFrame(
Key = [170, 447, 447, 699, 963, 963, 963, 756], 
Type = ["No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"],
Situation = ["Closed", "Pending", "Surpassed", "Surpassed", "Pending", "Surpassed", "Faulty", "Surpassed"]

I would like to manipulate df_ini so as to obtain the transformed dataframe:

df_fin = DataFrame(
Key = [170, 447, 699, 963, 756],
Type = ["No", "No", "Yes", "Yes", "No"],
Situation = ["Closed", "Pending, Surpassed", "Surpassed", "Pending, Surpassed, Faulty", "Surpassed"]

That is, the rows where the column Key are equal and have the corresponding column Situation different must be rendered into a single row, such that all the corresponding columns are the same, except for the column Situation, which must have a join of the String values in the original cells of the β€œunmerged” rows (separated by commas).

Thanks in advance.

I think you want

julia> combine(groupby(df_ini, :Key), :Type => first, :Situation => (x -> join(x, ", ")))
5Γ—3 DataFrame
 Row β”‚ Key    Type_first  Situation_function         
     β”‚ Int64  String      String                     
   1 β”‚   170  No          Closed
   2 β”‚   447  No          Pending, Surpassed
   3 β”‚   699  Yes         Surpassed
   4 β”‚   963  Yes         Pending, Surpassed, Faulty
   5 β”‚   756  No          Surpassed

If want to keep the column names you can set them like :Type => first => :Type.

As an aside, Type is a defined variable in every Julia session

julia> Float64 isa Type

so I wouldn’t recommend using it as a variable name.

1 Like

@nilshg Thank you for your prompt reply and solution, which, of course, worked for me as well.

Could you perhaps, however briefly, explain the logic of the command: first the groupby, then the combine operations? I will sure read about them at any rate.

Glad it helped!

combine (and its friend transform) together with groupby are probably two of the most useful functionalities of the DataFrames package. When you groupby a DataFrame, you can think of this as segmenting the DataFrame into separate Sub-DataFrames, the columns of which you can then apply functions to using combine or transform.

So in the example above, :Type => first means β€œgo through each group in my DataFrame, take the Type column for that group, and apply the first function (which just returns the first value)”.

Similarly, for :Situation, we want to take all the values within a group and join them together - for this we need the join function, but as that takes two arguments, we apply it as an anonymous function (x -> join(x, ", "), where x is the vector of values in the group.

This only scratches the surface, I highly recommend you read the full explanation by one of the main contributors to DataFrames here:


Thanks again so much @nilshg. I am really excited about my journey into Julia and how friendly the community is as a whole!



Here’s another link, probably more useful as a reference since it has fewer examples: Split-apply-combine Β· DataFrames.jl

1 Like