Groupby / reshaping dataframe with unique values

Smith · December 2, 2020, 8:54pm

How can I rearrange a dataframe so that values of a particular column that are repeated appear only once and the related data is transposed to columns?

Suppose I have a dataframe given by:

df1 = DataFrame(NAME = ["A1","A1","A1","A2","A2","A3","A3"], CAT = ["FIN","NF","INF","UTL","GT","CP","MP"])

I would like to reshape df1 so that it appears like df2, i.e. entries in the column NAME are not repeated.

df2 = DataFrame(NAME = ["A1","A2","A3"], CAT1 = ["FIN","UTL","CP"], CAT2 = ["NF","GT","MP"], CAT3 = ["INF","",""])

I have tried to use the groupby function, but with no luck. How can I rearrange the dataframe this way?

pdeffebach · December 2, 2020, 10:01pm

This should do it

julia> gd = groupby(df1, "NAME");

julia> max_cols = mapreduce(d -> length(unique(d.CAT)), max, gd);

julia> combine(gd) do sdf
       cs = unique(sdf.CAT)
       cs_out = [i <= length(cs) ? cs[i] : "" for i in 1:max_cols]
       colnames_out = [Symbol("CAT", i) for i in 1:max_cols]
       NamedTuple{Tuple(colnames_out)}(Tuple(cs_out))
       end
3×4 DataFrame
 Row │ NAME    CAT1    CAT2    CAT3   
     │ String  String  String  String 
─────┼────────────────────────────────
   1 │ A1      FIN     NF      INF
   2 │ A2      UTL     GT
   3 │ A3      CP      MP
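
The padding and naming steps inside the do block can be tried on plain vectors, without DataFrames. A minimal sketch in Base Julia, where pad_to is a hypothetical helper name and the group values are copied from df1:

```julia
# Hypothetical helper (not from the thread): pad a vector with "" up to length n.
pad_to(cs, n) = [i <= length(cs) ? cs[i] : "" for i in 1:n]

max_cols = 3                                   # widest group in df1 (A1 has three values)
cs_out = pad_to(["UTL", "GT"], max_cols)       # the A2 group
colnames_out = [Symbol("CAT", i) for i in 1:max_cols]

# Same construction as in the do block: one named field per output column.
row = NamedTuple{Tuple(colnames_out)}(Tuple(cs_out))
```

Here row.CAT1 is "UTL" and row.CAT3 is the empty string; combine then stacks one such named tuple per group into the result.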

Smith · December 3, 2020, 1:14am

Thanks. Is there a way of rearranging the data so that any data items ending with a particular character are moved to the last column in the dataframe?

Say, I have a dataframe

df_new = DataFrame(NAME = ["A1","A1","A1","A2","A2","A2","A2","A3","A3","A4","A5","A6","A6","A6","A6","A6","A6","A6"],
                CAT = ["FIN","NF","INF","AP","CF","UTL","GT","CP","MP","AP","BE","NF","CF","PP","AC","PE","OP","APA"])
gd = groupby(df_new, "NAME");
max_cols = mapreduce(d -> length(unique(d.CAT)), max, gd)
max_rows = max_cols
df_transform=combine(gd) do sdf
       cs = unique(sdf.CAT)
       cs_out = [i <= length(cs) ? cs[i] : "" for i in 1:max_rows]
       colnames_out = [Symbol("CAT", i) for i in 1:max_cols]
       NamedTuple{Tuple(colnames_out)}(Tuple(cs_out))
       end

Is it possible to move the entries ending with P or F in df_transform further to the right? For example, in row 2, AP and CF should be in columns CAT3 / CAT4, and GT and UTL may be moved to columns CAT1 / CAT2. Similarly, in row 6, the items NF, CF, PP and OP should be in the rightmost columns.

pdeffebach · December 3, 2020, 1:18am

That sounds like a difficult problem!

You would have to think of a rule to apply every time you construct the cs_out vector

Smith · December 3, 2020, 1:52am

Thanks. It is definitely a hard problem. Maybe an easier way would be to sort the original data frame by the second column to get some sort of order, and then transform it using the code you provided. This will probably not work for the MWE, but it may work in some cases.

rocco_sprmnt21 · December 9, 2020, 9:58pm

I am an absolute Julia beginner. I would like to submit a slightly different proposal:

function addIDX(dfgr)
    df = DataFrame(dfgr)
    df.IDX = ["cat" * string(i) for i in 1:nrow(df)]
    return df
end

dff = reduce(vcat, [addIDX(gr) for gr in groupby(df1, :NAME)])

unstack(dff, :IDX, :CAT)

I would like to know if and how it is possible to use a function similar to addIDX inside groupby, thus avoiding the comprehension.

Mattriks · December 10, 2020, 7:49am

Here’s my solution (which sorts alphabetically on the last letter). Not quite there, but is closer (and succinct):

sort!(df_new, [:NAME, :CAT], by=[identity, x->x[end]])
transform!(groupby(df_new, :NAME), :NAME=>collect∘first∘axes=>:row) 
coalesce.(unstack(df_new, :row, :CAT), "")

rocco_sprmnt21 · December 10, 2020, 3:08pm

Is it possible to get the same result as the one obtained using the function addIDX and the list comprehension, maybe with a function used in a way similar to the following?

transform!(groupby(df_new, :NAME), :NAME=>collect∘first∘axes=>:row)

rocco_sprmnt21 · December 10, 2020, 3:22pm

Hi Mattriks,

Could you please explain in detail how the transform function works?
Does the function work on each subframe as a whole, or row by row?
What exactly is passed to the composite function collect∘first∘axes?
What exactly does it output?

pdeffebach · December 10, 2020, 3:47pm

transform!(groupby(df_new, :NAME), :NAME=>collect∘first∘axes=>:row)

We have transform(x, src => fun => dest)

* x is a GroupedDataFrame, so fun will act on the vectors of each SubDataFrame.
* src is :NAME, meaning transform will pass subvectors of df_new.NAME to fun.
* fun is collect∘first∘axes. This takes in one vector argument, which means it won’t break when given a single sub-vector of df_new.NAME. It outputs a single Vector. But tbh I don’t really understand what it’s doing. It could probably be done in a simpler way. Note also that the output is the same length as the input, which is required for transform, which cannot re-size a data frame.
* dest is :row. This is a single symbol, which matches the fact that fun outputs a single vector, so this doesn’t break.
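
To see what collect∘first∘axes does, it can be applied step by step to an ordinary vector. A small sketch in Base Julia (the vector v is made up for illustration and stands in for one group's sub-vector):

```julia
v = ["A1", "A1", "A1"]          # stand-in for one group's NAME sub-vector

step1 = axes(v)                 # tuple with one index range per dimension
step2 = first(step1)            # the index range of the first (and only) dimension
step3 = collect(step2)          # that range materialised as a Vector{Int}

row_index = (collect ∘ first ∘ axes)(v)   # the same three steps, composed

# A shorter spelling of the same per-group counter:
simpler = collect(1:length(v))
```

So the function ignores the values themselves and returns 1, 2, and so on up to the group length, which gives each row its position within its group.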

rocco_sprmnt21 · December 10, 2020, 5:07pm

Many thanks!

I believe that the goal is to obtain a progressive index [1, 2, …, n] per subgroup.
I tried something like 1:nrow(subgroup), but the nrow function is not accepted in this way.
I wonder how a new column of the form ["prefix1", "prefix2", …, "prefixn"], with n the number of rows, can be obtained concisely (even by nesting several functions).

pdeffebach · December 10, 2020, 5:10pm

Something like

:NAME => (t -> string.(1:length(t)), "_", t) => :row

should do that easily.
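
A later reply in this thread notes that the expression above has a shifted parenthesis. Assuming the intent was to broadcast string over the index, the separator and the values, the anonymous function can be checked on a plain vector (label and cat_label are hypothetical names):

```julia
# Closing parenthesis moved so that all three arguments go to string.()
label = t -> string.(1:length(t), "_", t)

labels = label(["A1", "A1", "A1"])        # one group's NAME values

# Variant producing "CAT1", "CAT2", ... regardless of the values
cat_label = t -> string.("CAT", 1:length(t))
cat_labels = cat_label(["FIN", "NF", "INF"])
```

Inside transform this would be written as :NAME => (t -> string.(1:length(t), "_", t)) => :row.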

rocco_sprmnt21 · December 11, 2020, 8:49pm

@pdeffebach thanks again!

In the end I managed to figure out how to do it (there was a small typo: a shifted parenthesis), and this is just what I hoped to be able to do. But now, after understanding (perhaps) the use of the string function (I thought it was only used to convert formats, not to concatenate strings), I am left curious about how the broadcast version of string works.

I understand that it works as I imagine for situations like string.(el1, el2, arr1), where el1 and el2 are scalars and arr1 is a one-dimensional vector. Out of curiosity I tried something like string.("e1", ['a', 'b'], 1:4), but it didn’t work, while string.("e1", ['a', 'b', 'c', 'd'], 1:4) worked. I am sure this topic has already been covered, and I wonder why no function has been developed that works with inputs such as string.(1, 'a':'d', 1:3, 1:12), where the larger vector has a size that is a multiple of the sizes of the smaller ones.

pdeffebach · December 11, 2020, 9:20pm

Are you coming from R? This kind of recycling is common in R.

In Julia, broadcasting is explicit with the . and non-scalar arguments are required to have the same length. This is a good thing! Not everyone likes the automatic recycling in R. I’m not a fan of it personally. I think it can lead to bugs where things that should have the same length don’t, but the code doesn’t error when it should.
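
The length rule can be tried directly. A short sketch in Base Julia, with inputs modelled on the ones from the question above (the variable names are only for illustration):

```julia
# Scalars (including strings) are repeated; arrays must agree in length.
ok = string.("e1", ['a', 'b', 'c', 'd'], 1:4)

# Mismatched lengths (2 vs 4) raise an error instead of being recycled.
err = try
    string.("e1", ['a', 'b'], 1:4)
    nothing
catch e
    e
end

# R-style recycling has to be requested explicitly, for example with repeat.
recycled = string.("e1", repeat(['a', 'b'], 2), 1:4)
```

One exception worth knowing: an array of length 1 is expanded like a scalar, so only lengths greater than 1 have to match.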

rocco_sprmnt21 · December 11, 2020, 10:36pm

Thank you for the explanation.
I think I understand the rationale behind this choice.

No, I’m practically from the desert.
I know R by name only.
For part of my work I use Excel, and lately (during the COVID lockdown) I have practiced the M language with Power Query, which I really liked.

pdeffebach · December 11, 2020, 11:21pm

Welcome! Do not hesitate to ask more questions.

Smith · December 18, 2020, 8:23pm

Thank you

rocco_sprmnt21 · December 19, 2020, 11:22pm

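Try this:

dft = combine(groupby(df_new, "NAME"),
              # sort each group by last character; P and F are lowercased so they sort after the uppercase letters
              [:CAT => (t -> sort!(t, by = x -> (last(x) in "PF" ? lowercase(last(x)) : last(x)))) => :CAT,
               # label the positions within each group CAT1, CAT2, ...
               :CAT => (t -> string.("CAT", 1:length(t))) => :mm])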
unstack(dft,:mm,:CAT)
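
The sort key in this last solution relies on lowercase letters sorting after uppercase ones, so entries ending in P or F are pushed to the end of each group. A sketch on a plain vector, using the A2 group from df_new (key is a hypothetical name for the anonymous function):

```julia
# Sort key from the post: the last character, lowercased when it is P or F.
key = x -> (last(x) in "PF" ? lowercase(last(x)) : last(x))

a2 = ["AP", "CF", "UTL", "GT"]

by_special_key = sort(a2, by = key)     # keys: 'p', 'f', 'L', 'T'
by_last_letter = sort(a2, by = last)    # keys: 'P', 'F', 'L', 'T'
```

With the plain last-letter key the P and F entries stay mixed in with the others; with the modified key they end up in the rightmost positions.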
