Groupby / reshaping dataframe with unique values

How can I rearrange a dataframe so that values of a particular column that are repeated appear only once and the related data is transposed to columns?

Suppose I have a dataframe given by:

df1 = DataFrame(NAME = ["A1","A1","A1","A2","A2","A3","A3"], CAT = ["FIN","NF","INF","UTL","GT","CP","MP"]) 

I would like to reshape df1 so that it appears like df2, i.e. entries in the column NAME are not repeated.

df2= DataFrame(NAME = ["A1","A2","A3"], CAT1=["FIN","NF","INF"],CAT2=["UTL","GT",""],CAT3=["CP","MP",""])

I have tried to use groupby function, but no luck. How can I rearrange a dataframe?

This should do it

julia> gd = groupby(df1, "NAME");

julia> max_cols = mapreduce(d -> length(unique(d.CAT)), max, gd);

julia> combine(gd) do sdf
       cs = unique(sdf.CAT)
       cs_out = [i <= length(cs) ? cs[i] : "" for i in 1:max_rows]
       colnames_out = [Symbol("CAT", i) for i in 1:max_cols]
       NamedTuple{Tuple(colnames_out)}(Tuple(cs_out))
       end
3×4 DataFrame
 Row │ NAME    CAT1    CAT2    CAT3   
     │ String  String  String  String 
─────┼────────────────────────────────
   1 │ A1      FIN     NF      INF
   2 │ A2      UTL     GT
   3 │ A3      CP      MP

2 Likes

Thanks. Is there a way of rearranging the data so that any data items ending with a particular character are moved to the last column in the dataframe?

Say, I have a dataframe

df_new = DataFrame(NAME = ["A1","A1","A1","A2","A2","A2","A2","A3","A3","A4","A5","A6","A6","A6","A6","A6","A6","A6"],
                CAT = ["FIN","NF","INF","AP","CF","UTL","GT","CP","MP","AP","BE","NF","CF","PP","AC","PE","OP","APA"])
gd = groupby(df_new, "NAME");
max_cols = mapreduce(d -> length(unique(d.CAT)), max, gd)
max_rows = max_cols
df_transform=combine(gd) do sdf
       cs = unique(sdf.CAT)
       cs_out = [i <= length(cs) ? cs[i] : "" for i in 1:max_rows]
       colnames_out = [Symbol("CAT", i) for i in 1:max_cols]
       NamedTuple{Tuple(colnames_out)}(Tuple(cs_out))
       end

Is it possible to move the entries ending with P or F in df_transform further to the right? e.g. in Row 2 AP and CF should be in columns CAT3 / CAT4, and GT and UTL maybe moved to columns CAT1 / CAT2. Similarly, in Row 6 the items NF, CF, PP and OP should be in the right most columns.

That sounds like a difficult problem!

You would have to think of a rule to apply every time you construct the cs_out vector

Thanks. Definitely it is a hard problem. Maybe an easier way could be to sort original data frame by the second column so as to get some sort of order. Then transforming it using the code you provided. This will probably not work for the MWE but it may work in some cases

I am absolutely a julia beginner. I would like to submit you a slightly different proposal

 function addIDX(dfgr)

   df=DataFrame(dfgr)

       df.IDX=["cat"*string(i) for i in 1:nrow(df)]

       return df

  end

   dff=reduce(vcat, [addIDX(gr) for gr in groupby(df1, :NAME)])

   unstack(dff,:IDX,:CAT)

I would like to know if and how it is possible to use a function similar to addIDX inside groupby, thus avoiding the use of the comprehension

Here’s my solution (which sorts alphabetically on the last letter). Not quite there, but is closer (and succinct):

sort!(df_new, [:NAME, :CAT], by=[identity, x->x[end]])
transform!(groupby(df_new, :NAME), :NAME=>collect∘first∘axes=>:row) 
coalesce.(unstack(df_new, :row, :CAT), "")
2 Likes

is it possible to get the same result obtined using the function addIDX and the list comprehension?

May be using some function in a similar way that done in the folliwing

transform!(groupby(df_new, :NAME), :NAME=>collect∘first∘axes=>:row) 

?

Hi Mattriks

could you, please, explain in detail how the function transform works?
the function works on each subframe at each step or row by row?
what exactly is passed to the composite function collect°first°axes?
what exactly outputs?

transform!(groupby(df_new, :NAME), :NAME=>collect∘first∘axes=>:row) 

We have transform(x, src => fun => dest)

  • x is a GroupedDataFrame, so fun will act on the vectors of each SubDataFrame
  • src is :NAME, meaning transform will pass subvectors of df.NAME to fun
    * fun is collect∘first∘axes. This takes in one vector argument, which means it won’t break when given a single sub-vector of df.NAME. It outputs a single Vector. But tbh I don’t really understand what it’s doing. It could probably done in a simpler way. Note also that the output is the same length as the input, which is required for transform, which cannot re-size a data frame.
  • dest is :row. This is a single symbol, which matches that fun outputs a single vector, so this doesn’t break.

many tanks!

I believe that the goal is to obtain a progressive index [1,2, …n] per subgroup.
Something like 1: row (subgroup) that I tried but the nrow function is not accepted in this way.
I wonder how, in a synthetic way (even nesting more functions), a new column of the type [“prefix” 1, “prefix” 2, … “prefix” nrow] can be obtained

Something like

:NAME => (t -> string.(1:length(t)), "_", t) => :row 

should do that easily.

@pdeffebach tanks again!.

in the end I managed to figure out how to do it (there was a small typo: a shifted parenthesis) and this is just what I hoped to be able to do. But now, after understanding (perhaps) the use of the string function (I thought it was only used to convert formats but not to concatenate strings), I am left with a curiosity about how the broadcast string version works. I understand that it works as I imagine for situations like: string (el1, el2, arr1) where el1, el2 are scalars and arr1 a one-dimensional vector. Out of curiosity I tried something like this, but it didn’t work: string (“e1”, [‘a’, ‘b’], 1: 4) while something like this worked: string (“e1”, [‘a’, ‘b’, ‘c’, ‘d’], 1: 4). I am sure that this is an argument already covered and I wonder why a function has not developed that works with inputs of this type: string (1, ‘a’: ‘d’, 1: 3, 1:12) where the larger vector has a size that is a multiple of the size of the smaller ones

Are you coming from R? As this is common in R.

In Julia, broadcasting is explicit with the . and non-scalar arguments are required to have the same length. This is a good thing! Not everyone likes the automatic recycling in R. I’m not a fan of it personally. I think it can lead to bugs where things that should have the same length don’t, but the code doesn’t error when it should.

1 Like

Thank you for the explanation.
I think I understand the rationale behind this choice.

No, I’m practically from the desert.
I know R by name only.
For part of my work I use excel and lately (during the lockdown for covid) I have practiced the M language with Power Query which I really liked.

Welcome! Do not hesitate to ask more questions.

1 Like

Thank you

try this

dft=combine(groupby(df,"NAME"), [:CAT=>(t->sort!(t,by= x->(last(x) in "PF" ? lowercase(last(x)) : last(x))))=>:CAT,:CAT=>(t->string.("CAT",1:length(t)))=>:mm]) 
unstack(dft,:mm,:CAT)