Pivot a dataframe to wide format with values in multiple columns

Rajesh_Talluri · September 1, 2020, 7:02pm

Hi, I am trying to perform a long to wide transformation of a data frame in Julia. The original data frame I have is:

wide = DataFrame(x = 1:12,
       a  = 2:13,
       b  = 3:14,
       val1  = randn(12),
       val2  = randn(12),
       cname = repeat(["c", "d"], inner =6)
       )

12×6 DataFrame
│ Row │ x     │ a     │ b     │ val1      │ val2      │ cname  │
│     │ Int64 │ Int64 │ Int64 │ Float64   │ Float64   │ String │
├─────┼───────┼───────┼───────┼───────────┼───────────┼────────┤
│ 1   │ 1     │ 2     │ 3     │ 1.51014   │ -1.18548  │ c      │
│ 2   │ 2     │ 3     │ 4     │ 0.0845411 │ -0.370083 │ c      │
│ 3   │ 3     │ 4     │ 5     │ 0.826283  │ -1.00423  │ c      │
│ 4   │ 4     │ 5     │ 6     │ -0.53175  │ -1.16659  │ c      │
│ 5   │ 5     │ 6     │ 7     │ -1.77975  │ 0.336333  │ c      │
│ 6   │ 6     │ 7     │ 8     │ 0.632577  │ 0.236621  │ c      │
│ 7   │ 7     │ 8     │ 9     │ -0.681532 │ 1.14869   │ d      │
│ 8   │ 8     │ 9     │ 10    │ -0.775619 │ 0.393475  │ d      │
│ 9   │ 9     │ 10    │ 11    │ -0.533034 │ 0.059624  │ d      │
│ 10  │ 10    │ 11    │ 12    │ 0.496152  │ -1.23507  │ d      │
│ 11  │ 11    │ 12    │ 13    │ 0.834099  │ 2.12115   │ d      │
│ 12  │ 12    │ 13    │ 14    │ 0.532357  │ -0.369267 │ d      │

I am trying to mimic the pivot_wider function in R:

wide %>% pivot_wider(names_from = cname, values_from = c(val1,val2))

===  ===  ===  ==========  ==========  ==========  ==========
  x    a    b      val1_c      val1_d      val2_c      val2_d
===  ===  ===  ==========  ==========  ==========  ==========
  1    2    3   1.0174232          NA  -0.6611959          NA
  2    3    4   0.6590795          NA  -2.0954505          NA
  3    4    5   1.2939581          NA   1.6350356          NA
  4    5    6  -1.9395356          NA   0.7813238          NA
  5    6    7   0.3558087          NA   0.9789414          NA
  6    7    8   0.9859100          NA  -0.9803336          NA
  7    8    9          NA   0.4949224          NA  -0.0659333
  8    9   10          NA   0.5024755          NA  -0.2317832
  9   10   11          NA   1.6926897          NA  -0.3840687
 10   11   12          NA  -0.4324705          NA  -0.0901276
 11   12   13          NA  -0.6415260          NA   0.0014151
 12   13   14          NA   1.2406868          NA  -2.1959740
===  ===  ===  ==========  ==========  ==========  ==========

unstack seems similar but I could not get it to work.

I was trying

unstack(wide, [:x, :a,:b], :cname, [:val1,:val2])

Could someone please help me with anything that I am missing here?

jmboehm · September 1, 2020, 9:16pm

There may be a way to do it all at once, but unstacking separately for val1 and val2 and then joining would work:

w1 = unstack(wide, [:x,:a,:b], :cname, :val1)
w2 = unstack(wide, [:x,:a,:b], :cname, :val2)
rename!(w1,:c => :val1_c)
rename!(w1,:d => :val1_d)
rename!(w2,:c => :val2_c)
rename!(w2,:d => :val2_d)
w = innerjoin(w1,w2, on= [:x,:a,:b])

I’ve written a little package (called Douglass.jl) that wraps such operations using syntax similar to Stata. As of now, it only works with one id variable though:

using Douglass
wide = DataFrame(x = 1:12,
       a  = 2:13,
       b  = 3:14,
       val1  = randn(12),
       val2  = randn(12),
       cname = repeat(["c", "d"], inner =6)
       )
set_active_df(:wide)
d"reshape_wide :val1 :val2 , i(:x) j(:cname)"
wide

12×5 DataFrame
│ Row │ x     │ val1c     │ val1d       │ val2c     │ val2d     │
│     │ Int64 │ Float64?  │ Float64?    │ Float64?  │ Float64?  │
├─────┼───────┼───────────┼─────────────┼───────────┼───────────┤
│ 1   │ 1     │ -0.56325  │ missing     │ -0.386099 │ missing   │
│ 2   │ 2     │ -1.06118  │ missing     │ -0.464812 │ missing   │
│ 3   │ 3     │ 0.786299  │ missing     │ 0.168381  │ missing   │
│ 4   │ 4     │ 0.233934  │ missing     │ 0.546724  │ missing   │
│ 5   │ 5     │ -3.04186  │ missing     │ -1.38031  │ missing   │
│ 6   │ 6     │ -0.599674 │ missing     │ -1.5068   │ missing   │
│ 7   │ 7     │ missing   │ 1.23619     │ missing   │ 0.948665  │
│ 8   │ 8     │ missing   │ 1.12558     │ missing   │ -0.869007 │
│ 9   │ 9     │ missing   │ 0.521347    │ missing   │ 0.661981  │
│ 10  │ 10    │ missing   │ -0.00579468 │ missing   │ 1.54164   │
│ 11  │ 11    │ missing   │ 1.2072      │ missing   │ 1.41402   │
│ 12  │ 12    │ missing   │ -1.50068    │ missing   │ 0.67358   │

pdeffebach · September 2, 2020, 12:57am

Can you file an issue in DataFrames? This would be a nice feature.

Rajesh_Talluri · September 2, 2020, 4:15am

Sure, I submitted an issue on DataFrames.

https://github.com/JuliaData/DataFrames.jl/issues/2405

If these transformations are not implemented in DataFrames, I think it would be good to emulate the R functions pivot_longer and pivot_wider which cover a variety of cases and are very general.

xiaodai · September 2, 2020, 5:06am

I agree this would be a nice use case. In the meantime, you can do it like this:

Way 1

# create multiple ones
dfs = unstack.(Ref(wide), Ref([:x, :a,:b]), Ref(:cname), [:val1, :val2])

reduce((df1, df2)->outerjoin(df1, df2, on=[:x, :a, :b], makeunique = true), dfs)

Way 2 (probably more efficient)

dfs = unstack.(Ref(wide), Ref([:x, :a,:b]), Ref(:cname), [:val1, :val2])

select!.(dfs[2:end], Ref(Not([:x, :a,:b])))
reduce((df1, df2) -> hcat(df1, df2, makeunique = true), dfs)
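
Another possibility, offered only as a sketch and assuming a recent DataFrames.jl 1.x release: stack the value columns first, build the target column name from the value name and cname, then unstack once. The names df, long, newname and wide2 are just illustrative, and the small data frame is made up for the example.

```julia
using DataFrames

# Small illustrative input with the same shape as the data in the question.
df = DataFrame(x = 1:4, a = 2:5, b = 3:6,
               val1 = [0.1, 0.2, 0.3, 0.4],
               val2 = [1.1, 1.2, 1.3, 1.4],
               cname = repeat(["c", "d"], inner = 2))

# Go fully long: one row per original row and value column.
# stack adds a :variable column (the value column name) and a :value column.
long = stack(df, [:val1, :val2])

# Combine the value column name and cname into the new column name, e.g. "val1_c".
long.newname = string.(long.variable, "_", long.cname)

# A single unstack now produces val1_c, val1_d, val2_c, val2_d.
wide2 = unstack(long, [:x, :a, :b], :newname, :value)
```

This avoids the join or hcat step entirely, at the cost of requiring the value columns to share a common element type once they are stacked into one column.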

Rajesh_Talluri · September 8, 2020, 5:23am

Thank you @xiaodai! Following your implementation, I wrote a function to unstack with multiple name columns and multiple value columns. I am posting it here as it may help others before it is implemented in DataFrames.

This function is probably very inefficient performance-wise, as I am still learning and could not figure out some basic things. Any tips to improve it are appreciated.

function pivot_wider(df::AbstractDataFrame, names_cols, values_cols)
    function unstack_name(df::AbstractDataFrame, rowkeys, colkey, value)
        unstack(df::AbstractDataFrame, rowkeys, colkey, value, renamecols = x -> Symbol(string(value), "_", x))
    end
    id_cols = setdiff(Symbol.(names(df)), names_cols,values_cols)
    dfa = select(df,:)
    dfa[!,:cname_temp] = [join(Array(r),"_") for r in eachrow(select(dfa,names_cols))]
    dfb = unstack_name.(Ref(dfa), Ref(id_cols), Ref(:cname_temp), values_cols)
    id1 = select(dfb[1],id_cols)
    select!.(dfb, Ref(Not(id_cols)))
    bb = reduce((df1, df2) -> hcat(df1, df2, makeunique = true), dfb)
    hcat(id1,bb)
end

Here is an example to test the function:

wide = DataFrame(x = repeat(1:3,inner = 2,outer = 2),
       a = repeat(4:6,inner = 2,outer = 2),
       b = repeat(7:9,inner = 2,outer = 2),
       val1 = ["ce_val1_1","cf_val1_1","ce_val1_2","cf_val1_2","ce_val1_3","cf_val1_3","de_val1_1","df_val1_1","de_val1_2","df_val1_2","de_val1_3","df_val1_3"], 
       val2 = ["ce_val2_1","cf_val2_1","ce_val2_2","cf_val2_2","ce_val2_3","cf_val2_3","de_val2_1","df_val2_1","de_val2_2","df_val2_2","de_val2_3","df_val2_3"], 
       cname1 = repeat(["c", "d"], inner = 6),
       cname2 = repeat(["e", "f"], 6)
       )

12×7 DataFrame
│ Row │ x     │ a     │ b     │ val1      │ val2      │ cname1 │ cname2 │
│     │ Int64 │ Int64 │ Int64 │ String    │ String    │ String │ String │
├─────┼───────┼───────┼───────┼───────────┼───────────┼────────┼────────┤
│ 1   │ 1     │ 4     │ 7     │ ce_val1_1 │ ce_val2_1 │ c      │ e      │
│ 2   │ 1     │ 4     │ 7     │ cf_val1_1 │ cf_val2_1 │ c      │ f      │
│ 3   │ 2     │ 5     │ 8     │ ce_val1_2 │ ce_val2_2 │ c      │ e      │
│ 4   │ 2     │ 5     │ 8     │ cf_val1_2 │ cf_val2_2 │ c      │ f      │
│ 5   │ 3     │ 6     │ 9     │ ce_val1_3 │ ce_val2_3 │ c      │ e      │
│ 6   │ 3     │ 6     │ 9     │ cf_val1_3 │ cf_val2_3 │ c      │ f      │
│ 7   │ 1     │ 4     │ 7     │ de_val1_1 │ de_val2_1 │ d      │ e      │
│ 8   │ 1     │ 4     │ 7     │ df_val1_1 │ df_val2_1 │ d      │ f      │
│ 9   │ 2     │ 5     │ 8     │ de_val1_2 │ de_val2_2 │ d      │ e      │
│ 10  │ 2     │ 5     │ 8     │ df_val1_2 │ df_val2_2 │ d      │ f      │
│ 11  │ 3     │ 6     │ 9     │ de_val1_3 │ de_val2_3 │ d      │ e      │
│ 12  │ 3     │ 6     │ 9     │ df_val1_3 │ df_val2_3 │ d      │ f      │

pivot_wider(wide, [:cname1,:cname2], [:val1,:val2])

3×11 DataFrame
│ Row │ x     │ a     │ b     │ val1_c_e  │ val1_c_f  │ val1_d_e  │ val1_d_f  │ val2_c_e  │ val2_c_f  │ val2_d_e  │ val2_d_f  │
│     │ Int64 │ Int64 │ Int64 │ String?   │ String?   │ String?   │ String?   │ String?   │ String?   │ String?   │ String?   │
├─────┼───────┼───────┼───────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ 1   │ 1     │ 4     │ 7     │ ce_val1_1 │ cf_val1_1 │ de_val1_1 │ df_val1_1 │ ce_val2_1 │ cf_val2_1 │ de_val2_1 │ df_val2_1 │
│ 2   │ 2     │ 5     │ 8     │ ce_val1_2 │ cf_val1_2 │ de_val1_2 │ df_val1_2 │ ce_val2_2 │ cf_val2_2 │ de_val2_2 │ df_val2_2 │
│ 3   │ 3     │ 6     │ 9     │ ce_val1_3 │ cf_val1_3 │ de_val1_3 │ df_val1_3 │ ce_val2_3 │ cf_val2_3 │ de_val2_3 │ df_val2_3 │

xiaodai · September 8, 2020, 6:33am

Rajesh_Talluri:

function unstack_name(df::AbstractDataFrame, rowkeys, colkey, value)
    unstack(df::AbstractDataFrame, rowkeys, colkey, value, renamecols = x -> Symbol(string(value), "_", x))
end

I am not sure, but this could cause performance issues. Keeping it outside as a separate top-level function would be better.
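
Roughly what I mean, as a minimal sketch assuming a recent DataFrames.jl 1.x release. The helper is the same as the quoted one, just defined at top level; the tiny long data frame and the name w are made up for illustration.

```julia
using DataFrames

# Top-level helper: unstack a single value column and prefix each new
# column name with the name of that value column (e.g. val1_c, val1_d).
function unstack_name(df::AbstractDataFrame, rowkeys, colkey, value)
    unstack(df, rowkeys, colkey, value,
            renamecols = x -> Symbol(value, "_", x))
end

# Illustrative data: two ids, each observed under keys "c" and "d".
long = DataFrame(id = [1, 1, 2, 2],
                 key = ["c", "d", "c", "d"],
                 val1 = [1.0, 2.0, 3.0, 4.0])

w = unstack_name(long, :id, :key, :val1)
```

pivot_wider can then call unstack_name without redefining it on every call.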

xiaodai · September 8, 2020, 6:34am

Since you like helping others, I would welcome a PR to xiaodaigh/TidyStanza.jl (an attempt to implement some {tidyverse} APIs in Julia). We can work through it so that it’s available for others to use.

Rajesh_Talluri · September 8, 2020, 12:38pm

This is a great idea! Dplyr and tidyverse are the greatest strengths of R. Data processing is so much easier and more concise using dplyr. The functions are well thought out in terms of functionality. I think it is a worthwhile project to emulate it. Happy to help! And nice name, Hadleyverse.

xiaodai · September 8, 2020, 12:51pm

Well, my first naming attempt was Dply.jl, but package names can’t have fewer than 5 letters, so I had to come up with something unique that doesn’t take up a common word.

nilshg · September 8, 2020, 3:19pm

DplyJ.jl?

xiaodai · September 8, 2020, 11:36pm

Package names also can’t end in a capital letter.

.jl already conveys Julia

nilshg · September 9, 2020, 5:38am

(Sorry, this discussion is entirely off topic, just one last point: the J would be consistent with MLJ, which as I understand it was inspired by the R package MLR, and is also a good example of a package name with capital letters only and fewer than 5 letters. The naming conventions I think you’re referring to apply only to automatic merging of package registrations in the General registry, so you can deviate from them; it just requires manual merging.)

xiaodai · September 9, 2020, 5:46am

Yeah. I like to keep to the auto-merge rules, to lessen the burden on maintainers. I guess MLJ.jl is nice.

xiaodai · September 10, 2020, 3:51am

@Rajesh_Talluri I’ve registered Hadleyverse 0.1.1 with pivot_wider. See https://github.com/xiaodaigh/Hadleyverse.jl#pivot_wider

Rajesh_Talluri · September 10, 2020, 4:03am

Thanks so much for helping me out!
