Pivot a dataframe to wide format with values in multiple columns

Hi, I am trying to perform a long to wide transformation of a data frame in Julia. The original data frame I have is:

wide = DataFrame(x = 1:12,
       a  = 2:13,
       b  = 3:14,
       val1  = randn(12),
       val2  = randn(12),
       cname = repeat(["c", "d"], inner =6)
       )

12×6 DataFrame
│ Row │ x     │ a     │ b     │ val1      │ val2      │ cname  │
│     │ Int64 │ Int64 │ Int64 │ Float64   │ Float64   │ String │
├─────┼───────┼───────┼───────┼───────────┼───────────┼────────┤
│ 1   │ 1     │ 2     │ 3     │ 1.51014   │ -1.18548  │ c      │
│ 2   │ 2     │ 3     │ 4     │ 0.0845411 │ -0.370083 │ c      │
│ 3   │ 3     │ 4     │ 5     │ 0.826283  │ -1.00423  │ c      │
│ 4   │ 4     │ 5     │ 6     │ -0.53175  │ -1.16659  │ c      │
│ 5   │ 5     │ 6     │ 7     │ -1.77975  │ 0.336333  │ c      │
│ 6   │ 6     │ 7     │ 8     │ 0.632577  │ 0.236621  │ c      │
│ 7   │ 7     │ 8     │ 9     │ -0.681532 │ 1.14869   │ d      │
│ 8   │ 8     │ 9     │ 10    │ -0.775619 │ 0.393475  │ d      │
│ 9   │ 9     │ 10    │ 11    │ -0.533034 │ 0.059624  │ d      │
│ 10  │ 10    │ 11    │ 12    │ 0.496152  │ -1.23507  │ d      │
│ 11  │ 11    │ 12    │ 13    │ 0.834099  │ 2.12115   │ d      │
│ 12  │ 12    │ 13    │ 14    │ 0.532357  │ -0.369267 │ d      │

I am trying to mimic the pivot_wider function in R:

wide %>% pivot_wider(names_from = cname, values_from = c(val1,val2))

===  ===  ===  ==========  ==========  ==========  ==========
  x    a    b      val1_c      val1_d      val2_c      val2_d
===  ===  ===  ==========  ==========  ==========  ==========
  1    2    3   1.0174232          NA  -0.6611959          NA
  2    3    4   0.6590795          NA  -2.0954505          NA
  3    4    5   1.2939581          NA   1.6350356          NA
  4    5    6  -1.9395356          NA   0.7813238          NA
  5    6    7   0.3558087          NA   0.9789414          NA
  6    7    8   0.9859100          NA  -0.9803336          NA
  7    8    9          NA   0.4949224          NA  -0.0659333
  8    9   10          NA   0.5024755          NA  -0.2317832
  9   10   11          NA   1.6926897          NA  -0.3840687
 10   11   12          NA  -0.4324705          NA  -0.0901276
 11   12   13          NA  -0.6415260          NA   0.0014151
 12   13   14          NA   1.2406868          NA  -2.1959740
===  ===  ===  ==========  ==========  ==========  ==========

unstack seems similar but I could not get it to work.

I was trying

unstack(wide, [:x, :a,:b], :cname, [:val1,:val2]) 

Could someone please help me with anything that I am missing here?

There may be a way to do it all at once, but unstacking it separately for var1 and var2 and then joining would work:

w1 = unstack(wide, [:x,:a,:b], :cname, :val1)
w2 = unstack(wide, [:x,:a,:b], :cname, :val2)
rename!(w1,:c => :val1_c)
rename!(w1,:d => :val1_d)
rename!(w2,:c => :val2_c)
rename!(w2,:d => :val2_d)
w = innerjoin(w1,w2, on= [:x,:a,:b])

I’ve written a little package (called Douglass.jl) that wraps such operations using syntax similar to Stata. As of now, it only works with one id variable though:

using Douglass
wide = DataFrame(x = 1:12,
       a  = 2:13,
       b  = 3:14,
       val1  = randn(12),
       val2  = randn(12),
       cname = repeat(["c", "d"], inner =6)
       )
set_active_df(:wide)
d"reshape_wide :val1 :val2 , i(:x) j(:cname)"
wide
12×5 DataFrame
│ Row │ x     │ val1c     │ val1d       │ val2c     │ val2d     │
│     │ Int64 │ Float64?  │ Float64?    │ Float64?  │ Float64?  │
├─────┼───────┼───────────┼─────────────┼───────────┼───────────┤
│ 1   │ 1     │ -0.56325  │ missing     │ -0.386099 │ missing   │
│ 2   │ 2     │ -1.06118  │ missing     │ -0.464812 │ missing   │
│ 3   │ 3     │ 0.786299  │ missing     │ 0.168381  │ missing   │
│ 4   │ 4     │ 0.233934  │ missing     │ 0.546724  │ missing   │
│ 5   │ 5     │ -3.04186  │ missing     │ -1.38031  │ missing   │
│ 6   │ 6     │ -0.599674 │ missing     │ -1.5068   │ missing   │
│ 7   │ 7     │ missing   │ 1.23619     │ missing   │ 0.948665  │
│ 8   │ 8     │ missing   │ 1.12558     │ missing   │ -0.869007 │
│ 9   │ 9     │ missing   │ 0.521347    │ missing   │ 0.661981  │
│ 10  │ 10    │ missing   │ -0.00579468 │ missing   │ 1.54164   │
│ 11  │ 11    │ missing   │ 1.2072      │ missing   │ 1.41402   │
│ 12  │ 12    │ missing   │ -1.50068    │ missing   │ 0.67358   │

Can you file an issue in DataFrames? This would be a nice feature.

1 Like

Sure, I submitted an issue on DataFrames.

https://github.com/JuliaData/DataFrames.jl/issues/2405

If these transformations are not implemented in DataFrames, I think it would be good to emulate the R functions pivot_longer and pivot_wider which cover a variety of cases and are very general.

For me, this would be a nice use-case, but if you want to do it you can do this first

Way 1

# create multiple ones
dfs = unstack.(Ref(wide), Ref([:x, :a,:b]), Ref(:cname), [:val1, :val2])

reduce((df1, df2)->outerjoin(df1, df2, on=[:x, :a, :b], makeunique = true), dfs)

Way 2 Probably more efficient.

dfs = unstack.(Ref(wide), Ref([:x, :a,:b]), Ref(:cname), [:val1, :val2])

select!.(dfs[2:end], Ref(Not([:x, :a,:b])))
reduce((df1, df2) -> hcat(df1, df2, makeunique = true), dfs)
1 Like

Thank you @xiaodai! Following your implementation, I wrote the function to unstack with multiple name columns and multiple value columns. I am posting it here as it may help others before it its implemented in DataFrames.

This function is probably very inefficient performance-wise, as I am still learning and could not figure out some basic things. Any tips to improve it are appreciated.

function pivot_wider(df::AbstractDataFrame, names_cols, values_cols)
    function unstack_name(df::AbstractDataFrame, rowkeys, colkey, value)
        unstack(df::AbstractDataFrame, rowkeys, colkey, value, renamecols = x -> Symbol(string(value), "_", x))
    end
    id_cols = setdiff(Symbol.(names(df)), names_cols,values_cols)
    dfa = select(df,:)
    dfa[!,:cname_temp] = [join(Array(r),"_") for r in eachrow(select(dfa,names_cols))]
    dfb = unstack_name.(Ref(dfa), Ref(id_cols), Ref(:cname_temp), values_cols)
    id1 = select(dfb[1],id_cols)
    select!.(dfb, Ref(Not(id_cols)))
    bb = reduce((df1, df2) -> hcat(df1, df2, makeunique = true), dfb)
    hcat(id1,bb)
end

The example to test the function

wide = DataFrame(x = repeat(1:3,inner = 2,outer = 2),
       a = repeat(4:6,inner = 2,outer = 2),
       b = repeat(7:9,inner = 2,outer = 2),
       val1 = ["ce_val1_1","cf_val1_1","ce_val1_2","cf_val1_2","ce_val1_3","cf_val1_3","de_val1_1","df_val1_1","de_val1_2","df_val1_2","de_val1_3","df_val1_3"], 
       val2 = ["ce_val2_1","cf_val2_1","ce_val2_2","cf_val2_2","ce_val2_3","cf_val2_3","de_val2_1","df_val2_1","de_val2_2","df_val2_2","de_val2_3","df_val2_3"], 
       cname1 = repeat(["c", "d"], inner = 6),
       cname2 = repeat(["e", "f"], 6)
       )
12×7 DataFrame
│ Row │ x     │ a     │ b     │ val1      │ val2      │ cname1 │ cname2 │
│     │ Int64 │ Int64 │ Int64 │ String    │ String    │ String │ String │
├─────┼───────┼───────┼───────┼───────────┼───────────┼────────┼────────┤
│ 1   │ 1     │ 4     │ 7     │ ce_val1_1 │ ce_val2_1 │ c      │ e      │
│ 2   │ 1     │ 4     │ 7     │ cf_val1_1 │ cf_val2_1 │ c      │ f      │
│ 3   │ 2     │ 5     │ 8     │ ce_val1_2 │ ce_val2_2 │ c      │ e      │
│ 4   │ 2     │ 5     │ 8     │ cf_val1_2 │ cf_val2_2 │ c      │ f      │
│ 5   │ 3     │ 6     │ 9     │ ce_val1_3 │ ce_val2_3 │ c      │ e      │
│ 6   │ 3     │ 6     │ 9     │ cf_val1_3 │ cf_val2_3 │ c      │ f      │
│ 7   │ 1     │ 4     │ 7     │ de_val1_1 │ de_val2_1 │ d      │ e      │
│ 8   │ 1     │ 4     │ 7     │ df_val1_1 │ df_val2_1 │ d      │ f      │
│ 9   │ 2     │ 5     │ 8     │ de_val1_2 │ de_val2_2 │ d      │ e      │
│ 10  │ 2     │ 5     │ 8     │ df_val1_2 │ df_val2_2 │ d      │ f      │
│ 11  │ 3     │ 6     │ 9     │ de_val1_3 │ de_val2_3 │ d      │ e      │
│ 12  │ 3     │ 6     │ 9     │ df_val1_3 │ df_val2_3 │ d      │ f      │
pivot_wider(wide, [:cname1,:cname2], [:val1,:val2])
3×11 DataFrame
│ Row │ x     │ a     │ b     │ val1_c_e  │ val1_c_f  │ val1_d_e  │ val1_d_f  │ val2_c_e  │ val2_c_f  │ val2_d_e  │ val2_d_f  │
│     │ Int64 │ Int64 │ Int64 │ String?   │ String?   │ String?   │ String?   │ String?   │ String?   │ String?   │ String?   │
├─────┼───────┼───────┼───────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ 1   │ 1     │ 4     │ 7     │ ce_val1_1 │ cf_val1_1 │ de_val1_1 │ df_val1_1 │ ce_val2_1 │ cf_val2_1 │ de_val2_1 │ df_val2_1 │
│ 2   │ 2     │ 5     │ 8     │ ce_val1_2 │ cf_val1_2 │ de_val1_2 │ df_val1_2 │ ce_val2_2 │ cf_val2_2 │ de_val2_2 │ df_val2_2 │
│ 3   │ 3     │ 6     │ 9     │ ce_val1_3 │ cf_val1_3 │ de_val1_3 │ df_val1_3 │ ce_val2_3 │ cf_val2_3 │ de_val2_3 │ df_val2_3 │
2 Likes

I am not sure, but this could cause performance issues. Keep it outside as a separate function is better.

Since you like helping others, I would welcome a PR to https://github.com/xiaodaigh/Hadleyverse.jl and work thru it so that it’s available for others to use.

This is a great idea! Dplyr and tidyverse are the greatest strengths of R. Data processing is so much easier and concise using dplyr. The functions are well thought of in terms of functionality. I think it is a worthwhile project to emulate it. Happy to help! And nice name Hadleyverse :wink:

Well my first naming attempt was Dply.jl but packages can’t have less than 5 letters, so I had to make something unique but not take up some common word.

DplyJ.jl?

Package also can’t end in capital letter.

.jl already conveys Julia

(Sorry this discussion is entirely off topic, just one last point: The J would be consistent with MLJ, which as I understand it was inspired by the R package MLR, and is also a good example for a package with capital letters only and less than 5 letters. The naming conventions I think you’re referring to are just for the automatic merging of the registration of packages in the General registry, so you can deviate from them, it just required manual merging)

Yeah. I like to keep to auto rules. To lessen burden on maintainers. I guess MLJ.jl is nice.

@Rajesh_Talluri I’ve registered Hadleyverse 0.1.1 with pivot_wider See https://github.com/xiaodaigh/Hadleyverse.jl#pivot_wider

Thanks so much for helping me out! :+1: