How (best) to transform a huge DataFrame into wide-format

askvorts · December 3, 2024, 2:14pm

I have a DataFrame with columns like open, high, low, close, and vol for multiple tickers, e.g.

df1 = DataFrame(
date_time = repeat(["2024-12-02T14:30:00", "2024-12-02T16:00:00","2024-12-02T17:30:00", "2024-12-02T19:00:00" ], outer=2),
ticker = repeat(["AAPL", "IBM"], inner=4),
open = rand(8),
high = rand(8),
low = rand(8),
close = rand(8),
vol = rand(8)*10^5
)

 Row │ date                 ticker  open       high       low       close       vol      
     │ String               String  Float64    Float64    Float64   Float64     Float64  
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │ 2024-12-02T14:30:00  AAPL    0.689655   0.837242   0.636093  0.0616286    8079.08
   2 │ 2024-12-02T16:00:00  AAPL    0.70835    0.120729   0.922828  0.0278368   98275.5
   3 │ 2024-12-02T17:30:00  AAPL    0.235651   0.170414   0.854302  0.805576    26022.2
   4 │ 2024-12-02T19:00:00  AAPL    0.154155   0.977993   0.744772  0.563214    35684.5
   5 │ 2024-12-02T14:30:00  IBM     0.294663   0.70168    0.213208  0.00785374  61187.7
   6 │ 2024-12-02T16:00:00  IBM     0.740926   0.0221332  0.320625  0.102369    13345.5
   7 │ 2024-12-02T17:30:00  IBM     0.0352852  0.0763372  0.385503  0.998992    53140.9
   8 │ 2024-12-02T19:00:00  IBM     0.397259   0.606574   0.883001  0.180595    94851.4

Would like to transform the DataFrame (df1) into wide-format where these metrics for each timestamp are split by :ticker. In pseudo code:

df2 columns = [:date-time :AAPL_open, :AAPL_high, ... :IBM_close, :IBM_vol]

Which would be the best way (the real DataFrame is huge) to achieve this? Than you!

nilshg · December 3, 2024, 2:42pm

help?> unstack
search: unstack stack unlock unmark instances abstract issticky

  unstack(df::AbstractDataFrame, rowkeys, colkey, value;
          renamecols::Function=identity, allowmissing::Bool=false,
          combine=only, fill=missing, threads::Bool=true)
  unstack(df::AbstractDataFrame, colkey, value;
          renamecols::Function=identity, allowmissing::Bool=false,
          combine=only, fill=missing, threads::Bool=true)
  unstack(df::AbstractDataFrame;
          renamecols::Function=identity, allowmissing::Bool=false,
          combine=only, fill=missing, threads::Bool=true)

  Unstack data frame df, i.e. convert it from long to wide format.

(...)

  Examples
  ≡≡≡≡≡≡≡≡

  julia> wide = DataFrame(id=1:6,
                          a=repeat(1:3, inner=2),
                          b=repeat(1.0:2.0, inner=3),
                          c=repeat(1.0:1.0, inner=6),
                          d=repeat(1.0:3.0, inner=2))
  6×5 DataFrame
   Row │ id     a      b        c        d
       │ Int64  Int64  Float64  Float64  Float64
  ─────┼─────────────────────────────────────────
     1 │     1      1      1.0      1.0      1.0
     2 │     2      1      1.0      1.0      1.0
     3 │     3      2      1.0      1.0      2.0
     4 │     4      2      2.0      1.0      2.0
     5 │     5      3      2.0      1.0      3.0
     6 │     6      3      2.0      1.0      3.0

  julia> long = stack(wide)
  18×4 DataFrame
   Row │ id     a      variable  value
       │ Int64  Int64  String    Float64
  ─────┼─────────────────────────────────
     1 │     1      1  b             1.0
     2 │     2      1  b             1.0
     3 │     3      2  b             1.0
     4 │     4      2  b             2.0
     5 │     5      3  b             2.0
     6 │     6      3  b             2.0
     7 │     1      1  c             1.0
     8 │     2      1  c             1.0
    ⋮  │   ⋮      ⋮       ⋮         ⋮
    12 │     6      3  c             1.0
    13 │     1      1  d             1.0
    14 │     2      1  d             1.0
    15 │     3      2  d             2.0
    16 │     4      2  d             2.0
    17 │     5      3  d             3.0
    18 │     6      3  d             3.0
                           3 rows omitted

  julia> unstack(long)
  6×5 DataFrame
   Row │ id     a      b         c         d
       │ Int64  Int64  Float64?  Float64?  Float64?
  ─────┼────────────────────────────────────────────
     1 │     1      1       1.0       1.0       1.0
     2 │     2      1       1.0       1.0       1.0
     3 │     3      2       1.0       1.0       2.0
     4 │     4      2       2.0       1.0       2.0
     5 │     5      3       2.0       1.0       3.0
     6 │     6      3       2.0       1.0       3.0

rocco_sprmnt21 · December 3, 2024, 3:58pm

Is this what you’re looking for?

dfg=groupby(df1,:ticker)
r(g)=rename(DataFrame(g), (names(g[:,Not(:ticker)]).=>g[1,:ticker]*"_" .*names(g[:,Not(:ticker)]))...)
reduce((h,g)->hcat(r(h),r(g), makeunique=true), dfg )

or

r1(g)=rename(DataFrame(g), (names(g).=>g[1,:ticker]*"_" .*names(g))...)
reduce((h,g)->hcat(r1(h),r1(g)), dfg )

askvorts · December 3, 2024, 4:00pm

Thank you for your reply. I am aware of the unstack() method, my question was more about how to apply it to achieve my desired objective.

pdeffebach · December 3, 2024, 4:12pm

askvorts:

df1 = DataFrame(
date_time = repeat(["2024-12-02T14:30:00", "2024-12-02T16:00:00","2024-12-02T17:30:00", "2024-12-02T19:00:00" ], outer=2),
ticker = repeat(["AAPL", "IBM"], inner=4),
open = rand(8),
high = rand(8),
low = rand(8),
close = rand(8),
vol = rand(8)*10^5
)

Unstack doesn’t work for multiple columns. So you will need something with a loop, unfortunately.

julia> reduce(gd; init = DataFrame(date_time = unique(df1.date_time))) do df_l, df_r
           t_r = first(df_r.ticker)
           df_r = select(df_r, Not(:ticker))
           df_r = rename(df_r) do n
               n == "date_time" && return n
               n * "_" * t_r
           end
           leftjoin(df_l, df_r; on = :date_time)
       end
4×11 DataFrame
 Row │ date_time            open_AAPL  high_AAPL  low_AAPL  close_AAPL  vol_AAPL  open_IBM   high_IBM  low_IBM   close_IBM  vol_IBM  
     │ String               Float64?   Float64?   Float64?  Float64?    Float64?  Float64?   Float64?  Float64?  Float64?   Float64? 
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 2024-12-02T14:30:00   0.503531  0.952858   0.542719    0.643992   82669.2  0.814595   0.872334  0.3809     0.227004   66136.0
   2 │ 2024-12-02T16:00:00   0.943105  0.858138   0.715906    0.484657   91041.3  0.45135    0.522677  0.104943   0.315722   54555.6
   3 │ 2024-12-02T17:30:00   0.190797  0.0412094  0.347572    0.539557   85581.2  0.0269947  0.250749  0.328845   0.482173   22247.9
   4 │ 2024-12-02T19:00:00   0.706489  0.0853058  0.935489    0.470801   74114.2  0.238969   0.921508  0.187803   0.299688   54766.8

Dan · December 3, 2024, 5:06pm

With the df1 in the OP. The following:

unstack(
  select(
    stack(df1, Not([:date_time, :ticker])), 
    :date_time, 
    [:ticker, :variable] => ByRow((x,y)->"$(x)_$(y)") => :col, 
    :value),
  :date_time, :col, :value
)

gives:

4×11 DataFrame
 Row │ date_time            AAPL_open  IBM_open  AAPL_high  IBM_high  AAPL_low  IBM_low   AAPL_close  IBM_close   ⋯
     │ String               Float64?   Float64?  Float64?   Float64?  Float64?  Float64?  Float64?    Float64?    ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 2024-12-02T14:30:00   0.941412  0.207643   0.879879  0.826008  0.775218  0.879673    0.486391   0.293913   ⋯
   2 │ 2024-12-02T16:00:00   0.952736  0.763155   0.542305  0.937118  0.104578  0.26724     0.992486   0.920629
   3 │ 2024-12-02T17:30:00   0.692002  0.959982   0.408699  0.587018  0.777649  0.207722    0.67562    0.57377
   4 │ 2024-12-02T19:00:00   0.639045  0.913077   0.286222  0.274339  0.339372  0.637027    0.564277   0.371248

Combining stack, unstack and select can really do most of the pivoting wizardry from Excel and friends.

nilshg · December 3, 2024, 5:09pm

And unstack here would work like this:

reduce((x,y) -> leftjoin(x, y, on = :date_time), 
    unstack(df1[:, ["date_time", "ticker", y]], "ticker", y, 
          renamecols = x -> x*"_"*y) for y ∈ names(df1)[3:end])

pdeffebach · December 3, 2024, 5:43pm

Ugh this is the “gather-spread” pattern that dplyr recommends for this kind of thing. Hopefully we can make a better API for this soon.

rocco_sprmnt21 · December 4, 2024, 10:14am

a similar approach (exactly reversed, in fact) for another problem

Topic		Replies	Views
New DataFrame whose columns are values of a column and grouped by another column General Usage dataframes	2	371	May 14, 2021
Pivot a dataframe to wide format with values in multiple columns New to Julia question , dataframes	15	2022	September 10, 2020
How to reshape wide to long? General Usage	13	1129	June 24, 2022
Unstack using row numbers Data question , dataframes	5	807	August 16, 2021
Best practice to unstack a dataframe with a lot of columnes? General Usage question , dataframes	5	1340	March 11, 2020

How (best) to transform a huge DataFrame into wide-format

Related topics