How (best) to transform a huge DataFrame into wide-format

I have a DataFrame with columns like open, high, low, close, and vol for multiple tickers, e.g.

df1 = DataFrame(
date_time = repeat(["2024-12-02T14:30:00", "2024-12-02T16:00:00","2024-12-02T17:30:00", "2024-12-02T19:00:00" ], outer=2),
ticker = repeat(["AAPL", "IBM"], inner=4),
open = rand(8),
high = rand(8),
low = rand(8),
close = rand(8),
vol = rand(8)*10^5
)

 Row โ”‚ date                 ticker  open       high       low       close       vol      
     โ”‚ String               String  Float64    Float64    Float64   Float64     Float64  
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚ 2024-12-02T14:30:00  AAPL    0.689655   0.837242   0.636093  0.0616286    8079.08
   2 โ”‚ 2024-12-02T16:00:00  AAPL    0.70835    0.120729   0.922828  0.0278368   98275.5
   3 โ”‚ 2024-12-02T17:30:00  AAPL    0.235651   0.170414   0.854302  0.805576    26022.2
   4 โ”‚ 2024-12-02T19:00:00  AAPL    0.154155   0.977993   0.744772  0.563214    35684.5
   5 โ”‚ 2024-12-02T14:30:00  IBM     0.294663   0.70168    0.213208  0.00785374  61187.7
   6 โ”‚ 2024-12-02T16:00:00  IBM     0.740926   0.0221332  0.320625  0.102369    13345.5
   7 โ”‚ 2024-12-02T17:30:00  IBM     0.0352852  0.0763372  0.385503  0.998992    53140.9
   8 โ”‚ 2024-12-02T19:00:00  IBM     0.397259   0.606574   0.883001  0.180595    94851.4

Would like to transform the DataFrame (df1) into wide-format where these metrics for each timestamp are split by :ticker. In pseudo code:

df2 columns = [:date-time :AAPL_open, :AAPL_high, ... :IBM_close, :IBM_vol]

Which would be the best way (the real DataFrame is huge) to achieve this? Than you!

help?> unstack
search: unstack stack unlock unmark instances abstract issticky

  unstack(df::AbstractDataFrame, rowkeys, colkey, value;
          renamecols::Function=identity, allowmissing::Bool=false,
          combine=only, fill=missing, threads::Bool=true)
  unstack(df::AbstractDataFrame, colkey, value;
          renamecols::Function=identity, allowmissing::Bool=false,
          combine=only, fill=missing, threads::Bool=true)
  unstack(df::AbstractDataFrame;
          renamecols::Function=identity, allowmissing::Bool=false,
          combine=only, fill=missing, threads::Bool=true)

  Unstack data frame df, i.e. convert it from long to wide format.

(...)

  Examples
  โ‰กโ‰กโ‰กโ‰กโ‰กโ‰กโ‰กโ‰ก

  julia> wide = DataFrame(id=1:6,
                          a=repeat(1:3, inner=2),
                          b=repeat(1.0:2.0, inner=3),
                          c=repeat(1.0:1.0, inner=6),
                          d=repeat(1.0:3.0, inner=2))
  6ร—5 DataFrame
   Row โ”‚ id     a      b        c        d
       โ”‚ Int64  Int64  Float64  Float64  Float64
  โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
     1 โ”‚     1      1      1.0      1.0      1.0
     2 โ”‚     2      1      1.0      1.0      1.0
     3 โ”‚     3      2      1.0      1.0      2.0
     4 โ”‚     4      2      2.0      1.0      2.0
     5 โ”‚     5      3      2.0      1.0      3.0
     6 โ”‚     6      3      2.0      1.0      3.0

  julia> long = stack(wide)
  18ร—4 DataFrame
   Row โ”‚ id     a      variable  value
       โ”‚ Int64  Int64  String    Float64
  โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
     1 โ”‚     1      1  b             1.0
     2 โ”‚     2      1  b             1.0
     3 โ”‚     3      2  b             1.0
     4 โ”‚     4      2  b             2.0
     5 โ”‚     5      3  b             2.0
     6 โ”‚     6      3  b             2.0
     7 โ”‚     1      1  c             1.0
     8 โ”‚     2      1  c             1.0
    โ‹ฎ  โ”‚   โ‹ฎ      โ‹ฎ       โ‹ฎ         โ‹ฎ
    12 โ”‚     6      3  c             1.0
    13 โ”‚     1      1  d             1.0
    14 โ”‚     2      1  d             1.0
    15 โ”‚     3      2  d             2.0
    16 โ”‚     4      2  d             2.0
    17 โ”‚     5      3  d             3.0
    18 โ”‚     6      3  d             3.0
                           3 rows omitted

  julia> unstack(long)
  6ร—5 DataFrame
   Row โ”‚ id     a      b         c         d
       โ”‚ Int64  Int64  Float64?  Float64?  Float64?
  โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
     1 โ”‚     1      1       1.0       1.0       1.0
     2 โ”‚     2      1       1.0       1.0       1.0
     3 โ”‚     3      2       1.0       1.0       2.0
     4 โ”‚     4      2       2.0       1.0       2.0
     5 โ”‚     5      3       2.0       1.0       3.0
     6 โ”‚     6      3       2.0       1.0       3.0

Is this what youโ€™re looking for?

dfg=groupby(df1,:ticker)
r(g)=rename(DataFrame(g), (names(g[:,Not(:ticker)]).=>g[1,:ticker]*"_" .*names(g[:,Not(:ticker)]))...)
reduce((h,g)->hcat(r(h),r(g), makeunique=true), dfg )

or

r1(g)=rename(DataFrame(g), (names(g).=>g[1,:ticker]*"_" .*names(g))...)
reduce((h,g)->hcat(r1(h),r1(g)), dfg )
1 Like

Thank you for your reply. I am aware of the unstack() method, my question was more about how to apply it to achieve my desired objective.

Unstack doesnโ€™t work for multiple columns. So you will need something with a loop, unfortunately.

julia> reduce(gd; init = DataFrame(date_time = unique(df1.date_time))) do df_l, df_r
           t_r = first(df_r.ticker)
           df_r = select(df_r, Not(:ticker))
           df_r = rename(df_r) do n
               n == "date_time" && return n
               n * "_" * t_r
           end
           leftjoin(df_l, df_r; on = :date_time)
       end
4ร—11 DataFrame
 Row โ”‚ date_time            open_AAPL  high_AAPL  low_AAPL  close_AAPL  vol_AAPL  open_IBM   high_IBM  low_IBM   close_IBM  vol_IBM  
     โ”‚ String               Float64?   Float64?   Float64?  Float64?    Float64?  Float64?   Float64?  Float64?  Float64?   Float64? 
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚ 2024-12-02T14:30:00   0.503531  0.952858   0.542719    0.643992   82669.2  0.814595   0.872334  0.3809     0.227004   66136.0
   2 โ”‚ 2024-12-02T16:00:00   0.943105  0.858138   0.715906    0.484657   91041.3  0.45135    0.522677  0.104943   0.315722   54555.6
   3 โ”‚ 2024-12-02T17:30:00   0.190797  0.0412094  0.347572    0.539557   85581.2  0.0269947  0.250749  0.328845   0.482173   22247.9
   4 โ”‚ 2024-12-02T19:00:00   0.706489  0.0853058  0.935489    0.470801   74114.2  0.238969   0.921508  0.187803   0.299688   54766.8
1 Like

With the df1 in the OP. The following:

unstack(
  select(
    stack(df1, Not([:date_time, :ticker])), 
    :date_time, 
    [:ticker, :variable] => ByRow((x,y)->"$(x)_$(y)") => :col, 
    :value),
  :date_time, :col, :value
)

gives:

4ร—11 DataFrame
 Row โ”‚ date_time            AAPL_open  IBM_open  AAPL_high  IBM_high  AAPL_low  IBM_low   AAPL_close  IBM_close   โ‹ฏ
     โ”‚ String               Float64?   Float64?  Float64?   Float64?  Float64?  Float64?  Float64?    Float64?    โ‹ฏ
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚ 2024-12-02T14:30:00   0.941412  0.207643   0.879879  0.826008  0.775218  0.879673    0.486391   0.293913   โ‹ฏ
   2 โ”‚ 2024-12-02T16:00:00   0.952736  0.763155   0.542305  0.937118  0.104578  0.26724     0.992486   0.920629
   3 โ”‚ 2024-12-02T17:30:00   0.692002  0.959982   0.408699  0.587018  0.777649  0.207722    0.67562    0.57377
   4 โ”‚ 2024-12-02T19:00:00   0.639045  0.913077   0.286222  0.274339  0.339372  0.637027    0.564277   0.371248

Combining stack, unstack and select can really do most of the pivoting wizardry from Excel and friends.

5 Likes

And unstack here would work like this:

reduce((x,y) -> leftjoin(x, y, on = :date_time), 
    unstack(df1[:, ["date_time", "ticker", y]], "ticker", y, 
          renamecols = x -> x*"_"*y) for y โˆˆ names(df1)[3:end])
2 Likes

Ugh this is the โ€œgather-spreadโ€ pattern that dplyr recommends for this kind of thing. Hopefully we can make a better API for this soon.

1 Like

a similar approach (exactly reversed, in fact) for another problem

1 Like