I have a DataFrame with columns like open, high, low, close, and vol for multiple tickers, e.g.
df1 = DataFrame(
date_time = repeat(["2024-12-02T14:30:00", "2024-12-02T16:00:00","2024-12-02T17:30:00", "2024-12-02T19:00:00" ], outer=2),
ticker = repeat(["AAPL", "IBM"], inner=4),
open = rand(8),
high = rand(8),
low = rand(8),
close = rand(8),
vol = rand(8)*10^5
)
Row โ date ticker open high low close vol
โ String String Float64 Float64 Float64 Float64 Float64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 2024-12-02T14:30:00 AAPL 0.689655 0.837242 0.636093 0.0616286 8079.08
2 โ 2024-12-02T16:00:00 AAPL 0.70835 0.120729 0.922828 0.0278368 98275.5
3 โ 2024-12-02T17:30:00 AAPL 0.235651 0.170414 0.854302 0.805576 26022.2
4 โ 2024-12-02T19:00:00 AAPL 0.154155 0.977993 0.744772 0.563214 35684.5
5 โ 2024-12-02T14:30:00 IBM 0.294663 0.70168 0.213208 0.00785374 61187.7
6 โ 2024-12-02T16:00:00 IBM 0.740926 0.0221332 0.320625 0.102369 13345.5
7 โ 2024-12-02T17:30:00 IBM 0.0352852 0.0763372 0.385503 0.998992 53140.9
8 โ 2024-12-02T19:00:00 IBM 0.397259 0.606574 0.883001 0.180595 94851.4
Would like to transform the DataFrame (df1) into wide-format where these metrics for each timestamp are split by :ticker. In pseudo code:
df2 columns = [:date-time :AAPL_open, :AAPL_high, ... :IBM_close, :IBM_vol]
Which would be the best way (the real DataFrame is huge) to achieve this? Than you!
nilshg
December 3, 2024, 2:42pm
2
help?> unstack
search: unstack stack unlock unmark instances abstract issticky
unstack(df::AbstractDataFrame, rowkeys, colkey, value;
renamecols::Function=identity, allowmissing::Bool=false,
combine=only, fill=missing, threads::Bool=true)
unstack(df::AbstractDataFrame, colkey, value;
renamecols::Function=identity, allowmissing::Bool=false,
combine=only, fill=missing, threads::Bool=true)
unstack(df::AbstractDataFrame;
renamecols::Function=identity, allowmissing::Bool=false,
combine=only, fill=missing, threads::Bool=true)
Unstack data frame df, i.e. convert it from long to wide format.
(...)
Examples
โกโกโกโกโกโกโกโก
julia> wide = DataFrame(id=1:6,
a=repeat(1:3, inner=2),
b=repeat(1.0:2.0, inner=3),
c=repeat(1.0:1.0, inner=6),
d=repeat(1.0:3.0, inner=2))
6ร5 DataFrame
Row โ id a b c d
โ Int64 Int64 Float64 Float64 Float64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 1 1.0 1.0 1.0
2 โ 2 1 1.0 1.0 1.0
3 โ 3 2 1.0 1.0 2.0
4 โ 4 2 2.0 1.0 2.0
5 โ 5 3 2.0 1.0 3.0
6 โ 6 3 2.0 1.0 3.0
julia> long = stack(wide)
18ร4 DataFrame
Row โ id a variable value
โ Int64 Int64 String Float64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 1 b 1.0
2 โ 2 1 b 1.0
3 โ 3 2 b 1.0
4 โ 4 2 b 2.0
5 โ 5 3 b 2.0
6 โ 6 3 b 2.0
7 โ 1 1 c 1.0
8 โ 2 1 c 1.0
โฎ โ โฎ โฎ โฎ โฎ
12 โ 6 3 c 1.0
13 โ 1 1 d 1.0
14 โ 2 1 d 1.0
15 โ 3 2 d 2.0
16 โ 4 2 d 2.0
17 โ 5 3 d 3.0
18 โ 6 3 d 3.0
3 rows omitted
julia> unstack(long)
6ร5 DataFrame
Row โ id a b c d
โ Int64 Int64 Float64? Float64? Float64?
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 1 1.0 1.0 1.0
2 โ 2 1 1.0 1.0 1.0
3 โ 3 2 1.0 1.0 2.0
4 โ 4 2 2.0 1.0 2.0
5 โ 5 3 2.0 1.0 3.0
6 โ 6 3 2.0 1.0 3.0
Is this what youโre looking for?
dfg=groupby(df1,:ticker)
r(g)=rename(DataFrame(g), (names(g[:,Not(:ticker)]).=>g[1,:ticker]*"_" .*names(g[:,Not(:ticker)]))...)
reduce((h,g)->hcat(r(h),r(g), makeunique=true), dfg )
or
r1(g)=rename(DataFrame(g), (names(g).=>g[1,:ticker]*"_" .*names(g))...)
reduce((h,g)->hcat(r1(h),r1(g)), dfg )
1 Like
Thank you for your reply. I am aware of the unstack() method, my question was more about how to apply it to achieve my desired objective.
askvorts:
df1 = DataFrame(
date_time = repeat(["2024-12-02T14:30:00", "2024-12-02T16:00:00","2024-12-02T17:30:00", "2024-12-02T19:00:00" ], outer=2),
ticker = repeat(["AAPL", "IBM"], inner=4),
open = rand(8),
high = rand(8),
low = rand(8),
close = rand(8),
vol = rand(8)*10^5
)
Unstack doesnโt work for multiple columns. So you will need something with a loop, unfortunately.
julia> reduce(gd; init = DataFrame(date_time = unique(df1.date_time))) do df_l, df_r
t_r = first(df_r.ticker)
df_r = select(df_r, Not(:ticker))
df_r = rename(df_r) do n
n == "date_time" && return n
n * "_" * t_r
end
leftjoin(df_l, df_r; on = :date_time)
end
4ร11 DataFrame
Row โ date_time open_AAPL high_AAPL low_AAPL close_AAPL vol_AAPL open_IBM high_IBM low_IBM close_IBM vol_IBM
โ String Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float64?
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 2024-12-02T14:30:00 0.503531 0.952858 0.542719 0.643992 82669.2 0.814595 0.872334 0.3809 0.227004 66136.0
2 โ 2024-12-02T16:00:00 0.943105 0.858138 0.715906 0.484657 91041.3 0.45135 0.522677 0.104943 0.315722 54555.6
3 โ 2024-12-02T17:30:00 0.190797 0.0412094 0.347572 0.539557 85581.2 0.0269947 0.250749 0.328845 0.482173 22247.9
4 โ 2024-12-02T19:00:00 0.706489 0.0853058 0.935489 0.470801 74114.2 0.238969 0.921508 0.187803 0.299688 54766.8
1 Like
Dan
December 3, 2024, 5:06pm
6
With the df1
in the OP. The following:
unstack(
select(
stack(df1, Not([:date_time, :ticker])),
:date_time,
[:ticker, :variable] => ByRow((x,y)->"$(x)_$(y)") => :col,
:value),
:date_time, :col, :value
)
gives:
4ร11 DataFrame
Row โ date_time AAPL_open IBM_open AAPL_high IBM_high AAPL_low IBM_low AAPL_close IBM_close โฏ
โ String Float64? Float64? Float64? Float64? Float64? Float64? Float64? Float64? โฏ
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 2024-12-02T14:30:00 0.941412 0.207643 0.879879 0.826008 0.775218 0.879673 0.486391 0.293913 โฏ
2 โ 2024-12-02T16:00:00 0.952736 0.763155 0.542305 0.937118 0.104578 0.26724 0.992486 0.920629
3 โ 2024-12-02T17:30:00 0.692002 0.959982 0.408699 0.587018 0.777649 0.207722 0.67562 0.57377
4 โ 2024-12-02T19:00:00 0.639045 0.913077 0.286222 0.274339 0.339372 0.637027 0.564277 0.371248
Combining stack
, unstack
and select
can really do most of the pivoting wizardry from Excel and friends.
5 Likes
nilshg
December 3, 2024, 5:09pm
7
And unstack
here would work like this:
reduce((x,y) -> leftjoin(x, y, on = :date_time),
unstack(df1[:, ["date_time", "ticker", y]], "ticker", y,
renamecols = x -> x*"_"*y) for y โ names(df1)[3:end])
2 Likes
Ugh this is the โgather-spreadโ pattern that dplyr recommends for this kind of thing. Hopefully we can make a better API for this soon.
1 Like
a similar approach (exactly reversed, in fact) for another problem
1 Like