Create lead and lag variables in a DataFrame

I am trying to create lag and lead variables in a DataFrame. In R and Python this is easily done with the lag, lead, and shift functions, but I have not managed to get it done in Julia.

My code is like this (does not work):

samplefine_call = @>begin
    samplefine_call
    @transform( price = blsprice.(:S, :K, :r, :T, :σ, :DIV) )
    @transform(RND = lead(:price,1) - lag(:price, 1))
end

I can do it like this, but then I need to merge the result back into my original DataFrame, which is not efficient:

RND = samplefine_call[3:end, :price] - samplefine_call[1:end-2, :price]
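One way to avoid the merge step is to pad the sliced result with missing at both ends, so it has one entry per row and can be assigned directly as a new column. This is only a minimal sketch using Base functions; the price values below are made up for illustration and stand in for the :price column.

```julia
# Illustrative prices; in the question this would be the :price column.
price = [1.0, 4.0, 9.0, 16.0, 25.0]

# Interior values: price[i+1] - price[i-1] for i = 2, ..., n-1.
core = price[3:end] .- price[1:end-2]

# Pad both ends so the result has one entry per row of the original table.
RND = vcat(missing, core, missing)
```

The padded vector has the same length as price, with missing in the first and last rows where no lead or lag exists.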

Take a look at https://github.com/JuliaEconometrics/EconUtils.jl/blob/master/src/firstdifference.jl.

The implementation I wrote is more general as it is meant to be used with panels and handle the nuance of steps (date-time, gaps, implied minimum step, etc.). I would recommend using diff if the case is simple enough.

Thanks for the reply. In my case (not the example I gave here), I need to calculate price[i+1] - 2*price[i] + price[i-1], which equals (price[i+1] - price[i]) - (price[i] - price[i-1]). That is a double difference, so I can use your function. However, I would still like to know whether window functions like lead and lag are available for DataFrames in Julia, as in dplyr in R, because that approach is more flexible.

https://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html
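For the simple (non-panel) case, the double difference described above can be sketched with Base's diff applied twice. The price values are made up for illustration, and the padding with missing is an assumption about how the result should align with the original rows.

```julia
price = [1.0, 4.0, 9.0, 16.0, 25.0]
n = length(price)

# diff(diff(price))[k] equals price[k+2] - 2*price[k+1] + price[k].
second = diff(diff(price))

# The same quantity written out directly, centered on rows 2, ..., n-1.
direct = [price[i+1] - 2 * price[i] + price[i-1] for i in 2:n-1]

# Pad so that entry i corresponds to row i of the original data.
centered = vcat(missing, second, missing)
```

Both computations give the same interior values; only the first and last rows are missing.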

See https://github.com/JuliaData/DataFrames.jl/issues/791.

It seems the lag and lead functions in TimeSeries.jl only work on time series arrays.

I think the lead and lag functions in the recent ShiftedArrays package should suit exactly this need. We should really include these features by default with DataFrames in one way or another.

Cc: @piever

Yep, you could do something like:

using ShiftedArrays
v = lead(price) .- 2 .* price .+ lag(price)

which avoids unnecessary allocations. I think it works best on Julia 0.7 due to recent improvements on missing data handling by @nalimilan, but if you don’t think you’ll encounter performance issues it should be usable already.

To integrate this functionality in DataFrames, I guess the easiest is to add ShiftedArrays as a DataFrames dependency (ShiftedArrays is a small pure Julia package) and reexport lead and lag.

This probably isn’t the advice you’re looking for in this particular instance, but, for what it’s worth, I thought I should point out that a huge advantage that Julia has over Python and R in cases like this is that pure Julia code is actually efficient. So, if you were to do, for example, [v[i] - v[i-1] for i ∈ 2:length(v)] what you get will more or less run like C code. Furthermore, since DataFrames columns are just AbstractArrays, you almost never need to worry about them having some sort of unorthodox behavior. In Julia the need for specialized or fancy code for data manipulation is much mitigated compared to Python and R. (This also means that, as much as I love DataFramesMeta.jl, it often simply isn’t necessary.)

I agree on the conceptual distinction between working with Julia (arrays are naturally fast, and a DataFrame is a collection of arrays of the same length) and Pandas (to do something efficiently you often need a specialized function). However, in this case there is a different concern: it is annoying for newcomers to do this without off-by-one mistakes. Here, for example, you get a column that is one element too short for the DataFrame. You should instead do:

[i==1 ? missing : v[i] - v[i-1] for i ∈ 1:length(v)]

or in the case of the question asked:

[i <= 2 ? missing : v[i] - 2*v[i-1] + v[i-2] for i ∈ 1:length(v)]

ShiftedArrays simply automates this procedure by returning missing whenever you are out of bounds (reasonably easy to implement in Julia thanks to the wonderful AbstractArray interface).
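To make the out-of-bounds idea concrete, here is a rough sketch in plain Julia. The names shifted, mylag, and mylead are hypothetical helpers invented for this illustration; they are not the ShiftedArrays implementation, and unlike that package they allocate a new vector for each shift.

```julia
# Hypothetical helper (not the ShiftedArrays implementation): entry i of the
# result is v[i - k], or `missing` when i - k falls outside the bounds of v.
shifted(v, k) = [checkbounds(Bool, v, i - k) ? v[i - k] : missing for i in eachindex(v)]

mylag(v, k = 1) = shifted(v, k)     # value k positions earlier
mylead(v, k = 1) = shifted(v, -k)   # value k positions later

# Made-up prices for illustration.
price = [1.0, 4.0, 9.0, 16.0, 25.0]

# Centered second difference; missing propagates through the arithmetic.
v = mylead(price) .- 2 .* price .+ mylag(price)
```

Because missing propagates through arithmetic, the result has the same length as price and no index bookkeeping is needed at the call site.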

Like I said, this is not necessarily the very best use case, but when doing these sorts of things myself I have found it very useful to keep in mind that you don’t have to do everything with specialized code. It’s a very liberating feeling.

Thanks for your reply. I tried to install this package, but could not get it done. Could you show how to use these functions?

Does the following not work for you?

julia> Pkg.add("ShiftedArrays") # run only once to install the package
julia> using ShiftedArrays

julia> price = rand(10);

julia> v = price .- lag(price)
10-element Array{Any,1}:
   missing  
 -0.00216921
 -0.0259962 
 -0.078662  
 -0.766035  
  0.535359  
 -0.535341  
 -0.0775602 
  0.483483  
 -0.438419

What error do you get?

To use it with DataFrames, make sure you have at least version 0.11, as it is the first one to support missing.

Thanks, it works now! I tried this morning but could not install the package.

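I found circshift(), which works great for this. It seems to be a part of 0.6.3. Note that it wraps around, so the first entry of the lagged column holds the last close rather than missing.

df[:closelag] = circshift(df[:close], 1)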

The tight notation of ShiftedArrays is what I really like here, though I love the speed of native Julia loops.

In my use case it's good to replicate the notation in the paper I'm working from.
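For anyone using circshift as above, the shift direction and the wrap-around are easy to get wrong, so here is a small sketch of the behavior. The vector closeprice is made up for illustration and stands in for the :close column.

```julia
closeprice = [1, 2, 3, 4]

# A positive shift moves values toward later positions (a lag);
# the last element wraps around to the front.
lagged_wrapped = circshift(closeprice, 1)

# A negative shift moves values toward earlier positions (a lead);
# the first element wraps around to the end.
lead_wrapped = circshift(closeprice, -1)

# Replace the wrapped-around entry with missing to get a true lag.
closelag = convert(Vector{Union{Missing, Int}}, lagged_wrapped)
closelag[1] = missing
```

Overwriting the wrapped entry keeps the column the same length as the data while avoiding a spurious value in the first row.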