Lag vector by group using another vector as the grouping variable

rubaiyat · February 19, 2022, 5:37pm

Hello everyone,

I was wondering whether it is possible to lag a vector by group when the grouping variable is a separate vector.

Here’s a minimal working example:

firm_id = ["A", "A", "A", "A", "B", "B", "B", "B"] 
revenue = [100, 200, 300, 400, 50, 67, 75, 90]
year = [2001,2002,2003,2004,2001,2002,2003,2004]

where you can think of revenue as panel data, with one series per firm.

I’d like to lag the revenue variable by firm_id, to get the following result:

revenue_lag = [missing, 100, 200, 300, missing, 50, 67, 75]

In my project I have to perform this operation repeatedly for different values of the parameter I’m optimizing over.

I realize that this can be done by putting firm_id and revenue into a DataFrame and then lagging it, but I was wondering whether it is possible to do so without creating a DataFrame each time. My (uninformed) guess is that creating a DataFrame every time just to lag is more time-consuming; please correct me if I’m wrong about this.

I saw that the ShiftedArrays package has a lag() function but couldn’t figure out how it could be applied by a grouping vector.

Thanks!

tbeason · February 19, 2022, 5:59pm

No, I would definitely use DataFrames and ShiftedArrays. Probably with Chain or another similar piping package.

nilshg · February 19, 2022, 6:12pm

Agree, just make sure you don’t copy everything when you create the DataFrame:

julia> using DataFrames, BenchmarkTools

julia> firm_id = rand(["A", "B"], 100_000); revenue = rand(Int, 100_000); year = rand(2001:2004, 100_000);

julia> @btime DataFrame(firm_id = $firm_id, revenue = $revenue, year = $year);
  122.900 μs (34 allocations: 2.29 MiB)

julia> @btime DataFrame(firm_id = $firm_id, revenue = $revenue, year = $year; copycols = false);
  2.140 μs (27 allocations: 1.80 KiB)
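
To make the effect of copycols = false concrete, here is a small sketch (the names df_copy and df_alias are made up for illustration). With copycols = false the DataFrame reuses the vector it is given instead of copying it, so later in-place changes to that vector are visible through the DataFrame as well:

```julia
using DataFrames

revenue = [100, 200, 300, 400]

df_copy  = DataFrame(revenue = revenue)                    # default: the column is a copy
df_alias = DataFrame(revenue = revenue; copycols = false)  # the column is the same vector

df_copy.revenue === revenue    # false
df_alias.revenue === revenue   # true

revenue[1] = -1                # mutate the original vector in place
df_copy.revenue[1]             # 100, the copy is unaffected
df_alias.revenue[1]            # -1, the aliased column sees the change
```

Skipping the copy is what makes the second benchmark so cheap. It also means the DataFrame and the original vectors share memory, which is worth keeping in mind if the vectors are mutated between calls.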

rubaiyat · February 20, 2022, 1:41am

Ah that’s a nice trick, thanks!

George9000 · February 20, 2022, 1:46am

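using DataFrames, Chain
# Explicit import: works whether or not the installed ShiftedArrays version exports `lag`
using ShiftedArrays: lag

firm_id = ["A", "A", "A", "A", "B", "B", "B", "B"]
revenue = [100, 200, 300, 400, 50, 67, 75, 90]
year = [2001,2002,2003,2004,2001,2002,2003,2004]

# copycols = false reuses the vectors above instead of copying them
df = DataFrame(;firm_id, revenue, year, copycols = false)

# Rows are assumed to be sorted by year within each firm
@chain df begin
    groupby(:firm_id)
    transform(:revenue => lag => :revenue_lag, ungroup = false)
    transform([:revenue, :revenue_lag] => ByRow((x, y) -> x - y) => :diff_prior_year, ungroup = false)
end

## A named function avoids compiling a fresh anonymous function each time this
## block is re-run at top level. Here the difference is just `-`; a name other
## than `diff` is used so that Base.diff is not shadowed.

revenue_change(x, y) = x - y

@chain df begin
    groupby(:firm_id)
    transform(:revenue => lag => :revenue_lag, ungroup = false)
    transform([:revenue, :revenue_lag] => ByRow(revenue_change) => :diff_prior_year, ungroup = false)
end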
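
For the original question of lagging without a DataFrame, a plain-Julia loop is one possible alternative. The sketch below is only an illustration: lag_by_group is a hypothetical helper, not a function from ShiftedArrays or DataFrames. It assumes ordinary 1-based vectors whose rows are already in time order within each group; the groups themselves do not have to be contiguous.

```julia
# Hypothetical helper (not from any package): lag `values` by one row within
# each group defined by `groups`. Assumes 1-based vectors of equal length,
# with rows already in time order inside every group.
function lag_by_group(values::AbstractVector{T}, groups::AbstractVector) where {T}
    length(values) == length(groups) ||
        throw(DimensionMismatch("values and groups must have the same length"))
    out = Vector{Union{Missing, T}}(missing, length(values))
    last_seen = Dict{eltype(groups), Int}()  # group key => index of its latest row
    for i in eachindex(values, groups)
        g = groups[i]
        j = get(last_seen, g, 0)             # 0 means the group has not appeared yet
        j != 0 && (out[i] = values[j])       # copy the previous value of the same group
        last_seen[g] = i
    end
    return out
end

firm_id = ["A", "A", "A", "A", "B", "B", "B", "B"]
revenue = [100, 200, 300, 400, 50, 67, 75, 90]

revenue_lag = lag_by_group(revenue, firm_id)
# revenue_lag is [missing, 100, 200, 300, missing, 50, 67, 75]
```

This makes a single pass over the data and allocates only the output vector and a small dictionary. Whether it actually beats the DataFrames version with copycols = false would have to be checked with @btime on the real data.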
