Lag/lead in panel data

RangeFu · October 11, 2020, 3:56am

I just transitted from Stata/Pandas to DataFrames.jl. I was wondering what the best practice is to deal with panel data in Julia.

One of the common manipulations for panel data is the lag/lead operator for each individual. At the first glance it seems we could use transform(groupby(dt, :id), val => lag), as a simple extension of time series lag.

However often the time the data we use has some irregular missing, like:

dt = DataFrame(
    id = [1,1,1,2,2,2],                 
    t = [1,2,4, 1,2,4],
    val = randn(6)
)

so id1 is missing at t = 3 and id2 is missing at time 2. Due to the different time step across rows, the simple groupby-lag approach doesn’t work. We need to fill the gap first (like tsfill command in state).

Still, the split-apply-combine strategy is not very efficient here. If we sort the data by id-t and fill the gap, the lag/lead is monotone across individuals except the first/last observations for each individual, so we could shift the whole vector once and remove the first/last observations correspondingly. This leads me to the following routine to manage panel data:

First fill the gap in the data using groupby-combine strategy:

function paneltfill(dt, id, t, Δt)
    sort!(dt, [id, t])
    allowmissing!(dt)
    dtg = groupby(dt, id)
    dt = vcat([_sdf_tfill(sdf, id, t, Δt) for sdf in dtg]...)
end

function _sdf_tfill(sdf, id, t, Δt)
    sdf = copy(sdf)
    mint, maxt = minimum(sdf[t]), maximum(sdf[t])
    fullindex = [mint:Δt:maxt;]
    idval = sdf[id][1]
    for row in 1:length(fullindex)
        if sdf[row, t] != fullindex[row]
            insert!.(eachcol(sdf), row, missing)
            sdf[row, t] = fullindex[row]
            sdf[row, id] = idval
        end
    end
    return sdf
end

Then construct columns _index and _index_max to keep track of the order of obs. within each individual. We need new columns other than t as oftentimes individuals enter and exit the panel data at different time, so we need a normalized column with the first observation being 1 and the last one recorded in another column.

function setpanelindex!(dt, id, t; indexvar = :_index)
    maxindexvar = Symbol(string(indexvar) * "_max")
    sort!(dt, [id, t])
    @pipe groupby(dt, id) |> transform!(_, t => (x->[1:length(x);]) => indexvar)
    @pipe groupby(dt, id) |> transform!(_, t => length => maxindexvar)
    return dt
end

Now we could do lag/lead/diffs efficiently using the index columns:


function panelshift(dt, x, id, t; Δ = 1, reindex = false, indexvar = :_index)
    maxindexvar = Symbol(string(indexvar) * "_max")
    if reindex || indexvar ∉ propertynames(dt)
        setpanelindex!(dt, id, t, indexvar = indexvar)
    end
    newx = Vector(lag(dt[x], Δ))
    if Δ>0
        newx[dt[indexvar].<=Δ] .= missing
    else
        newx[dt[indexvar].> dt[maxindexvar] .+ Δ] .= missing
    end
    return newx
end

function paneldiff(dt, x, id, t; Δ = 1, reindex = false, indexvar = :_index)
    Lx = panelshift(dt, x, id, t, Δ = Δ, reindex = reindex, indexvar = indexvar)
    return dt[x] - Lx
end

This approach works very efficient, as compared to reindex and shift in pandas. It feels so good not being restricted by the provided APIs but can do anything without worrying about performance

The drawbacks of this approach are also very clear, but hopefully not that restrictive: I cannot drop observations that would create gaps in the time series, so if I want to exclude some observations I have to replace data with missing or use a flag column; I also have to stick to the sorting given by id-t pairs, and anytime I change the timespan of the sample I have to reset indexes. This seems to be a potential pitfall for bugs when the project gets larger.

As I’m new to the DataFrames.jl framework, the code above might not be very idiomatic or optimized, so I welcome any comments/critiques on the implementation. Any high-level discussions on the workflow of dealing panel data are also highly appreciated!

matthieu · October 11, 2020, 5:03am

A simpler approach is to define a lag function that respects time gaps, and then use transform to apply this function within groups. Comment this PR if you are interested: https://github.com/JuliaArrays/ShiftedArrays.jl/pull/37

juliohm · October 11, 2020, 11:21am

Are you aware of TimeSeries.jl? They have lag and lead operators. The time series object they define also implement the Tables.jl api in case you need to pass it to other functions in the ecosystem.

RangeFu · October 11, 2020, 5:26pm

Thanks! Yes that would be ideal. Great to know it was already under consideration!

RangeFu · October 11, 2020, 5:31pm

Thanks. AFAIK it won’t work for panel data. Vectors in panel data generally have duplicated timestamps for different individuals, which is not supported by TimeSeries.jl.

pdeffebach · October 11, 2020, 5:46pm

I haven’t tried this personally, but you should be able to do a combine where you make your sub-dataframes a TimeArray, perform operations on them, then at the end it becomes a data frame again.

combine(groupby(df, :id)) do sdf
    TA = TimeArray(sdf, timestamp = :date)
    ### operations
    DataFrame(TA)
end

juliohm · October 11, 2020, 11:26pm

I don’t have experience with panel data. In fact, it is the first time I hear about it. Is it a table with more than one index? Perhaps you can create a package with this abstraction if it is useful in econometrics and other fields.

RangeFu · October 12, 2020, 4:27am

Yes it seems a workable idea. The overhead between type transformation might be too large, so probably we need another dedicated method to do the something. Thanks for suggestions!

RangeFu · October 12, 2020, 4:32am

Yeah basically a dataset indexed by both time and individual ids. It’s commonly used in econometrics. The major unique feature of panel data is the lag/lead operators discussed in this thread. Actually most of the manipulation on panel data can be done by groupby-combine strategy, but for the most efficiency probably one day indeed we need to write down a new data type for it. So far I enjoy using DataFrames a lot.

pdeffebach · October 12, 2020, 1:25pm

A lightweight lag function which takes in two arguments, one for time stamps and another for the values, would probably solve a lot of problems.

Albert_Zevelev · October 12, 2020, 4:03pm

@RangeFu @matthieu
I think it would be great for Julia to have better Lag/Lead functionality.
In this issue we discuss making lag work similar to STATA:
In Stata: reg y F(-2 0 2).D estimates: y_{it} =\gamma_0 + \gamma_{-2}D_{it-2} + \gamma_{0}D_{it} + \gamma_{2}D_{it+2} +\varepsilon_{it}

It would be great if this was available to the entire Julia ecosystem.

RangeFu · October 12, 2020, 4:34pm

Totally. and that’s exactly the PR @matthieu mentioned above!

While I still like the efficiency of my current method, I agree that a time-aware lag function is the way to go, and Matthieu has a prototype in this PR, so comment there if interested @Albert_Zevelev

Bruno322 · October 18, 2021, 12:13am

Is there any recent development or workaround? I’d like to use the excellent FixedEffectModels package, but my work requires extensive data cleaning, including lagging variables from a panel. It’s the only part that I haven’t managed to find or to make myself. I’d hate to have to throw away my file and rewrite everything in R for such a simple task.

nilshg · October 18, 2021, 6:44am

What is wrong with the workarounds in this thread?

Bruno322 · October 18, 2021, 8:23am

Both loops and combine-groupby are far too slow, and (to me) cumbersome to modify. I wanted to ask, since I couldn’t find it, whether there is a function that lags by ID. For the moment I am using RCall to bypass the problem.

nilshg · October 18, 2021, 9:24am

Seems unlikely to me that loops are far too slow, but probably worth starting a new thread with an MWE that demonstrates the problem.

DataFrames · October 19, 2021, 3:35am

BTW Your version of DataFrames is old.
The following way you can fill the gaps in a simple way (not sure if it is needed in the first place)

dt = DataFrame(
    id = [1,1,1,2,2,2],                 
    t = [1,2,4, 1,2,4],
    val = randn(6)
)

fillgap(x; ∇ = 1) = minimum(x):∇:maximum(x)

left_dt = combine(groupby(dt, 1), :t=>fillgap=>:t)

dt_filled = leftjoin(left_dt, dt, on = [:id, :t])

sort!(dt_filled, 1:2)

RangeFu · October 19, 2021, 3:47am

The method in the original post should be pretty efficient compared to loop or groupby (after you fill the gap).

RangeFu · March 22, 2022, 11:34pm

After using it for a while I found the solution in the original post really restrictive. Being not able to throw away any data to preserve the time continuity is really cumbersome.

So below is my second attempt to address this issue properly. It supports lags wrt a time index, allows for gaps, and only requires the data to be sorted by time.

I tested in limited cases and it works fine. Feedbacks and suggestions are more than welcomed.

https://github.com/FuZhiyu/PanelShift.jl

Topic		Replies	Views
Manipulating Panel Data in Julia General Usage	6	992	September 13, 2022
Not able to use lag and lead function New to Julia package , dataframes	1	662	December 2, 2022
[ANN] PanelShift.jl: Time-aware lags and leads in panel data Package Announcements economics , data , dataframes	0	289	January 21, 2023
Create lead and lag variable in DataFrame General Usage question	14	8071	October 22, 2019
Lag vector by group using another vector as the grouping variable General Usage question	4	336	February 20, 2022

Lag/lead in panel data

Related topics