I just transitted from Stata/Pandas to DataFrames.jl. I was wondering what the best practice is to deal with panel data in Julia.

One of the common manipulations for panel data is the lag/lead operator for each individual. At the first glance it seems we could use  transform(groupby(dt, :id), val => lag), as a simple extension of time series lag.

However often the time the data we use has some irregular missing, like:

dt = DataFrame(
id = [1,1,1,2,2,2],
t = [1,2,4, 1,2,4],
val = randn(6)
)


so id1 is missing at t = 3 and id2 is missing at time 2. Due to the different time step across rows, the simple groupby-lag approach doesn’t work. We need to fill the gap first (like tsfill command in state).

Still, the split-apply-combine strategy is not very efficient here. If we sort the data by id-t and fill the gap, the lag/lead is monotone across individuals except the first/last observations for each individual, so we could shift the whole vector once and remove the first/last observations correspondingly. This leads me to the following routine to manage panel data:

1. First fill the gap in the data using groupby-combine strategy:
function paneltfill(dt, id, t, Δt)
sort!(dt, [id, t])
allowmissing!(dt)
dtg = groupby(dt, id)
dt = vcat([_sdf_tfill(sdf, id, t, Δt) for sdf in dtg]...)
end

function _sdf_tfill(sdf, id, t, Δt)
sdf = copy(sdf)
mint, maxt = minimum(sdf[t]), maximum(sdf[t])
fullindex = [mint:Δt:maxt;]
idval = sdf[id]
for row in 1:length(fullindex)
if sdf[row, t] != fullindex[row]
insert!.(eachcol(sdf), row, missing)
sdf[row, t] = fullindex[row]
sdf[row, id] = idval
end
end
return sdf
end



Then construct columns _index and _index_max to keep track of the order of obs. within each individual. We need new columns other than t as oftentimes individuals enter and exit the panel data at different time, so we need a normalized column with the first observation being 1 and the last one recorded in another column.

function setpanelindex!(dt, id, t; indexvar = :_index)
maxindexvar = Symbol(string(indexvar) * "_max")
sort!(dt, [id, t])
@pipe groupby(dt, id) |> transform!(_, t => (x->[1:length(x);]) => indexvar)
@pipe groupby(dt, id) |> transform!(_, t => length => maxindexvar)
return dt
end


Now we could do lag/lead/diffs efficiently using the index columns:


function panelshift(dt, x, id, t; Δ = 1, reindex = false, indexvar = :_index)
maxindexvar = Symbol(string(indexvar) * "_max")
if reindex || indexvar ∉ propertynames(dt)
setpanelindex!(dt, id, t, indexvar = indexvar)
end
newx = Vector(lag(dt[x], Δ))
if Δ>0
newx[dt[indexvar].<=Δ] .= missing
else
newx[dt[indexvar].> dt[maxindexvar] .+ Δ] .= missing
end
return newx
end

function paneldiff(dt, x, id, t; Δ = 1, reindex = false, indexvar = :_index)
Lx = panelshift(dt, x, id, t, Δ = Δ, reindex = reindex, indexvar = indexvar)
return dt[x] - Lx
end



This approach works very efficient, as compared to reindex and shift in pandas. It feels so good not being restricted by the provided APIs but can do anything without worrying about performance The drawbacks of this approach are also very clear, but hopefully not that restrictive: I cannot drop observations that would create gaps in the time series, so if I want to exclude some observations I have to replace data with missing or use a flag column; I also have to stick to the sorting given by id-t pairs, and anytime I change the timespan of the sample I have to reset indexes. This seems to be a potential pitfall for bugs when the project gets larger.

As I’m new to the DataFrames.jl framework, the code above might not be very idiomatic or optimized, so I welcome any comments/critiques on the implementation. Any high-level discussions on the workflow of dealing panel data are also highly appreciated!

2 Likes

A simpler approach is to define a lag function that respects time gaps, and then use transform to apply this function within groups. Comment this PR if you are interested: https://github.com/JuliaArrays/ShiftedArrays.jl/pull/37

1 Like

Are you aware of TimeSeries.jl? They have lag and lead operators. The time series object they define also implement the Tables.jl api in case you need to pass it to other functions in the ecosystem.

Thanks! Yes that would be ideal. Great to know it was already under consideration!

Thanks. AFAIK it won’t work for panel data. Vectors in panel data generally have duplicated timestamps for different individuals, which is not supported by TimeSeries.jl.

1 Like

I haven’t tried this personally, but you should be able to do a combine where you make your sub-dataframes a TimeArray, perform operations on them, then at the end it becomes a data frame again.

combine(groupby(df, :id)) do sdf
TA = TimeArray(sdf, timestamp = :date)
### operations
DataFrame(TA)
end


I don’t have experience with panel data. In fact, it is the first time I hear about it. Is it a table with more than one index? Perhaps you can create a package with this abstraction if it is useful in econometrics and other fields.

Yes it seems a workable idea. The overhead between type transformation might be too large, so probably we need another dedicated method to do the something. Thanks for suggestions!

Yeah basically a dataset indexed by both time and individual ids. It’s commonly used in econometrics. The major unique feature of panel data is the lag/lead operators discussed in this thread. Actually most of the manipulation on panel data can be done by groupby-combine strategy, but for the most efficiency probably one day indeed we need to write down a new data type for it. So far I enjoy using DataFrames a lot.

A lightweight lag function which takes in two arguments, one for time stamps and another for the values, would probably solve a lot of problems.

1 Like

@RangeFu @matthieu
I think it would be great for Julia to have better Lag/Lead functionality.
In this issue we discuss making lag work similar to STATA:
In Stata: reg y F(-2 0 2).D estimates: y_{it} =\gamma_0 + \gamma_{-2}D_{it-2} + \gamma_{0}D_{it} + \gamma_{2}D_{it+2} +\varepsilon_{it}

It would be great if this was available to the entire Julia ecosystem.

Totally. and that’s exactly the PR @matthieu mentioned above!

While I still like the efficiency of my current method, I agree that a time-aware lag function is the way to go, and Matthieu has a prototype in this PR, so comment there if interested @Albert_Zevelev

3 Likes

Is there any recent development or workaround? I’d like to use the excellent FixedEffectModels package, but my work requires extensive data cleaning, including lagging variables from a panel. It’s the only part that I haven’t managed to find or to make myself. I’d hate to have to throw away my file and rewrite everything in R for such a simple task.

What is wrong with the workarounds in this thread?

Both loops and combine-groupby are far too slow, and (to me) cumbersome to modify. I wanted to ask, since I couldn’t find it, whether there is a function that lags by ID. For the moment I am using RCall to bypass the problem.

Seems unlikely to me that loops are far too slow, but probably worth starting a new thread with an MWE that demonstrates the problem.

1 Like

BTW Your version of DataFrames is old.
The following way you can fill the gaps in a simple way (not sure if it is needed in the first place)

dt = DataFrame(
id = [1,1,1,2,2,2],
t = [1,2,4, 1,2,4],
val = randn(6)
)

fillgap(x; ∇ = 1) = minimum(x):∇:maximum(x)

left_dt = combine(groupby(dt, 1), :t=>fillgap=>:t)

dt_filled = leftjoin(left_dt, dt, on = [:id, :t])

sort!(dt_filled, 1:2)

1 Like

The method in the original post should be pretty efficient compared to loop or groupby (after you fill the gap).

1 Like

After using it for a while I found the solution in the original post really restrictive. Being not able to throw away any data to preserve the time continuity is really cumbersome.

So below is my second attempt to address this issue properly. It supports lags wrt a time index, allows for gaps, and only requires the data to be sorted by time.

I tested in limited cases and it works fine. Feedbacks and suggestions are more than welcomed.

1 Like