I just transitted from Stata/Pandas to DataFrames.jl. I was wondering what the best practice is to deal with panel data in Julia.
One of the common manipulations for panel data is the lag/lead operator for each individual. At the first glance it seems we could use transform(groupby(dt, :id), val => lag)
, as a simple extension of time series lag
.
However often the time the data we use has some irregular missing, like:
dt = DataFrame(
id = [1,1,1,2,2,2],
t = [1,2,4, 1,2,4],
val = randn(6)
)
so id1 is missing at t = 3 and id2 is missing at time 2. Due to the different time step across rows, the simple groupby-lag approach doesn’t work. We need to fill the gap first (like tsfill
command in state).
Still, the split-apply-combine strategy is not very efficient here. If we sort the data by id-t and fill the gap, the lag/lead is monotone across individuals except the first/last observations for each individual, so we could shift the whole vector once and remove the first/last observations correspondingly. This leads me to the following routine to manage panel data:
- First fill the gap in the data using groupby-combine strategy:
function paneltfill(dt, id, t, Δt)
sort!(dt, [id, t])
allowmissing!(dt)
dtg = groupby(dt, id)
dt = vcat([_sdf_tfill(sdf, id, t, Δt) for sdf in dtg]...)
end
function _sdf_tfill(sdf, id, t, Δt)
sdf = copy(sdf)
mint, maxt = minimum(sdf[t]), maximum(sdf[t])
fullindex = [mint:Δt:maxt;]
idval = sdf[id][1]
for row in 1:length(fullindex)
if sdf[row, t] != fullindex[row]
insert!.(eachcol(sdf), row, missing)
sdf[row, t] = fullindex[row]
sdf[row, id] = idval
end
end
return sdf
end
Then construct columns _index
and _index_max
to keep track of the order of obs. within each individual. We need new columns other than t
as oftentimes individuals enter and exit the panel data at different time, so we need a normalized column with the first observation being 1 and the last one recorded in another column.
function setpanelindex!(dt, id, t; indexvar = :_index)
maxindexvar = Symbol(string(indexvar) * "_max")
sort!(dt, [id, t])
@pipe groupby(dt, id) |> transform!(_, t => (x->[1:length(x);]) => indexvar)
@pipe groupby(dt, id) |> transform!(_, t => length => maxindexvar)
return dt
end
Now we could do lag/lead/diffs efficiently using the index columns:
function panelshift(dt, x, id, t; Δ = 1, reindex = false, indexvar = :_index)
maxindexvar = Symbol(string(indexvar) * "_max")
if reindex || indexvar ∉ propertynames(dt)
setpanelindex!(dt, id, t, indexvar = indexvar)
end
newx = Vector(lag(dt[x], Δ))
if Δ>0
newx[dt[indexvar].<=Δ] .= missing
else
newx[dt[indexvar].> dt[maxindexvar] .+ Δ] .= missing
end
return newx
end
function paneldiff(dt, x, id, t; Δ = 1, reindex = false, indexvar = :_index)
Lx = panelshift(dt, x, id, t, Δ = Δ, reindex = reindex, indexvar = indexvar)
return dt[x] - Lx
end
This approach works very efficient, as compared to reindex
and shift
in pandas. It feels so good not being restricted by the provided APIs but can do anything without worrying about performance
The drawbacks of this approach are also very clear, but hopefully not that restrictive: I cannot drop observations that would create gaps in the time series, so if I want to exclude some observations I have to replace data with missing or use a flag column; I also have to stick to the sorting given by id-t pairs, and anytime I change the timespan of the sample I have to reset indexes. This seems to be a potential pitfall for bugs when the project gets larger.
As I’m new to the DataFrames.jl framework, the code above might not be very idiomatic or optimized, so I welcome any comments/critiques on the implementation. Any high-level discussions on the workflow of dealing panel data are also highly appreciated!