This discussion is about how best to implement a context for tabular data that generalizes across cross-sectional and temporal dimensions. For example,
subject  period  value
      1       1      0
      1       2      3
      1       3      2
      2       1      2
      2       3      4
We would like to first discover and profile the current tools in the ecosystem, discuss alternatives found in other implementations, and discuss proposed designs in the context of Tables.jl + StatsModels.jl.
Ideally, this would become a JSoC proposal.
A few people from Slack who have been interested in these topics: @bkamins, @nalimilan, @oxinabox, @dave.f.kleinschmidt, @pdeffebach.
I work quite a bit with panel data with an irregular time structure, of the kind
id  time  value
 1   0.7     19
 1   0.9     22
 2   0.4      7
where actual observation times may not even coincide across individuals. I find the current ecosystem fine (split-apply-combine, summary stats for likelihood-based models), so I am curious what you are missing.
I am aware that, e.g., R has a lot of packages that someone wrote for a special case of the above for some field-specific application. But I find the Julia approach of modular, composable building blocks much more powerful.
For simplicity, I will assume I am working with in-memory data using DataFrames.jl, though ideally the API would support any Tables.jl source (columntable(tbl)).
Many operations would be defined at the level of the panel or of the cross-sectional subdataframes. It would be nice to handle things such as detecting the frequency and filling in gaps: identify the frequency by querying each subdataframe, then apply a gap fill to each subdataframe. The temporal aspect should support TimeType and have smart frequency detection. For example, given a column of monthly dates it should identify the frequency as 1 month.
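As a rough sketch of what this could look like, assuming in-memory data with DataFrames.jl and Dates: detect_frequency and fill_gaps are hypothetical helpers, not an existing API, and the detection below only handles month-aligned dates.

```julia
using DataFrames, Dates

df = DataFrame(id    = [1, 1, 1, 2, 2],
               time  = [Date(2020, 1, 1), Date(2020, 2, 1), Date(2020, 4, 1),
                        Date(2020, 1, 1), Date(2020, 3, 1)],
               value = [0, 3, 2, 2, 4])

# Hypothetical helper: infer the sampling frequency of one subject as the
# smallest month gap between consecutive observations (monthly data only).
function detect_frequency(times::AbstractVector{Date})
    t = sort(times)
    gaps = [12 * (year(b) - year(a)) + (month(b) - month(a))
            for (a, b) in zip(t[1:end-1], t[2:end])]
    return Month(minimum(gaps))
end

# Hypothetical helper: reindex one subject to the full time grid, so that
# skipped periods show up as rows with `missing` values.
function fill_gaps(sub::AbstractDataFrame, freq)
    grid = DataFrame(time = minimum(sub.time):freq:maximum(sub.time))
    return leftjoin(grid, select(sub, Not(:id)), on = :time)
end

combine(groupby(df, :id)) do sub
    fill_gaps(sub, detect_frequency(sub.time))   # frequency detected as Month(1)
end
```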
Lags, leads, diffs, reductions, and summaries should also respect the panel structure. These may introduce missing values, which then have to be dealt with when generating model arrays.
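As a minimal illustration, assuming DataFrames.jl and ShiftedArrays.jl, a within-group lag on the example table above puts missing at the start of each subject's series; note that a plain row shift also ignores the gap at subject 2, period 2, which is where a time-aware lag would differ.

```julia
using DataFrames
using ShiftedArrays: lag

df = DataFrame(subject = [1, 1, 1, 2, 2],
               period  = [1, 2, 3, 1, 3],
               value   = [0, 3, 2, 2, 4])

# Lag within each subject: the first observation of every subject gets `missing`.
transform(groupby(df, :subject), :value => lag => :value_lag)
```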
A current issue is that one can probably implement the needed functionality for DataFrames, but not for generic Tables.jl sources (columntable), which makes it less general and harder to develop support for the framework.
StatsModels currently assumes transformations produce no missing values, so a lot of the code ends up being bookkeeping.
On the table side of things, the question is what kind of Query-like operations tabular data should support: for example, groupby, split/apply/combine, etc. DataFrames has put a lot of work into implementing a nice interface, but that’s not the case for a generic table API.
From a statistics perspective, I don’t think there is a single, unified way of dealing with missing values in models with a nontrivial spatial/temporal structure when MAR is not satisfied, so maybe StatsModels is doing the right thing, in the sense that the user should just drop rows with missing values when MAR is assumed, and otherwise do something custom anyway?
What I like about Stata's approach is that it handles the panel structure at the data level, not at the model level, which would correspond to handling this at the Tables level and not at the StatsModels level.
In Stata, once you have run xtset person_id time, time operations just work everywhere. L.x is just a new variable containing lagged x for person i when it exists, and missing when we don’t have an observation of that person at the lagged time. This makes it really easy to inspect the data, which is important for finding the appropriate model. For example, to visualize how changes in hours correspond to changes in wages, you just do scatter D.wages D.hours. Or, for longer differences, like 4 years instead of 1 year, it’s just scatter S4.wages S4.hours. It’s very flexible.
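For comparison, here is a rough sketch of what a time-aware lag with these semantics might look like in Julia, assuming DataFrames.jl and integer times; tlag is a hypothetical helper, not an existing API. Unlike a plain row shift, it looks up the observation at time - n for the same id, so the lag is missing whenever that observation does not exist.

```julia
using DataFrames

# Hypothetical time-aware lag: join each (id, time) row with the row at
# (id, time - n), mirroring Stata's L.x rule on an xtset panel.
function tlag(df::AbstractDataFrame, var::Symbol; id = :id, time = :time, n = 1)
    lagged = select(df, id, time => ByRow(t -> t + n) => time,
                    var => Symbol(var, "_lag", n))
    return sort!(leftjoin(df, lagged, on = [id, time]), [id, time])
end

df = DataFrame(id = [1, 1, 1, 2, 2], time = [1, 2, 3, 1, 3],
               wages = [10.0, 11.0, 13.0, 20.0, 24.0])

tlag(df, :wages)   # (id = 2, time = 3) gets `missing`, since (id = 2, time = 2)
                   # is not observed; a D.wages analogue is then wages .- wages_lag1
```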
This doesn’t work very well for Tamas’s situation where the observation times don’t line up nicely.
R’s plm also defines a panel data frame type (pdata.frame) which passes that context to operations. If I’m not mistaken, SAS uses a similar approach.
The idea is to pass that context to StatsModels in order to have context-aware operations.
I’ve long been interested in this, but over the past few months have probably come around to something more akin to what I think Tamas is saying: the key operations day-to-day for me are lags/leads and differences (which are of course relatively simple once you have lags).
The combination of by with ShiftedArrays has really grown on me, i.e. by(df, :group_id, :var1 => lag) (or Base.Fix2(lag, n) for n-period lags), and most other things to me seem to be relatively unique use cases where writing slightly more verbose but clear code imho often trumps some “magic” function that happens to work for that special case.
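For reference, a sketch of the same idiom spelled with the current DataFrames.jl API (by has since been folded into combine/groupby/transform); lag comes from ShiftedArrays.jl, and the column names are just illustrative.

```julia
using DataFrames
using ShiftedArrays: lag

df = DataFrame(group_id = [1, 1, 1, 2, 2], var1 = [0.7, 0.9, 1.3, 0.4, 0.6])

# One-period and two-period lags within each group, plus a first difference.
transform(groupby(df, :group_id),
          :var1 => lag => :var1_lag1,
          :var1 => Base.Fix2(lag, 2) => :var1_lag2,
          :var1 => (x -> x .- lag(x)) => :var1_diff)
```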