Summing (StatsModels) terms

I know that I can do

ts = term(:x)
spec = ts ~ term(1) + sum(lag(ts, i) for i in 1:n)

which just puts n lags of this variable separately on the right hand side of the equation.

How can I combine them, so that sum(lag(ts, i) for i in 1:n) is literally the sum of those items (becoming a single variable, rather than n)?

Interesting questions, don’t have the technical means to investigate at my disposal currently, but if it works it would probably work via identity, see: Modeling tabular data · StatsModels.jl

1 Like

Okay I had a quick look but unfortunately didn’t find anything that would make this straightforward. If I understand you correctly, you would want to do the equivalent of

df.x_sumlag = df.x .+ lag(df.x) .+ lag(df.x, 2) .+ ...

end then use x_sumlag as a RHS variable in your formula?

I kind of think this is unusual enough to warrant writing out explicitly like above, but to see whether you can just construct the terms programmatically, I thought I’d have a look at how the @formula macro manages to do this. As I said above, I think the way to go is via identity, which allows you to apply functions to your predictors, so I checked:

julia> @macroexpand(@formula(y ~ identity(x + lag(x) + lag(x, 2))))
:(StatsModels.Term(:y) ~ StatsModels.capture_call(identity, ((x,)->identity(x + lag(x) + lag(x, 2))), (:x,), $(Expr(:copyast, :($(QuoteNode(:(identity(x + lag(x) + lag(x, 2)))))))), [StatsModels.Term(:x) + StatsModels.capture_call(lag, ((x,)->lag(x)), (:x,), $(Expr(:copyast, :($(QuoteNode(:(lag(x))))))), [StatsModels.Term(:x)]) + StatsModels.capture_call(lag, ((x,)->lag(x, 2)), (:x,), $(Expr(:copyast, :($(QuoteNode(:(lag(x, 2))))))), [StatsModels.Term(:x), StatsModels.ConstantTerm(2)])]))

err… not sure how helpful that is!? Maybe @dave.f.kleinschmidt has a better idea!

I think ideally you should be able to do

sumlags(x, n) = reduce(+, (lag(x, in) for in in 1:n))

and then

@formula(y ~ sumlags(x, 5)

this doesn’t work because sumlags is applied element-wise.

I think there should be a way to force vector-wise application of functions. That would solve this problem.

1 Like

The specific context is cumulative (log) growth rates, so not too unusual. It’s just that I’d like the flexibility of the specification to be in the formula. Constructing the columns ahead of time would be somewhat prohibitive in this setting.

I have tried things along the lines of what both of you have suggested (separate functions, use identity). I pretty much always get out a tuple of vectors rather than a single vector.

Right now I have a multi-step procedure that seems to work ok (took me some time to get there though). I pass the formula and data to modelcols, then I use sum(reduce(hcat, thosedangcols); dims=2) to collapse the problem piece, and then I pass the arrays to lm(X,y) (excluding the missing rows).

Is that really less work than using a sumlags function on the data frame before?

Not using a DataFrame! These are univariate predictive regressions and the data, which is simulated, comes in as a single vector.

If it’s very simple like that then your solution makes more sense.

I would still file an issue in StatsModels as I described above. lag is special-cased iirc. which makes things a bit unintuitive.

It looks like the implementation of lag is quite complicated though, here.

1 Like

The short answer is no, not currently. The reasons are a little arcane and I honestly still struggle to find the language to explain the difference, but here goes :slight_smile:

In StatsModels formula world, there are two domains that functions can operate on: they can take in terms, returning new terms, or they can operate on values. All FunctionTerms are of the second type: they take in data values and return new data values at the point when the model matrix is constructed (e.g., modelcols). Everything else is of the first type, including lag: it takes in a term, and returns a new term that wraps it. The reason this distinction is necessary is that, in general, many transformations are “stateful”: you can’t lag a term without having the whole column available, or at least without keeping track of some state (which hasn’t been implemented yet, FWIW…). Same thing for a categorical embedding (need to know the levels), using a spline basis function (need to know the knots), etc. FunctionTerm is a kind of special-case catch-all for anything that ISN’T stateful, and can be computed elementwise.

However, the real reason this isn’t possible currently has more to do with how FunctionTerms are implemented efficiently, which is that as soon as a non-special call is encountered in the AST (e.g., an identity wrapper), a single anonymous function is create that will evaluate the whole subtree. What that means is that any later, term-level calls like lag are “protected” from any special interpretation, since they’re now moved inside teh body of an anonymous function that will only be called during modelcols.

HOWEVER, https://github.com/JuliaStats/StatsModels.jl/pull/183 changes this, and one of the consequences is that the “protected”/“unprotected” distinction that was previously implicit can be explicitly controlled. For your case you’d be able to do something like ~ protect(unprotect(lag(x, 1)) + unprotect(lag(x, 2)) + unprotect(lag(x, 3)) + ...). The protect has the effect of “blocking” normal formula syntax, so + becomes literal + (applied elementwise to the columns returned by the inner terms). The unprotect is necessary to lift the function calls back to operating on the level of terms (instead of elements), since lag(t::Term, deg) is what’s happening internally to create a LagTerm.

That syntax is a bit clunky for this use case, and I’m honestly not sure it’d work with run-time creation. Another consequence of the overhaul is that you can actually construct FunctionTerms at run time, so you could MAYBE do something like FunctionTerm(+, [unprotect(lag(ts, i)) for i in 1:n]...).

In any case, could you comment on #183? It’s useful to have a record of when these sorts of features would be useful, and to stress-test the proposed implementation and syntax!

I’d say that the short-term solution would just be to create your own wrapper term type (like cumlag or something) which generates the LagTerms and has a special method for modelcols…I could mock something up for that later if it would be useful (feel free to ping), or you could refer to the poly example in the docs, or to how lag is handled.

3 Likes

The issue with providing a mechanism to “force” vector-wise application is that it makes row-wise Tables second-class citizens. Or it makes it easier for developers/users to make them second-class citizens, and that to me feels like a foot-gun…

Here’s a very quick implementation that should work on a recent release (but would need an update if that PR is merged…):

using StatsModels, ShiftedArrays

struct CumulativeLagTerm <: AbstractTerm
    var::Symbol
    deg::Int
end

CumulativeLagTerm(t::Term, deg::ConstantTerm) = CumulativeLagTerm(t.sym, deg.n)

function cum_lag end

StatsModels.apply_schema(t::FunctionTerm{typeof(cum_lag)}, sch::StatsModels.Schema, Mod::Type) =
    CumulativeLagTerm(t.args_parsed...)

StatsModels.modelcols(t::CumulativeLagTerm, d::NamedTuple) =
    sum(lag(getproperty(d, t.var), n) for n in 0:t.deg)

If you want more fanciness, you can replace the Symbol field with an AbstractTerm field, and then recursively call apply_schema and modelcols on that (before lag). And this will allocate multiple lagged copies during modelcols but could be easily adapted in place I think, and anyway I imagine the lagged copies are lazy/views.

And to see that it works:

julia> d = (y = rand(10), x = rand(10))

 (y = [0.6652339141519401, 0.8199830613408774, 0.3082719190712637, 0.5641384193416041, 0.9162429720895131, 0.5964769804332677, 0.16593915691237204, 0.7772815057098379, 0.6761020580661794, 0.5611322216599754], x = [0.8168175392602228, 0.07952938889286587, 0.5336111528611283, 0.1991051802567536, 0.8827738557801499, 0.1735290642427343, 0.8198276589487901, 0.43795516094329745, 0.40397516199161654, 0.844752403741194])

julia> f = @formula(y ~ 1 + cum_lag(x, 3));

julia> ff = apply_schema(f, schema(d))

 FormulaTerm
Response:
  y(continuous)
Predictors:
  1
  CumulativeLagTerm(:x, 3)

julia> modelcols(ff.rhs, d)

 10Ă—2 reshape(::Array{Union{Missing, Float64},2}, 10, 2) with eltype Union{Missing, Float64}:
 1.0   missing
 1.0   missing
 1.0   missing
 1.0  1.62906
 1.0  1.69502
 1.0  1.78902
 1.0  2.07524
 1.0  2.31409
 1.0  1.83529
 1.0  2.50651

julia> let x=d.x
           x .+ lag(x) .+ lag(x, 2) .+ lag(x, 3)
       end

 10-element Array{Union{Missing, Float64},1}:
  missing
  missing
  missing
 1.6290632612709706
 1.6950195777908976
 1.789019253140766
 2.075235759228428
 2.3140857399149715
 1.8352870461264383
 2.506510385624898
2 Likes