# Summing (StatsModels) terms

I know that I can do

``````ts = term(:x)
spec = ts ~ term(1) + sum(lag(ts, i) for i in 1:n)
``````

which just puts `n` lags of this variable separately on the right hand side of the equation.

How can I combine them, so that `sum(lag(ts, i) for i in 1:n)` is literally the sum of those items (becoming a single variable, rather than `n`)?

Interesting questions, don’t have the technical means to investigate at my disposal currently, but if it works it would probably work via `identity`, see: https://juliastats.org/StatsModels.jl/dev/formula/#Julia-functions-in-a-@formula-1

1 Like

Okay I had a quick look but unfortunately didn’t find anything that would make this straightforward. If I understand you correctly, you would want to do the equivalent of

``````df.x_sumlag = df.x .+ lag(df.x) .+ lag(df.x, 2) .+ ...
``````

end then use `x_sumlag` as a RHS variable in your formula?

I kind of think this is unusual enough to warrant writing out explicitly like above, but to see whether you can just construct the terms programmatically, I thought I’d have a look at how the `@formula` macro manages to do this. As I said above, I think the way to go is via `identity`, which allows you to apply functions to your predictors, so I checked:

``````julia> @macroexpand(@formula(y ~ identity(x + lag(x) + lag(x, 2))))
:(StatsModels.Term(:y) ~ StatsModels.capture_call(identity, ((x,)->identity(x + lag(x) + lag(x, 2))), (:x,), \$(Expr(:copyast, :(\$(QuoteNode(:(identity(x + lag(x) + lag(x, 2)))))))), [StatsModels.Term(:x) + StatsModels.capture_call(lag, ((x,)->lag(x)), (:x,), \$(Expr(:copyast, :(\$(QuoteNode(:(lag(x))))))), [StatsModels.Term(:x)]) + StatsModels.capture_call(lag, ((x,)->lag(x, 2)), (:x,), \$(Expr(:copyast, :(\$(QuoteNode(:(lag(x, 2))))))), [StatsModels.Term(:x), StatsModels.ConstantTerm(2)])]))
``````

err… not sure how helpful that is!? Maybe @dave.f.kleinschmidt has a better idea!

I think ideally you should be able to do

``````sumlags(x, n) = reduce(+, (lag(x, in) for in in 1:n))
``````

and then

``````@formula(y ~ sumlags(x, 5)
``````

this doesn’t work because `sumlags` is applied element-wise.

I think there should be a way to force vector-wise application of functions. That would solve this problem.

1 Like

The specific context is cumulative (log) growth rates, so not too unusual. It’s just that I’d like the flexibility of the specification to be in the formula. Constructing the columns ahead of time would be somewhat prohibitive in this setting.

I have tried things along the lines of what both of you have suggested (separate functions, use `identity`). I pretty much always get out a tuple of vectors rather than a single vector.

Right now I have a multi-step procedure that seems to work ok (took me some time to get there though). I pass the formula and data to `modelcols`, then I use `sum(reduce(hcat, thosedangcols); dims=2)` to collapse the problem piece, and then I pass the arrays to `lm(X,y)` (excluding the missing rows).

Is that really less work than using a `sumlags` function on the data frame before?

Not using a DataFrame! These are univariate predictive regressions and the data, which is simulated, comes in as a single vector.

If it’s very simple like that then your solution makes more sense.

I would still file an issue in StatsModels as I described above. `lag` is special-cased iirc. which makes things a bit unintuitive.

It looks like the implementation of `lag` is quite complicated though, here.

1 Like

The short answer is no, not currently. The reasons are a little arcane and I honestly still struggle to find the language to explain the difference, but here goes In StatsModels formula world, there are two domains that functions can operate on: they can take in terms, returning new terms, or they can operate on values. All `FunctionTerms` are of the second type: they take in data values and return new data values at the point when the model matrix is constructed (e.g., `modelcols`). Everything else is of the first type, including `lag`: it takes in a term, and returns a new term that wraps it. The reason this distinction is necessary is that, in general, many transformations are “stateful”: you can’t lag a term without having the whole column available, or at least without keeping track of some state (which hasn’t been implemented yet, FWIW…). Same thing for a categorical embedding (need to know the levels), using a spline basis function (need to know the knots), etc. `FunctionTerm` is a kind of special-case catch-all for anything that ISN’T stateful, and can be computed elementwise.

However, the real reason this isn’t possible currently has more to do with how `FunctionTerm`s are implemented efficiently, which is that as soon as a non-special call is encountered in the AST (e.g., an `identity` wrapper), a single anonymous function is create that will evaluate the whole subtree. What that means is that any later, term-level calls like `lag` are “protected” from any special interpretation, since they’re now moved inside teh body of an anonymous function that will only be called during `modelcols`.

HOWEVER, https://github.com/JuliaStats/StatsModels.jl/pull/183 changes this, and one of the consequences is that the “protected”/“unprotected” distinction that was previously implicit can be explicitly controlled. For your case you’d be able to do something like ` ~ protect(unprotect(lag(x, 1)) + unprotect(lag(x, 2)) + unprotect(lag(x, 3)) + ...)`. The `protect` has the effect of “blocking” normal formula syntax, so `+` becomes literal `+` (applied elementwise to the columns returned by the inner terms). The `unprotect` is necessary to lift the function calls back to operating on the level of terms (instead of elements), since `lag(t::Term, deg)` is what’s happening internally to create a `LagTerm`.

That syntax is a bit clunky for this use case, and I’m honestly not sure it’d work with run-time creation. Another consequence of the overhaul is that you can actually construct `FunctionTerm`s at run time, so you could MAYBE do something like `FunctionTerm(+, [unprotect(lag(ts, i)) for i in 1:n]...)`.

In any case, could you comment on #183? It’s useful to have a record of when these sorts of features would be useful, and to stress-test the proposed implementation and syntax!

I’d say that the short-term solution would just be to create your own wrapper term type (like `cumlag` or something) which generates the `LagTerm`s and has a special method for `modelcols`…I could mock something up for that later if it would be useful (feel free to ping), or you could refer to the `poly` example in the docs, or to how `lag` is handled.

3 Likes

The issue with providing a mechanism to “force” vector-wise application is that it makes row-wise Tables second-class citizens. Or it makes it easier for developers/users to make them second-class citizens, and that to me feels like a foot-gun…

Here’s a very quick implementation that should work on a recent release (but would need an update if that PR is merged…):

``````using StatsModels, ShiftedArrays

struct CumulativeLagTerm <: AbstractTerm
var::Symbol
deg::Int
end

CumulativeLagTerm(t::Term, deg::ConstantTerm) = CumulativeLagTerm(t.sym, deg.n)

function cum_lag end

StatsModels.apply_schema(t::FunctionTerm{typeof(cum_lag)}, sch::StatsModels.Schema, Mod::Type) =
CumulativeLagTerm(t.args_parsed...)

StatsModels.modelcols(t::CumulativeLagTerm, d::NamedTuple) =
sum(lag(getproperty(d, t.var), n) for n in 0:t.deg)
``````

If you want more fanciness, you can replace the `Symbol` field with an `AbstractTerm` field, and then recursively call `apply_schema` and `modelcols` on that (before `lag`). And this will allocate multiple lagged copies during `modelcols` but could be easily adapted in place I think, and anyway I imagine the lagged copies are lazy/views.

And to see that it works:

``````julia> d = (y = rand(10), x = rand(10))

(y = [0.6652339141519401, 0.8199830613408774, 0.3082719190712637, 0.5641384193416041, 0.9162429720895131, 0.5964769804332677, 0.16593915691237204, 0.7772815057098379, 0.6761020580661794, 0.5611322216599754], x = [0.8168175392602228, 0.07952938889286587, 0.5336111528611283, 0.1991051802567536, 0.8827738557801499, 0.1735290642427343, 0.8198276589487901, 0.43795516094329745, 0.40397516199161654, 0.844752403741194])

julia> f = @formula(y ~ 1 + cum_lag(x, 3));

julia> ff = apply_schema(f, schema(d))

FormulaTerm
Response:
y(continuous)
Predictors:
1
CumulativeLagTerm(:x, 3)

julia> modelcols(ff.rhs, d)

10×2 reshape(::Array{Union{Missing, Float64},2}, 10, 2) with eltype Union{Missing, Float64}:
1.0   missing
1.0   missing
1.0   missing
1.0  1.62906
1.0  1.69502
1.0  1.78902
1.0  2.07524
1.0  2.31409
1.0  1.83529
1.0  2.50651

julia> let x=d.x
x .+ lag(x) .+ lag(x, 2) .+ lag(x, 3)
end

10-element Array{Union{Missing, Float64},1}:
missing
missing
missing
1.6290632612709706
1.6950195777908976
1.789019253140766
2.075235759228428
2.3140857399149715
1.8352870461264383
2.506510385624898
``````
2 Likes