# Summing (StatsModels) terms

I know that I can do

``````ts = term(:x)
spec = ts ~ term(1) + sum(lag(ts, i) for i in 1:n)
``````

which just puts `n` lags of this variable separately on the right hand side of the equation.

How can I combine them, so that `sum(lag(ts, i) for i in 1:n)` is literally the sum of those items (becoming a single variable, rather than `n`)?

Interesting questions, donâ€™t have the technical means to investigate at my disposal currently, but if it works it would probably work via `identity`, see: https://juliastats.org/StatsModels.jl/dev/formula/#Julia-functions-in-a-@formula-1

1 Like

Okay I had a quick look but unfortunately didnâ€™t find anything that would make this straightforward. If I understand you correctly, you would want to do the equivalent of

``````df.x_sumlag = df.x .+ lag(df.x) .+ lag(df.x, 2) .+ ...
``````

end then use `x_sumlag` as a RHS variable in your formula?

I kind of think this is unusual enough to warrant writing out explicitly like above, but to see whether you can just construct the terms programmatically, I thought Iâ€™d have a look at how the `@formula` macro manages to do this. As I said above, I think the way to go is via `identity`, which allows you to apply functions to your predictors, so I checked:

``````julia> @macroexpand(@formula(y ~ identity(x + lag(x) + lag(x, 2))))
:(StatsModels.Term(:y) ~ StatsModels.capture_call(identity, ((x,)->identity(x + lag(x) + lag(x, 2))), (:x,), \$(Expr(:copyast, :(\$(QuoteNode(:(identity(x + lag(x) + lag(x, 2)))))))), [StatsModels.Term(:x) + StatsModels.capture_call(lag, ((x,)->lag(x)), (:x,), \$(Expr(:copyast, :(\$(QuoteNode(:(lag(x))))))), [StatsModels.Term(:x)]) + StatsModels.capture_call(lag, ((x,)->lag(x, 2)), (:x,), \$(Expr(:copyast, :(\$(QuoteNode(:(lag(x, 2))))))), [StatsModels.Term(:x), StatsModels.ConstantTerm(2)])]))
``````

errâ€¦ not sure how helpful that is!? Maybe @dave.f.kleinschmidt has a better idea!

I think ideally you should be able to do

``````sumlags(x, n) = reduce(+, (lag(x, in) for in in 1:n))
``````

and then

``````@formula(y ~ sumlags(x, 5)
``````

this doesnâ€™t work because `sumlags` is applied element-wise.

I think there should be a way to force vector-wise application of functions. That would solve this problem.

1 Like

The specific context is cumulative (log) growth rates, so not too unusual. Itâ€™s just that Iâ€™d like the flexibility of the specification to be in the formula. Constructing the columns ahead of time would be somewhat prohibitive in this setting.

I have tried things along the lines of what both of you have suggested (separate functions, use `identity`). I pretty much always get out a tuple of vectors rather than a single vector.

Right now I have a multi-step procedure that seems to work ok (took me some time to get there though). I pass the formula and data to `modelcols`, then I use `sum(reduce(hcat, thosedangcols); dims=2)` to collapse the problem piece, and then I pass the arrays to `lm(X,y)` (excluding the missing rows).

Is that really less work than using a `sumlags` function on the data frame before?

Not using a DataFrame! These are univariate predictive regressions and the data, which is simulated, comes in as a single vector.

If itâ€™s very simple like that then your solution makes more sense.

I would still file an issue in StatsModels as I described above. `lag` is special-cased iirc. which makes things a bit unintuitive.

It looks like the implementation of `lag` is quite complicated though, here.

1 Like

The short answer is no, not currently. The reasons are a little arcane and I honestly still struggle to find the language to explain the difference, but here goes

In StatsModels formula world, there are two domains that functions can operate on: they can take in terms, returning new terms, or they can operate on values. All `FunctionTerms` are of the second type: they take in data values and return new data values at the point when the model matrix is constructed (e.g., `modelcols`). Everything else is of the first type, including `lag`: it takes in a term, and returns a new term that wraps it. The reason this distinction is necessary is that, in general, many transformations are â€śstatefulâ€ť: you canâ€™t lag a term without having the whole column available, or at least without keeping track of some state (which hasnâ€™t been implemented yet, FWIWâ€¦). Same thing for a categorical embedding (need to know the levels), using a spline basis function (need to know the knots), etc. `FunctionTerm` is a kind of special-case catch-all for anything that ISNâ€™T stateful, and can be computed elementwise.

However, the real reason this isnâ€™t possible currently has more to do with how `FunctionTerm`s are implemented efficiently, which is that as soon as a non-special call is encountered in the AST (e.g., an `identity` wrapper), a single anonymous function is create that will evaluate the whole subtree. What that means is that any later, term-level calls like `lag` are â€śprotectedâ€ť from any special interpretation, since theyâ€™re now moved inside teh body of an anonymous function that will only be called during `modelcols`.

HOWEVER, https://github.com/JuliaStats/StatsModels.jl/pull/183 changes this, and one of the consequences is that the â€śprotectedâ€ť/â€śunprotectedâ€ť distinction that was previously implicit can be explicitly controlled. For your case youâ€™d be able to do something like ` ~ protect(unprotect(lag(x, 1)) + unprotect(lag(x, 2)) + unprotect(lag(x, 3)) + ...)`. The `protect` has the effect of â€śblockingâ€ť normal formula syntax, so `+` becomes literal `+` (applied elementwise to the columns returned by the inner terms). The `unprotect` is necessary to lift the function calls back to operating on the level of terms (instead of elements), since `lag(t::Term, deg)` is whatâ€™s happening internally to create a `LagTerm`.

That syntax is a bit clunky for this use case, and Iâ€™m honestly not sure itâ€™d work with run-time creation. Another consequence of the overhaul is that you can actually construct `FunctionTerm`s at run time, so you could MAYBE do something like `FunctionTerm(+, [unprotect(lag(ts, i)) for i in 1:n]...)`.

In any case, could you comment on #183? Itâ€™s useful to have a record of when these sorts of features would be useful, and to stress-test the proposed implementation and syntax!

Iâ€™d say that the short-term solution would just be to create your own wrapper term type (like `cumlag` or something) which generates the `LagTerm`s and has a special method for `modelcols`â€¦I could mock something up for that later if it would be useful (feel free to ping), or you could refer to the `poly` example in the docs, or to how `lag` is handled.

3 Likes

The issue with providing a mechanism to â€śforceâ€ť vector-wise application is that it makes row-wise Tables second-class citizens. Or it makes it easier for developers/users to make them second-class citizens, and that to me feels like a foot-gunâ€¦

Hereâ€™s a very quick implementation that should work on a recent release (but would need an update if that PR is mergedâ€¦):

``````using StatsModels, ShiftedArrays

struct CumulativeLagTerm <: AbstractTerm
var::Symbol
deg::Int
end

CumulativeLagTerm(t::Term, deg::ConstantTerm) = CumulativeLagTerm(t.sym, deg.n)

function cum_lag end

StatsModels.apply_schema(t::FunctionTerm{typeof(cum_lag)}, sch::StatsModels.Schema, Mod::Type) =
CumulativeLagTerm(t.args_parsed...)

StatsModels.modelcols(t::CumulativeLagTerm, d::NamedTuple) =
sum(lag(getproperty(d, t.var), n) for n in 0:t.deg)
``````

If you want more fanciness, you can replace the `Symbol` field with an `AbstractTerm` field, and then recursively call `apply_schema` and `modelcols` on that (before `lag`). And this will allocate multiple lagged copies during `modelcols` but could be easily adapted in place I think, and anyway I imagine the lagged copies are lazy/views.

And to see that it works:

``````julia> d = (y = rand(10), x = rand(10))

(y = [0.6652339141519401, 0.8199830613408774, 0.3082719190712637, 0.5641384193416041, 0.9162429720895131, 0.5964769804332677, 0.16593915691237204, 0.7772815057098379, 0.6761020580661794, 0.5611322216599754], x = [0.8168175392602228, 0.07952938889286587, 0.5336111528611283, 0.1991051802567536, 0.8827738557801499, 0.1735290642427343, 0.8198276589487901, 0.43795516094329745, 0.40397516199161654, 0.844752403741194])

julia> f = @formula(y ~ 1 + cum_lag(x, 3));

julia> ff = apply_schema(f, schema(d))

FormulaTerm
Response:
y(continuous)
Predictors:
1
CumulativeLagTerm(:x, 3)

julia> modelcols(ff.rhs, d)

10Ă—2 reshape(::Array{Union{Missing, Float64},2}, 10, 2) with eltype Union{Missing, Float64}:
1.0   missing
1.0   missing
1.0   missing
1.0  1.62906
1.0  1.69502
1.0  1.78902
1.0  2.07524
1.0  2.31409
1.0  1.83529
1.0  2.50651

julia> let x=d.x
x .+ lag(x) .+ lag(x, 2) .+ lag(x, 3)
end

10-element Array{Union{Missing, Float64},1}:
missing
missing
missing
1.6290632612709706
1.6950195777908976
1.789019253140766
2.075235759228428
2.3140857399149715
1.8352870461264383
2.506510385624898
``````
2 Likes