Getting RegressionFormulae ^ (interaction) operator to work

Ross_Boylan · April 8, 2022, 9:10am

My effort to use the ^ operator from RegressionFormulae doesn’t seem to have worked. What am I missing?

The docs just referenced say

(a+b+c)^2 generates a + b + c + a&b + a&c + b&c , but not a&b&c

Test code:

using CategoricalArrays
using StatsModels
using DataFrames
using RegressionFormulae

function follow(f::FormulaTerm, df::DataFrame)
    print("formula: ", f, "\n  with terms ", terms(f))
    s = schema(f, df)
    print("\nschema: ", s)
    ts = apply_schema(f, s)
    print("apply_schema: ", ts, " with coefficient names", coefnames(ts))

    x = modelcols(ts.rhs, df)
    print("\nmodelcols: the the result has size ", size(x), " and type ", typeof(x))
    ## it's Matrix{Float64} and I see no labels on it.  Nor does the type allow any.
    
end

df = DataFrame(a=[1, 2, 3], b=[0, 0, 1], c=[0.5, 2.5, 20], d=categorical(["xa", "xa", "xb"]))
follow(@formula(a~(a+b+c)^2), df)

Result:

formula: a ~ :((a + b + c) ^ 2)
  with terms Term[a, b, c]
schema: StatsModels.Schema with 3 entries:
  c => c
  b => b
  a => a
apply_schema: a ~ :((a + b + c) ^ 2) with coefficient names("a", "(a + b + c) ^ 2")
modelcols: the the result has size (3, 1) and type Matrix{Float64}

Expected result:
After apply_schema coefficient names would reflect the expanded list of terms, and modelcols would have more than 1 column (like 7).

pdeffebach · April 8, 2022, 8:06pm

Interesting. A cursory glance at the package indicates this may be a bug. Please file an issue at RegressionFormulae.jl

Ross_Boylan · April 9, 2022, 5:23am

Short version: I think RegressionFormulae requires contextual cues for the interaction operator to work. I didn’t provide them.

The test suite for RegressionFormulae for “powers of terms” has some crucial differences from what I did. It also has some broken tests covering more exotic cases than mine. I think the crucial thing is that they all get results by performing a fit on a special dummy model defined for testing, and that model is a subtype of RegressionModel. The tests call fit on the model, which I think ends up invoking this line, which has a specialization for RegressionModel.

The code of that function invokes ModelFrame with the argument model=RegressionModel. I think as currently written RegressionFormulaes interpretation of ^ only operates when this contextual cue is available.

In contrast, my code invoked schema, apply_schema, and modelcols consistent with current recommendations for StatsModels; those methods are even in commented out code in the previous fit function. At least as I did it, they got no clues they were operating on a RegressionModel.

P.S. The one model column my test code produces is the result interpreting the formula very naively; it is the result of adding the 3 columns a, b and c and then squaring the result, element by element. So it is not only ^ but + that get interpreted as ordinary arithmetic, rather than special terms in the formula.

Ross_Boylan · April 29, 2022, 6:41pm

If I change the invocation from

ts = apply_schema(f, s)

to

ts = apply_schema(f, s, RegressionModel)

(the last argument is the type that the modified handling of ^ applies to), things work as expected. The output is now

formula: a ~ :((a + b + c) ^ 2)
  with terms Term[a, b, c]
schema: StatsModels.Schema with 3 entries:
  b => b
  a => a
  c => c
apply_schema: a ~ 1 + a + b + c + a & b + a & c + b & c with coefficient names("a", ["(Intercept)", "a", "b", "c", "a & b", "a & c", "b & c"])
modelcols: the the result has size (3, 7) and type Matrix{Float64}
Model Matrix:
[1.0 1.0 0.0 0.5 0.0 0.5 0.0; 1.0 2.0 0.0 2.5 0.0 5.0 0.0; 1.0 3.0 1.0 20.0 3.0 60.0 20.0]

Albert_Zevelev · April 29, 2022, 7:20pm

Thanks for bringing this up. I raised an issue in this repo 2 years ago.
If we want ppl to use Julia for serious stats work I think it is crucial to make these types of operations composable for stats-work throughout the entire Julia ecosystem:

github.com/kleinschmidt/RegressionFormulae.jl

Great Idea

opened 06:41PM - 28 Apr 20 UTC

azev77

Hi @kleinschmidt, I think the Julia ecosystem would benefit from something like… this! If we wanna do serious stats it should be easy to automatically generate all interactions (up order n) etc. Some things I find particularly useful in my other stats packages outside Julia: 1. "i.x1" makes x1 into a [factor variable](https://www.stata.com/features/overview/factor-variables/) in a formula Suppose x1 takes the values: 1.2, 5, 6.4 `reg y x1`: treats x1 as continuous & returns 1 coef (assuming no intercept) `reg y i.x1` creates 3 dummies for each level of x1 & returns 3 coefficients (if there is an intercept it randomly drops one level unless the user chooses which level to drop) 2. `i.x1#(c.x2 i.x3)` Interacts all dummies of x1 w/ x2 (continuous) Interacts all dummies of x1 w/ all dummies of x3 3. Leads & Lags. Suppose D is at the state-year level. `L.D`: creates a 1 year lag of D `L(4).D`: creates a 4 year lag of D `F(4).D`: creates a 4 year lead of D. $D_{t+4}$ `reg y F(-1 0 1 2).D` estimates: y_t =b_{-1} x_{t-1} + b_{0} x_{t} +b_{1} x_{t+1} +b_{2} x_{t+2} If Julia is to be "[as easy for statistics as R](https://julialang.org/blog/2012/02/why-we-created-julia/)" these features should be in StatsModels. I'd love to help if I can.

@bkamins are there convenient-composable ways to do these things (raise a formula to a power, create interactions w/ factor variables, make arbitrary lead/lags) in the Julia ecosystem?
-By composable I mean I can do the same thing whether I’m working w/ FixedEffects.jl or a time-series package…

bkamins · April 30, 2022, 6:35am

I think we still need to work on this. Probably @kleinschmidt is the best person to ask about the current status.

nalimilan · April 30, 2022, 8:53pm

lead and lag are supported by StatsModels: Temporal variables and Time Series Terms · StatsModels.jl. Interactions with factor (categorical) variables can be obtained using a*b or a&b just like with numeric variables. Do you have particular requirements that these options don’t meet?

Albert_Zevelev · May 1, 2022, 11:17pm

I went into some detail in the issue I raised in the link above.
There are certain basic operations that economists use daily in STATA/R that are not yet convenient in the Julia ecosystem:

One example:
in STATA reg y F(-1 0 1 2).D
estimates y_{t} = \beta_{-1} x_{t-1} + \beta_{0} x_{t} + \beta_{1} x_{t+1} + \beta_{2} x_{t+2} +\varepsilon_{t}
I don’t know how to automatically do this (& the other things raised in the issue) in Julia

dave.f.kleinschmidt · May 13, 2022, 3:10pm

yeah, you got it I think this is a docs issue if you had to go into the tests to figure it out. Specializing on the ‘context clues’ (model type we’re generating a formula for) is an important part of the mechanism that allows potentially competing claims to the same syntax.

we’d also consider changing RegressionFormulae.jl to use Mod::Type{<:Any} until someone complains.

also note that there are other situations where the behavior of apply_schema changes depending on the context, most notable the logic that promotes categorical terms to “full dummy” coding when necessary (only happens for Mod::Type{<:StatisticalModel} since it may or may not make sense to do that promotion in more ML-flavored situations).

Topic		Replies	Views
Regression with variables in array Statistics	2	457	December 30, 2022
How to create a loop for a regression model? Statistics	8	2179	December 29, 2022
Extending the `@formula` syntax to generate functions Statistics	3	528	July 20, 2020
How do I do a regression using programatically defined column names? General Usage question , dataframes , glm	4	234	July 24, 2023
Convenient way to generate all interaction terms up to a specified order Statistics question	2	1574	September 28, 2019

Getting RegressionFormulae ^ (interaction) operator to work

Related topics