Getting RegressionFormulae ^ (interaction) operator to work

My effort to use the ^ operator from RegressionFormulae doesn’t seem to have worked. What am I missing?

The docs just referenced say

(a+b+c)^2 generates a + b + c + a&b + a&c + b&c , but not a&b&c

Test code:

using CategoricalArrays
using StatsModels
using DataFrames
using RegressionFormulae

function follow(f::FormulaTerm, df::DataFrame)
    print("formula: ", f, "\n  with terms ", terms(f))
    s = schema(f, df)
    print("\nschema: ", s)
    ts = apply_schema(f, s)
    print("apply_schema: ", ts, " with coefficient names", coefnames(ts))

    x = modelcols(ts.rhs, df)
    print("\nmodelcols: the the result has size ", size(x), " and type ", typeof(x))
    ## it's Matrix{Float64} and I see no labels on it.  Nor does the type allow any.
    
end

df = DataFrame(a=[1, 2, 3], b=[0, 0, 1], c=[0.5, 2.5, 20], d=categorical(["xa", "xa", "xb"]))
follow(@formula(a~(a+b+c)^2), df)

Result:

formula: a ~ :((a + b + c) ^ 2)
  with terms Term[a, b, c]
schema: StatsModels.Schema with 3 entries:
  c => c
  b => b
  a => a
apply_schema: a ~ :((a + b + c) ^ 2) with coefficient names("a", "(a + b + c) ^ 2")
modelcols: the the result has size (3, 1) and type Matrix{Float64}

Expected result:
After apply_schema coefficient names would reflect the expanded list of terms, and modelcols would have more than 1 column (like 7).

Interesting. A cursory glance at the package indicates this may be a bug. Please file an issue at RegressionFormulae.jl

Short version: I think RegressionFormulae requires contextual cues for the interaction operator to work. I didn’t provide them.

The test suite for RegressionFormulae for “powers of terms” has some crucial differences from what I did. It also has some broken tests covering more exotic cases than mine. I think the crucial thing is that they all get results by performing a fit on a special dummy model defined for testing, and that model is a subtype of RegressionModel. The tests call fit on the model, which I think ends up invoking this line, which has a specialization for RegressionModel.

The code of that function invokes ModelFrame with the argument model=RegressionModel. I think as currently written RegressionFormulaes interpretation of ^ only operates when this contextual cue is available.

In contrast, my code invoked schema, apply_schema, and modelcols consistent with current recommendations for StatsModels; those methods are even in commented out code in the previous fit function. At least as I did it, they got no clues they were operating on a RegressionModel.

P.S. The one model column my test code produces is the result interpreting the formula very naively; it is the result of adding the 3 columns a, b and c and then squaring the result, element by element. So it is not only ^ but + that get interpreted as ordinary arithmetic, rather than special terms in the formula.

1 Like

If I change the invocation from

ts = apply_schema(f, s)

to

ts = apply_schema(f, s, RegressionModel)

(the last argument is the type that the modified handling of ^ applies to), things work as expected. The output is now

formula: a ~ :((a + b + c) ^ 2)
  with terms Term[a, b, c]
schema: StatsModels.Schema with 3 entries:
  b => b
  a => a
  c => c
apply_schema: a ~ 1 + a + b + c + a & b + a & c + b & c with coefficient names("a", ["(Intercept)", "a", "b", "c", "a & b", "a & c", "b & c"])
modelcols: the the result has size (3, 7) and type Matrix{Float64}
Model Matrix:
[1.0 1.0 0.0 0.5 0.0 0.5 0.0; 1.0 2.0 0.0 2.5 0.0 5.0 0.0; 1.0 3.0 1.0 20.0 3.0 60.0 20.0]
1 Like

Thanks for bringing this up. I raised an issue in this repo 2 years ago.
If we want ppl to use Julia for serious stats work I think it is crucial to make these types of operations composable for stats-work throughout the entire Julia ecosystem:

@bkamins are there convenient-composable ways to do these things (raise a formula to a power, create interactions w/ factor variables, make arbitrary lead/lags) in the Julia ecosystem?
-By composable I mean I can do the same thing whether I’m working w/ FixedEffects.jl or a time-series package…

I think we still need to work on this. Probably @kleinschmidt is the best person to ask about the current status.

1 Like

lead and lag are supported by StatsModels: Temporal variables and Time Series Terms · StatsModels.jl. Interactions with factor (categorical) variables can be obtained using a*b or a&b just like with numeric variables. Do you have particular requirements that these options don’t meet?

I went into some detail in the issue I raised in the link above.
There are certain basic operations that economists use daily in STATA/R that are not yet convenient in the Julia ecosystem:

One example:
in STATA reg y F(-1 0 1 2).D
estimates y_{t} = \beta_{-1} x_{t-1} + \beta_{0} x_{t} + \beta_{1} x_{t+1} + \beta_{2} x_{t+2} +\varepsilon_{t}
I don’t know how to automatically do this (& the other things raised in the issue) in Julia

yeah, you got it :slight_smile: I think this is a docs issue if you had to go into the tests to figure it out. Specializing on the ‘context clues’ (model type we’re generating a formula for) is an important part of the mechanism that allows potentially competing claims to the same syntax.

we’d also consider changing RegressionFormulae.jl to use Mod::Type{<:Any} until someone complains.

also note that there are other situations where the behavior of apply_schema changes depending on the context, most notable the logic that promotes categorical terms to “full dummy” coding when necessary (only happens for Mod::Type{<:StatisticalModel} since it may or may not make sense to do that promotion in more ML-flavored situations).

1 Like