Convenient way to generate all interaction terms up to a specified order

Ross_Boylan · June 21, 2019, 9:01pm

Given 4 variables, is there a convenient way to generate a formula with all main effects and all interactions up to 3-way interactions? I believe in R it would be ~(b+c+d+e)^3. In julia that syntax appears to be interpreted as an ordinary function, i.e., cube the sum of the variables. Since categorical variables don’t sum, that eventually produces an error.

julia> using StatsModels
julia> ff = @formula(1~(b+c+d+e)^3)
FormulaTerm
Response:
  1
Predictors:
  (b,c,d,e)->(b + c + d + e) ^ 3

julia> s = schema(ff, xn)  # all vars are categorical
Dict{Any,Any} with 4 entries:
  e => e
  b => b
  c => c
  d => d

julia> ts = apply_schema(ff, s)
FormulaTerm
Response:
  1
Predictors:
  (b,c,d,e)->(b + c + d + e) ^ 3

julia> xx = modelcols(ts, xn)
ERROR: MethodError: no method matching +(::CategoricalString{UInt8}, ::CategoricalString{UInt8})

As far as I can tell, @formula doesn’t support any shortcut notation for all interactions up to order n directly (excepting a main effects only or fully saturated model), and my searching hasn’t turned up any discussion of this topic.

Thanks.
Ross

dave.f.kleinschmidt · July 3, 2019, 8:54pm

That’s right, this isn’t currently supported. But with the power of Terms 2.0 you could implement this behavior fairly easily by overloading apply_schema(t::FunctionTerm{typeof(^)}, sch, ctx) to generate the interactions you want. If you’re willing to take a dependency on Combinatorics.jl then you can do it really easily. The basic idea is to return a tuple of the main effect and interaction terms you’d like to include (I’ve used & to create the interaction terms because it’s smart about treating a one-element interaction term as a no-op).

using StatsModels, Combinatorics, Iterators

combinations_upto(x, n) = Iterators.flatten(combinations(x, i) for i in 1:n)
expand_exp(args, deg::ConstantTerm) =
    tuple(((&)(terms...) for terms in combinations_upto(args, deg.n))...)

StatsModels.apply_schema(t::FunctionTerm{typeof(^)}, sch::StatsModels.Schema, ctx::Type) =
    apply_schema.(expand_exp(t.args_parsed...), Ref(sch), ctx)

StatsModels.apply_schema(t::FunctionTerm{typeof(^)}, sch::StatsModels.FullRank, ctx::Type) =
    apply_schema.(expand_exp(t.args_parsed...), Ref(sch), ctx)

(the method for StatsModels.FullRank is a bit of an unfortunate consequence of the way that the schema is applied for subtypes of StatisticalModel; it’s necessary to avoid a method ambiguity).

Then you can do things like:

julia> f = @formula(y ~ 1 +(a+b+c+d) ^ 2)
FormulaTerm
Response:
  y(unknown)
Predictors:
  1
  (a,b,c,d)->(a + b + c + d) ^ 2

julia> ff = apply_schema(f, schema(f, d))
FormulaTerm
Response:
  y(continuous)
Predictors:
  1
  a(DummyCoding:2→1)
  b(DummyCoding:2→1)
  c(DummyCoding:2→1)
  d(DummyCoding:2→1)
  a(DummyCoding:2→1) & b(DummyCoding:2→1)
  a(DummyCoding:2→1) & c(DummyCoding:2→1)
  a(DummyCoding:2→1) & d(DummyCoding:2→1)
  b(DummyCoding:2→1) & c(DummyCoding:2→1)
  b(DummyCoding:2→1) & d(DummyCoding:2→1)
  c(DummyCoding:2→1) & d(DummyCoding:2→1)

julia> modelcols(ff.rhs, d)
16×11 Array{Float64,2}:
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
 1.0  0.0  1.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
 1.0  1.0  1.0  1.0  0.0  1.0  1.0  0.0  1.0  0.0  0.0
 1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0
 1.0  0.0  1.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0
 1.0  1.0  1.0  0.0  1.0  1.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0
 1.0  1.0  0.0  1.0  1.0  0.0  1.0  1.0  0.0  0.0  1.0
 1.0  0.0  1.0  1.0  1.0  0.0  0.0  0.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0

If you use a formula like this in the context of a statistical model, you’ll get the nice defaults like automatically adding an intercept term if one is not specified, and upgrading the first categorical predictor to “full rank” if no intercept is explicitly specified with 0:

julia> apply_schema(@formula(y ~ (a+b+c+d)^2), sch, StatisticalModel)
FormulaTerm
Response:
  y(continuous)
Predictors:
  1
  a(DummyCoding:2→1)
  b(DummyCoding:2→1)
  c(DummyCoding:2→1)
  d(DummyCoding:2→1)
  a(DummyCoding:2→1) & b(DummyCoding:2→1)
  a(DummyCoding:2→1) & c(DummyCoding:2→1)
  a(DummyCoding:2→1) & d(DummyCoding:2→1)
  b(DummyCoding:2→1) & c(DummyCoding:2→1)
  b(DummyCoding:2→1) & d(DummyCoding:2→1)
  c(DummyCoding:2→1) & d(DummyCoding:2→1)

julia> apply_schema(@formula(y ~ 0 + (a+b+c+d)^2), sch, StatisticalModel)
FormulaTerm
Response:
  y(continuous)
Predictors:
  0
  a(StatsModels.FullDummyCoding:2→2)
  b(DummyCoding:2→1)
  c(DummyCoding:2→1)
  d(DummyCoding:2→1)
  a(DummyCoding:2→1) & b(DummyCoding:2→1)
  a(DummyCoding:2→1) & c(DummyCoding:2→1)
  a(DummyCoding:2→1) & d(DummyCoding:2→1)
  b(DummyCoding:2→1) & c(DummyCoding:2→1)
  b(DummyCoding:2→1) & d(DummyCoding:2→1)
  c(DummyCoding:2→1) & d(DummyCoding:2→1)

Note that this all requires a slight tweak to StatsModels to support the nested tuples of terms so it won’t work out of the box unless you ] dev StatsModels#dfk/sumcleanup. You could get around that by creating a special term struct like ExpTerm that (analogously with InteractionTerm) controls the behavior of this term in different contexts in more detail, but that would require a lot of extra methods (including more specific handling of FullRank schemas).

dave.f.kleinschmidt · September 28, 2019, 9:27am

The sum-cleanup PR has been merged so this should work out of the box now (at least on master, although a release is planned for this weekend)

Topic		Replies	Views
StatsModels R-like formula for a general quadratic model Statistics	7	2393	August 7, 2021
Getting RegressionFormulae ^ (interaction) operator to work Statistics	8	641	May 13, 2022
Using all independent variables with @formula in a multiple linear model New to Julia glm	18	4424	January 29, 2023
Is there a better DSL (domain-specific language) for defining a formula in linear models? Statistics	5	1184	April 1, 2019
Summing (StatsModels) terms General Usage question	10	634	October 11, 2020

Convenient way to generate all interaction terms up to a specified order

Related topics