Convenient way to generate all interaction terms up to a specified order

Given 4 variables, is there a convenient way to generate a formula with all main effects and all interactions up to 3-way interactions? I believe in R it would be ~(b+c+d+e)^3. In julia that syntax appears to be interpreted as an ordinary function, i.e., cube the sum of the variables. Since categorical variables don’t sum, that eventually produces an error.

julia> using StatsModels
julia> ff = @formula(1~(b+c+d+e)^3)
FormulaTerm
Response:
  1
Predictors:
  (b,c,d,e)->(b + c + d + e) ^ 3

julia> s = schema(ff, xn)  # all vars are categorical
Dict{Any,Any} with 4 entries:
  e => e
  b => b
  c => c
  d => d

julia> ts = apply_schema(ff, s)
FormulaTerm
Response:
  1
Predictors:
  (b,c,d,e)->(b + c + d + e) ^ 3

julia> xx = modelcols(ts, xn)
ERROR: MethodError: no method matching +(::CategoricalString{UInt8}, ::CategoricalString{UInt8})

As far as I can tell, @formula doesn’t support any shortcut notation for all interactions up to order n directly (excepting a main effects only or fully saturated model), and my searching hasn’t turned up any discussion of this topic.

Thanks.
Ross

2 Likes

That’s right, this isn’t currently supported. But with the power of Terms 2.0 you could implement this behavior fairly easily by overloading apply_schema(t::FunctionTerm{typeof(^)}, sch, ctx) to generate the interactions you want. If you’re willing to take a dependency on Combinatorics.jl then you can do it really easily. The basic idea is to return a tuple of the main effect and interaction terms you’d like to include (I’ve used & to create the interaction terms because it’s smart about treating a one-element interaction term as a no-op).

using StatsModels, Combinatorics, Iterators

combinations_upto(x, n) = Iterators.flatten(combinations(x, i) for i in 1:n)
expand_exp(args, deg::ConstantTerm) =
    tuple(((&)(terms...) for terms in combinations_upto(args, deg.n))...)

StatsModels.apply_schema(t::FunctionTerm{typeof(^)}, sch::StatsModels.Schema, ctx::Type) =
    apply_schema.(expand_exp(t.args_parsed...), Ref(sch), ctx)

StatsModels.apply_schema(t::FunctionTerm{typeof(^)}, sch::StatsModels.FullRank, ctx::Type) =
    apply_schema.(expand_exp(t.args_parsed...), Ref(sch), ctx)

(the method for StatsModels.FullRank is a bit of an unfortunate consequence of the way that the schema is applied for subtypes of StatisticalModel; it’s necessary to avoid a method ambiguity).

Then you can do things like:

julia> f = @formula(y ~ 1 +(a+b+c+d) ^ 2)
FormulaTerm
Response:
  y(unknown)
Predictors:
  1
  (a,b,c,d)->(a + b + c + d) ^ 2

julia> ff = apply_schema(f, schema(f, d))
FormulaTerm
Response:
  y(continuous)
Predictors:
  1
  a(DummyCoding:2β†’1)
  b(DummyCoding:2β†’1)
  c(DummyCoding:2β†’1)
  d(DummyCoding:2β†’1)
  a(DummyCoding:2β†’1) & b(DummyCoding:2β†’1)
  a(DummyCoding:2β†’1) & c(DummyCoding:2β†’1)
  a(DummyCoding:2β†’1) & d(DummyCoding:2β†’1)
  b(DummyCoding:2β†’1) & c(DummyCoding:2β†’1)
  b(DummyCoding:2β†’1) & d(DummyCoding:2β†’1)
  c(DummyCoding:2β†’1) & d(DummyCoding:2β†’1)

julia> modelcols(ff.rhs, d)
16Γ—11 Array{Float64,2}:
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
 1.0  0.0  1.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
 1.0  1.0  1.0  1.0  0.0  1.0  1.0  0.0  1.0  0.0  0.0
 1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0
 1.0  0.0  1.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0
 1.0  1.0  1.0  0.0  1.0  1.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0
 1.0  1.0  0.0  1.0  1.0  0.0  1.0  1.0  0.0  0.0  1.0
 1.0  0.0  1.0  1.0  1.0  0.0  0.0  0.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0

If you use a formula like this in the context of a statistical model, you’ll get the nice defaults like automatically adding an intercept term if one is not specified, and upgrading the first categorical predictor to β€œfull rank” if no intercept is explicitly specified with 0:

julia> apply_schema(@formula(y ~ (a+b+c+d)^2), sch, StatisticalModel)
FormulaTerm
Response:
  y(continuous)
Predictors:
  1
  a(DummyCoding:2β†’1)
  b(DummyCoding:2β†’1)
  c(DummyCoding:2β†’1)
  d(DummyCoding:2β†’1)
  a(DummyCoding:2β†’1) & b(DummyCoding:2β†’1)
  a(DummyCoding:2β†’1) & c(DummyCoding:2β†’1)
  a(DummyCoding:2β†’1) & d(DummyCoding:2β†’1)
  b(DummyCoding:2β†’1) & c(DummyCoding:2β†’1)
  b(DummyCoding:2β†’1) & d(DummyCoding:2β†’1)
  c(DummyCoding:2β†’1) & d(DummyCoding:2β†’1)

julia> apply_schema(@formula(y ~ 0 + (a+b+c+d)^2), sch, StatisticalModel)
FormulaTerm
Response:
  y(continuous)
Predictors:
  0
  a(StatsModels.FullDummyCoding:2β†’2)
  b(DummyCoding:2β†’1)
  c(DummyCoding:2β†’1)
  d(DummyCoding:2β†’1)
  a(DummyCoding:2β†’1) & b(DummyCoding:2β†’1)
  a(DummyCoding:2β†’1) & c(DummyCoding:2β†’1)
  a(DummyCoding:2β†’1) & d(DummyCoding:2β†’1)
  b(DummyCoding:2β†’1) & c(DummyCoding:2β†’1)
  b(DummyCoding:2β†’1) & d(DummyCoding:2β†’1)
  c(DummyCoding:2β†’1) & d(DummyCoding:2β†’1)

Note that this all requires a slight tweak to StatsModels to support the nested tuples of terms so it won’t work out of the box unless you ] dev StatsModels#dfk/sumcleanup. You could get around that by creating a special term struct like ExpTerm that (analogously with InteractionTerm) controls the behavior of this term in different contexts in more detail, but that would require a lot of extra methods (including more specific handling of FullRank schemas).

2 Likes