Convenient way to generate all interaction terms up to a specified order

Given 4 variables, is there a convenient way to generate a formula with all main effects and all interactions up to 3-way interactions? I believe in R it would be ~(b+c+d+e)^3. In julia that syntax appears to be interpreted as an ordinary function, i.e., cube the sum of the variables. Since categorical variables don’t sum, that eventually produces an error.

julia> using StatsModels
julia> ff = @formula(1~(b+c+d+e)^3)
FormulaTerm
Response:
  1
Predictors:
  (b,c,d,e)->(b + c + d + e) ^ 3

julia> s = schema(ff, xn)  # all vars are categorical
Dict{Any,Any} with 4 entries:
  e => e
  b => b
  c => c
  d => d

julia> ts = apply_schema(ff, s)
FormulaTerm
Response:
  1
Predictors:
  (b,c,d,e)->(b + c + d + e) ^ 3

julia> xx = modelcols(ts, xn)
ERROR: MethodError: no method matching +(::CategoricalString{UInt8}, ::CategoricalString{UInt8})

As far as I can tell, @formula doesn’t support any shortcut notation for all interactions up to order n directly (excepting a main effects only or fully saturated model), and my searching hasn’t turned up any discussion of this topic.

Thanks.
Ross

3 Likes

That’s right, this isn’t currently supported. But with the power of Terms 2.0 you could implement this behavior fairly easily by overloading apply_schema(t::FunctionTerm{typeof(^)}, sch, ctx) to generate the interactions you want. If you’re willing to take a dependency on Combinatorics.jl then you can do it really easily. The basic idea is to return a tuple of the main effect and interaction terms you’d like to include (I’ve used & to create the interaction terms because it’s smart about treating a one-element interaction term as a no-op).

using StatsModels, Combinatorics, Iterators

combinations_upto(x, n) = Iterators.flatten(combinations(x, i) for i in 1:n)
expand_exp(args, deg::ConstantTerm) =
    tuple(((&)(terms...) for terms in combinations_upto(args, deg.n))...)

StatsModels.apply_schema(t::FunctionTerm{typeof(^)}, sch::StatsModels.Schema, ctx::Type) =
    apply_schema.(expand_exp(t.args_parsed...), Ref(sch), ctx)

StatsModels.apply_schema(t::FunctionTerm{typeof(^)}, sch::StatsModels.FullRank, ctx::Type) =
    apply_schema.(expand_exp(t.args_parsed...), Ref(sch), ctx)

(the method for StatsModels.FullRank is a bit of an unfortunate consequence of the way that the schema is applied for subtypes of StatisticalModel; it’s necessary to avoid a method ambiguity).

Then you can do things like:

julia> f = @formula(y ~ 1 +(a+b+c+d) ^ 2)
FormulaTerm
Response:
  y(unknown)
Predictors:
  1
  (a,b,c,d)->(a + b + c + d) ^ 2

julia> ff = apply_schema(f, schema(f, d))
FormulaTerm
Response:
  y(continuous)
Predictors:
  1
  a(DummyCoding:2→1)
  b(DummyCoding:2→1)
  c(DummyCoding:2→1)
  d(DummyCoding:2→1)
  a(DummyCoding:2→1) & b(DummyCoding:2→1)
  a(DummyCoding:2→1) & c(DummyCoding:2→1)
  a(DummyCoding:2→1) & d(DummyCoding:2→1)
  b(DummyCoding:2→1) & c(DummyCoding:2→1)
  b(DummyCoding:2→1) & d(DummyCoding:2→1)
  c(DummyCoding:2→1) & d(DummyCoding:2→1)

julia> modelcols(ff.rhs, d)
16×11 Array{Float64,2}:
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
 1.0  0.0  1.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
 1.0  1.0  1.0  1.0  0.0  1.0  1.0  0.0  1.0  0.0  0.0
 1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0
 1.0  0.0  1.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0
 1.0  1.0  1.0  0.0  1.0  1.0  0.0  1.0  0.0  1.0  0.0
 1.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0
 1.0  1.0  0.0  1.0  1.0  0.0  1.0  1.0  0.0  0.0  1.0
 1.0  0.0  1.0  1.0  1.0  0.0  0.0  0.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0

If you use a formula like this in the context of a statistical model, you’ll get the nice defaults like automatically adding an intercept term if one is not specified, and upgrading the first categorical predictor to “full rank” if no intercept is explicitly specified with 0:

julia> apply_schema(@formula(y ~ (a+b+c+d)^2), sch, StatisticalModel)
FormulaTerm
Response:
  y(continuous)
Predictors:
  1
  a(DummyCoding:2→1)
  b(DummyCoding:2→1)
  c(DummyCoding:2→1)
  d(DummyCoding:2→1)
  a(DummyCoding:2→1) & b(DummyCoding:2→1)
  a(DummyCoding:2→1) & c(DummyCoding:2→1)
  a(DummyCoding:2→1) & d(DummyCoding:2→1)
  b(DummyCoding:2→1) & c(DummyCoding:2→1)
  b(DummyCoding:2→1) & d(DummyCoding:2→1)
  c(DummyCoding:2→1) & d(DummyCoding:2→1)

julia> apply_schema(@formula(y ~ 0 + (a+b+c+d)^2), sch, StatisticalModel)
FormulaTerm
Response:
  y(continuous)
Predictors:
  0
  a(StatsModels.FullDummyCoding:2→2)
  b(DummyCoding:2→1)
  c(DummyCoding:2→1)
  d(DummyCoding:2→1)
  a(DummyCoding:2→1) & b(DummyCoding:2→1)
  a(DummyCoding:2→1) & c(DummyCoding:2→1)
  a(DummyCoding:2→1) & d(DummyCoding:2→1)
  b(DummyCoding:2→1) & c(DummyCoding:2→1)
  b(DummyCoding:2→1) & d(DummyCoding:2→1)
  c(DummyCoding:2→1) & d(DummyCoding:2→1)

Note that this all requires a slight tweak to StatsModels to support the nested tuples of terms so it won’t work out of the box unless you ] dev StatsModels#dfk/sumcleanup. You could get around that by creating a special term struct like ExpTerm that (analogously with InteractionTerm) controls the behavior of this term in different contexts in more detail, but that would require a lot of extra methods (including more specific handling of FullRank schemas).

2 Likes

The sum-cleanup PR has been merged so this should work out of the box now (at least on master, although a release is planned for this weekend)

1 Like