# Should model matrix for nested factors be full-rank?

In some cases, when all the levels of one factor sit "inside" the levels of another, we get a model matrix from StatsModels that is not full rank. For example, a classical bioequivalence crossover study (2×2) with factors `Var ~ Formulation + Sequence + Period + Subject`: the levels of `Subject` are distributed within `Sequence`. So we get the result below.

Should we have an operator for `@formula` that codes a factor `inside` each level of a top-level factor?

``````
:(log(Cmax)) ~ 1 + Sequence + Period + Formulation + Subject

Coefficients:
────────────────────────────────────────────────────────────────────────────────────
                    Coef.   Std. Error       t  Pr(>|t|)     Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────────────
(Intercept)      6.46193      0.077827    83.03    <1e-28    6.30052       6.62333
Sequence: tr    -0.0143149    0.105746    -0.14    0.8935   -0.233619      0.204989
Period: 2        0.039951     0.0305263    1.31    0.2041   -0.0233566     0.103259
Formulation: t   0.0358771    0.0305263    1.18    0.2524   -0.0274304     0.0991847
Subject: 2       0.518319     0.105746     4.90    <1e-04    0.299015      0.737623
Subject: 3       0.301846     0.105746     2.85    0.0092    0.082542      0.52115
Subject: 4       0.0409981    0.105746     0.39    0.7020   -0.178306      0.260302
Subject: 5      -0.596241     0.105746    -5.64    <1e-04   -0.815545     -0.376937
Subject: 6      -0.950327     0.105746    -8.99    <1e-08   -1.16963      -0.731023
Subject: 7       0.115056     0.105746     1.09    0.2884   -0.104247      0.33436
Subject: 8       0.61519      0.105746     5.82    <1e-05    0.395887      0.834494
Subject: 9       0.674252     0.105746     6.38    <1e-05    0.454948      0.893555
Subject: 10      0.843855     0.105746     7.98    <1e-07    0.624551      1.06316
Subject: 11      0.32713      0.105746     3.09    0.0053    0.107826      0.546434
Subject: 12      0.556382     0.105746     5.26    <1e-04    0.337078      0.775686
Subject: 13     -0.277867     0.105746    -2.63    0.0154   -0.49717      -0.0585627
Subject: 14     -0.690512     0.105746    -6.53    <1e-05   -0.909816     -0.471208
Subject: 15      0.0        NaN          NaN       NaN     NaN           NaN
Subject: 16     -0.0551107    0.105746    -0.52    0.6075   -0.274415      0.164193
Subject: 17     -0.324189     0.105746    -3.07    0.0057   -0.543493     -0.104885
Subject: 18     -0.237008     0.105746    -2.24    0.0354   -0.456312     -0.017704
``````

I don’t really understand what you propose. Could you give more details? What kind of operator would you suggest?

For example, we have the data and code below. The result will be similar to the one posted above. `Subject: 1` is not included in the model matrix (it is the base level), but `Subject: 15` is included, and it makes the matrix collinear. So, if DummyCoding is done separately for the subjects in the TR sequence and in the RT sequence, there is no problem.

Code:

``````
using DataFrames, CSV, GLM
dfrds = CSV.File("rds1.csv", types = Dict(:Var => Float64, :Subject => String, :Period => String, :Sequence => String, :Formulation => String)) |> DataFrame
olscmax = lm(@formula(log(Var) ~ Sequence + Period + Formulation + Subject), dfrds; dropcollinear = true)
``````

Data:

``````Subject	Sequence	Period	Formulation	Var
3	TR	1	T	225.95
1	RT	1	R	181.09
2	RT	1	R	114.48
4	RT	1	R	176.91
5	TR	1	T	147.01
6	TR	1	T	97.53
7	RT	1	R	146.60
8	TR	1	T	45.58
9	RT	1	R	109.20
10	RT	1	R	125.61
11	TR	1	T	92.26
12	RT	1	R	237.95
13	TR	1	T	145.46
14	TR	1	T	179.96
15	TR	1	T	173.86
16	RT	1	R	144.00
17	RT	1	R	185.10
18	TR	1	T	117.99
1	RT	2	T	210.14
2	RT	2	T	98.72
3	TR	2	R	241.09
4	RT	2	T	186.65
5	TR	2	R	139.56
6	TR	2	R	124.77
7	RT	2	T	137.62
8	TR	2	R	57.71
9	RT	2	T	139.36
10	RT	2	T	120.43
11	TR	2	R	116.10
12	RT	2	T	228.63
13	TR	2	R	165.09
14	TR	2	R	181.09
15	TR	2	R	206.66
16	RT	2	T	143.25
17	RT	2	T	192.22
18	TR	2	R	125.50
``````

I tried it with LinearRegressionKit.jl.

``````using LinearRegressionKit, StatsModels, DataFrames, CSV

dfrds = CSV.File("rds1.csv", types = Dict(:Var => Float64, :Subject => String, :Period => String, :Sequence => String, :Formulation => String)) |> DataFrame

olscmax = regress(@formula(log(Var) ~ Sequence + Period + Formulation + Subject), dfrds, req_stats = ["default", "vif"])
``````

and I get an OK result (as you will see, there are some warnings because some of your categories have no observations, which is probably because you have extracted only part of the dataset). I also added the VIF to have something about the potential collinearity. I haven't looked in much detail, but the lack of observations for some categories probably gives weird results in GLM. I would recommend using FreqTables.jl from @nalimilan to get more precision about where the problem lies.

``````
Model definition:       :(log(Var)) ~ 1 + Sequence + Period + Formulation + Subject
Used observations:      23
Model statistics:
MSE: 0.013198                 RMSE: 0.114883
σ̂²: 0.013198
Confidence interval: 95%

Coefficients statistics:
Terms ╲ Stats │       Coefs      Std err            t     Pr(>|t|)       low ci      high ci          VIF
───────────────┼──────────────────────────────────────────────────────────────────────────────────────────
(Intercept)    │     5.26434    0.0938013      56.1223  0.000317339      4.86075      5.66793          0.0
Sequence: TR   │     -8.2708    3.85482e6  -2.14557e-6     0.999998    -1.6586e7     1.6586e7   6.60528e15
Period: 2      │   0.0122562    0.0741564     0.165275     0.883923    -0.306813     0.331326      1.63043
Formulation: T │  0.00583091    0.0741564    0.0786299     0.944486    -0.313238       0.3249       2.3913
Subject: 10    │   -0.431159     0.148313     -2.90709     0.100759      -1.0693      0.20698       1.5942
Subject: 11    │     7.52524    3.85482e6   1.95216e-6     0.999999    -1.6586e7     1.6586e7   1.10088e15
Subject: 12    │     0.20772     0.148313      1.40055     0.296333    -0.430419     0.845859       1.5942
Subject: 13    │     7.98053    3.85482e6   2.07027e-6     0.999999    -1.6586e7     1.6586e7   1.10088e15
Subject: 14    │     8.19337    3.85482e6   2.12549e-6     0.999998    -1.6586e7     1.6586e7   1.10088e15
Subject: 15    │     8.15888    3.85482e6   2.11654e-6     0.999999    -1.6586e7     1.6586e7   1.10088e15
Subject: 16    │   -0.294527     0.148313     -1.98585     0.185443    -0.932666     0.343612       1.5942
Subject: 17    │  -0.0434443     0.148313    -0.292923     0.797177    -0.681583     0.594694       1.5942
Subject: 18    │     7.77123    3.85482e6   2.01598e-6     0.999999    -1.6586e7     1.6586e7   1.10088e15
Subject: 2     │    -0.60704     0.114883       -5.284    0.0339997     -1.10134     -0.11274      1.82609
Subject: 3     │     8.45016    3.85482e6    2.1921e-6     0.999998    -1.6586e7     1.6586e7   2.10168e15
Subject: 4     │  -0.0709459     0.114883    -0.617551     0.599816    -0.565246     0.423354      1.82609
Subject: 5     │     7.96192    3.85482e6   2.06544e-6     0.999999    -1.6586e7     1.6586e7   2.10168e15
Subject: 6     │     7.58079    3.85482e6   1.96657e-6     0.999999    -1.6586e7     1.6586e7   1.10088e15
Subject: 7     │   -0.276633     0.148313      -1.8652     0.203152    -0.914771     0.361506       1.5942
Subject: 8     │      6.8201    3.85482e6   1.76924e-6     0.999999    -1.6586e7     1.6586e7   1.10088e15
Subject: 9     │   -0.571159     0.148313     -3.85104    0.0612943      -1.2093    0.0669793       1.5942

``````

Though if you want to be specific about which category is the base, you can specify it with the contrasts (same format as in GLM). You can also find an example here: regression example with contrasts

``````
olscmax = regress(@formula(log(Var) ~ Sequence + Period + Formulation + Subject),
                  dfrds, req_stats = ["default", "vif"],
                  contrasts = Dict(:Subject => DummyCoding(base = "1")))
``````

Hi! Thanks a lot for LinearRegressionKit.jl, it works perfectly!
But this problem is not because the data is incomplete. It is complete. It happens because we have a `Sequence` factor and a `Subject` factor at the same time, and `Subject` is nested within `Sequence`: one level of `Subject` corresponds to only one level of `Sequence`. Some software (IBM SPSS, for example) uses syntax like `/DESIGN=Sequence Period Formulation Subject(Sequence)` for this case. This means that even though `Subject` has the same levels, they are coded separately for each `Sequence`, and it also means that DummyCoding (for n - 1 factor levels) is applied within each `Sequence` (so if we have 2 sequences we get n - 2 subject columns, with 3 sequences n - 3 columns, etc.).
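To make the counting concrete, here is a minimal, hypothetical sketch of this SPSS-style nested coding in Julia (the helper name `nested_dummies` is mine, not from any package): one base subject is dropped per sequence, so with k sequences we are left with n - k subject columns.

``````
# Hypothetical helper (not part of StatsModels): build nested dummy columns
# for `subject` within `sequence`, dropping one base subject per sequence.
function nested_dummies(subject::AbstractVector, sequence::AbstractVector)
    cols = Pair{String, Vector{Float64}}[]
    for seq in unique(sequence)
        subs = unique(subject[sequence .== seq])
        for s in subs[2:end]  # first subject in each sequence is the base level
            push!(cols, "Subject($seq): $s" => Float64.(subject .== s))
        end
    end
    return cols  # n_subjects - n_sequences columns in total
end
``````

For the 18-subject, 2-sequence dataset above this yields 16 subject columns instead of 17, which is exactly what restores full rank.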

So, one more example with data above:

``````
using DataFrames, CSV, GLM
model = lm(@formula(log(Var) ~ Sequence + Period + Formulation + Subject), dfrds; dropcollinear = true)
using MixedAnova
glm_init()
MixedAnova.anova(model, type = 3)
``````

ANOVA results:

``````Analysis of Variance

Type 3 test / F test

:(log(Var)) ~ 1 + Sequence + Period + Formulation + Subject

Table:
───────────────────────────────────────────────────────────
              DOF   Exp.SS  Mean Square    F value  Pr(>|F|)
───────────────────────────────────────────────────────────
(Intercept)    1  49.86        49.86    7795.7027    <1e-22
Sequence       1   1.7847       1.7847   279.0437    <1e-10
Period         1   0.0453       0.0453     7.0905    0.0170
Formulation    1   0.0228       0.0228     3.5725    0.0770
Subject       17
(Residuals)   16   0.1023       0.0064
───────────────────────────────────────────────────────────
``````

We see a wrong DOF for `Subject`, the absence of SS and MS for `Subject`, and a wrong F value for `Sequence` (because `Sequence` is an inter-subject factor and F should be MSsequence / MSsubject, not MSsequence / MSresidual). This is not a coding problem as such, but proper nesting support could help packages like MixedAnova.jl handle this data structure correctly.

And the question: do we need some operator in StatsModels, like `var ~ factor1 + factor2 in factor1`, to handle these models correctly?

Thanks for providing more details.
If I understand correctly, you're looking for nested effects.
I believe they are defined in `MixedModels.jl`; there is documentation here. I believe they use `/`, as the syntax builds on the one from R.


Yep! I use MixedModels, Metida, GLM, LsqFit, FixedEffectsModels, etc. when possible. But sometimes the statistical plan requires ANOVA. And that is no problem for me: for some tasks I can use R or SPSS, or write a self-implemented ANOVA. But even in self-implemented tasks I found that I can't make a full-rank model matrix with StatsModels for some common cases. So maybe additional functionality in StatsModels would help developers make a proper implementation of ANOVA…

@PharmCat have you seen GitHub - kleinschmidt/RegressionFormulae.jl ?

I think part of the idea is to automatically drop columns corresponding to unestimable contrasts, e.g. when certain combinations of levels from different categorical predictors never co-occur – missing or empty cells in a quasi-factorial design. Currently, MixedModels.jl and GLM.jl both handle this case as part of more general pivoting / rank deficiency detection, but it would be good to support dropping clearly unestimable columns (e.g., constant and constant multiples of other columns) from the model matrix or even prevent the allocation of such columns. I’ve talked to @dave.f.kleinschmidt a bit about this and part of the problem at the moment IIRC is that this falls in awkwardly between `apply_schema` (schema-time) and `modelcols` (data-time) operations.


Yes! Thank you. That's exactly what I wanted to say. The second part of the suggestion is to decide whether we need an operator, in StatsModels or in RegressionFormulae, that makes model construction clearer for nested factors. I looked at RegressionFormulae: it gives `^` and `/`, and these just expand a term into other terms. The idea is to make some operator, for example `∈`, with the following mechanism:

model `var ~ sequence + subject ∈ sequence`, with `contrasts = Dict(:subject => DummyCoding(base = Dict("A" => 1, "B" => 4)))`

data:

``````var sequence subject
. A 1
. A 2
. A 3
. B 4
. B 5
. B 6
``````

should give model matrix:

``````1 0 0 0 0 0
1 0 1 0 0 0
1 0 0 1 0 0
1 1 0 0 0 0
1 1 0 0 1 0
1 1 0 0 0 1
``````

which is full rank, with coefficients: intercept, sequence: B, subject(sequence A): 2, subject(sequence A): 3, subject(sequence B): 5, subject(sequence B): 6

To unpack this a bit more: currently, we have three stages between `@formula(...)` and actual matrices that get passed to GLM.jl etc.:

1. The formula expr is parsed into Terms (purely syntactic operations, nothing known about types of variables)
2. A schema is extracted from the data: categorical (i.e. non-numeric) variables are detected and their unique values are extracted, one column at a time. This schema is then applied to the parsed Terms, replacing `Term(:a)` with e.g. `CategoricalTerm(:a, ...)`.
3. These “concrete terms” can now be fed a table of data and will generate the numeric matrices
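As a minimal sketch of those three stages (the data frame and variable names here are illustrative, not from the thread):

``````
using StatsModels, DataFrames

df = DataFrame(y = rand(6), g = ["a", "a", "b", "b", "c", "c"])
f  = @formula(y ~ g)          # stage 1: purely syntactic parse into Terms

sch = schema(f, df)           # stage 2: extract invariants (the levels of g)
ts  = apply_schema(f, sch)    #          Term(:g) becomes a CategoricalTerm
y, X = modelcols(ts, df)      # stage 3: numeric response vector and model matrix
``````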

The issue is that, in order to support this kind of operator (create different contrasts given some kind of special syntax), we’d need to know about that syntax BEFORE the schema is computed; when the schema is applied, the full dataset is not available (by design). But special syntax is (at the moment) supported by operations that take place during schema application.

There are a number of reasons why this is not great, all with a similar shape to this: you want to somehow control the way you extract the necessary invariants from the data to make the schema, based on some special syntax/operators/model context. Some discussion here: RFC: add an additional apply_context stage · Issue #154 · JuliaStats/StatsModels.jl · GitHub

In the short term, I'm afraid the best solution is to create a new variable where you just `string` the pairs together; then the contrasts should be computed correctly.
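A minimal sketch of that workaround, assuming the `dfrds` data frame from earlier in the thread (the column name `SeqSubj` is my choice):

``````
using DataFrames

# combine the nesting pair into a single string-valued column
transform!(dfrds, [:Sequence, :Subject] => ByRow((s, x) -> string(s, ":", x)) => :SeqSubj)

# then fit on the combined variable instead of the raw Subject column:
# olscmax = lm(@formula(log(Var) ~ Sequence + Period + Formulation + SeqSubj), dfrds)
``````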


Hi! Thank you for the great explanation!

I tried to intercept the `∈` function term to make a specific term for a better `nested` contrast:

``````
function StatsModels.apply_schema(
    t::FunctionTerm{typeof(∈)},
    sch::Schema,
    Mod::Type{<:RegressionModel},
)
    length(t.args_parsed) == 2 ||
        throw(ArgumentError("malformed term: $t (exactly two arguments required)"))

    #=
    How to get data here?
    =#
    # Make a specific term: NestedTerm(sym, contrasts, base)
end
``````

But how do I get access to the data here? There are too many `apply_schema` methods in StatsModels…

Or should it be done at the `schema` stage? Or should I overload `concrete_term`?

StatsModels is a very confusing package, so I can't trace the path of `FunctionTerm{typeof(∈)}`.

Hi! Could you kindly clarify - what is the status of this issue?

You're lookin' at it (in the issue). It's a pretty disruptive change, so it needs some serious time to sit down and get right, and there are a number of higher priority things. Plus I write Julia all day long at my day job now, so there's not as much bandwidth to push this stuff forward as I'd like!

RE "how to get data here": you can't, and that's by design. `apply_schema` does just that: it applies the information in the schema to transform terms into different terms. If the information you need isn't present in the schema, you can't do what you're trying to do, I'm afraid.

The reason that constraint is important (IMO) is that we want to support streaming/online/out-of-core stuff too, or at least not prevent support for it in the future. So lots of things that seem reasonable (like having access to the full dataset during `apply_schema`, even though datasets are usually pretty small data frames in the grand scheme of things) are not allowed.

It’s an analogous situation to the splines situation (Safe predictions using `@formula` and regression splines · Issue #168 · JuliaStats/StatsModels.jl · GitHub).

If we had a separate "apply context" stage, you could hook in there to create your `NestedTerm`, and then we could specialize schema extraction on these custom term types. I do think that's a better design overall, but like I said above, I don't really have the bandwidth to think through how to make it work without adding TOO much extra complexity (which, as you already pointed out, already makes StatsModels.jl an intimidating package for many developers).