How to remove specific collumn from ModelFrame (StatsModels.jl)?

Hello!

I have ModelFrame from data:

mf = ModelFrame(@formula(y ~ 1 + x*y), df))

where ‘x’ and ‘y’ - categorical. But I have rank-deficient modelmatrix(mf) and want to drop rank-deficient columns, I can identify them, but have no idea how to rebuild ModelFrame with right coefnames ets? Any idea how to do that?

Hi!
Sharing from experience: If you give a runnable example (MWE) and some sort of indication of desired output for that example - the answers are so much quicker.

Hi!
I need to make ModelFrame that give me full-rank design matrix.

For example I have data:

using StatsModels, LinearAlgebra

var = rand(10)
fe1 = ["a","a","a","a","b","b","b","b","b","b"] 
fe2 = ["1","1","2","2","1","1","2","2","2","2"]
df = DataFrame(var = var, fe1 = fe1, fe2 = fe2)

f   = @formula(var ~ fe1 & fe2)
s   = schema(f, df)
as  = apply_schema(f, s, LinearModel)
mf  = ModelFrame(as, s, df, LinearModel)
mm  = modelmatrix(mf) # < This is rank-deficient
cn  = coefnames(mf)

# I can find all linear independent columns:
qrd = qr(mm)
cols = findall(x-> abs(x) > 1e-8, diag(qrd.R))

# and i can get good  design matrix for modelling and coefnames vector
fr_mm = mm[:, cols]
fr_cn = cn[cols]

# do something with mm  (ModelFrame)
# ???

fr_mm  == modelmatrix(mf) # should be true

Then I want to modify mf (ModelFrame) and remove linear dependent columns (5) to get good modelmatrix (mm = modelmatrix(mf)) any time further, how to do that?? - this is main question.

I think the columns in the QR factorization are permuted from the factorized matrix, and so indexing the mm matrix with cols doesn’t give anything useful.

The problem with finding a set of independent columns is that there are many sets of independent columns and it isn’t clear which one to choose.

You can find collinear columns with QR decomposition:“The Behavior of the QR-Factorization Algorithm with Column Pivoting” by Engler (1997)).

In this example I can improve:

qrd = qr(mm' * mm)
cols = findall(x-> abs(x) > 1e-8, diag(qrd.R))

My question not about how to find collinear columns (I think it can be done not only with QR, with SVD or spectral decomposition), I ask about how to drop certain column from ModelFrame?

ModelFrame is deprecated:

The ModelFrame and ModelMatrix types can still be used to do this transformation, but this is only to preserve some backwards compatibility. Package authors who would like to include support for fitting models from a @formula are strongly encouraged to directly use schema, apply_schema, and modelcols to handle the table-to-matrix transformations they need.

Why do you need it? Instead you could call modelcols(as, df) and coefnames(as). This is what GLM.jl does on git master for example.

Hi! Thank you! So… Question can be changed: what can i do with FormulaTerm to get full-rank matrix further:


var = rand(10)
fe1 = ["a","a","a","a","b","b","b","b","b","b"] 
fe2 = ["1","1","2","2","1","1","2","2","2","2"]
df = DataFrame(var = var, fe1 = fe1, fe2 = fe2)

f     = @formula(var ~ fe1 & fe2)
s     = schema(f, df)
as    = apply_schema(f, s, LinearModel)
Y,mm  = modelcols(as, df) 
rn,cn = coefnames(as)

qrd = qr(mm' * mm)
cols = findall(x-> abs(x) > 1e-8, diag(qrd.R))

fr_mm = mm[:, cols]
fr_cn = cn[cols]

# do something with as  (FormulaTerm)
# ???
# some code

Y,mm  = modelcols(as, df)
fr_mm  == mm  # true

P.S. Suppose, I support ModelFrame because I use assign from ModelMatrix

mf     = ModelFrame(f, sch, data, MetidaModel)
mm    = ModelMatrix(mf)
# then somewhere:
mm.assign

Ah so what you want is to be able to compute the equivalent of mm.assign? Then I think you can do the same as this (t being as.rhs):

And then you can drop the entries corresponding to the columns you dropped.

But I don’t think you can nor need to modify the formula itself.

Hi! Thank you very much for advice about assign, but it is not main question) I still ask about

how to modify FormulaTerm to get full-rank design matrix

.

So why I need exactly that:

I am making package for statistical models and it can works only with full-rank design matrices (I try to made tweaks to avoid problems, but it not works or works very unstable. and only one appropriate way - to use full-rank matrix).

So… I use StatModels and now store ModelFrame in model object for compatibility with some packages like Effects.jl
Now I take design matrix (X) with StatsModels from ModelFrame (and change that how you advice) and work with it, but if I need full-rank design matrix - I should modify it, and after that it no more corresponds to object I saved (ModelFrame or FormulaTerm) and if other package try to get coefficients from FormulaTerm it would not match coefficients estimated from model.

So, for correct working I need to modify FormulaTerm to make it structure corresponsable to real coefficients and covariance.
Other way - to drop support of StatsModels, but in this way I lost ability to use Effects.jl
Now I think that working around StatsModels is still appropriate way for package developing… thats why I try to get way to modify FormulaTerm to get full-rank design matrix

OK. Unfortunately that doesn’t seem easy, in particular because it implies the interaction between multiple terms rather than a single term. @dave.f.kleinschmidt has an idea.

A possible hack would be to fit the model using the modified matrix with collinear columns dropped, but still return the full coefficients and covariance matrix with NaN for omitted columns.

1 Like

Thank you! I’ll try to come up with something.

Hi!

I add this to my Proj:

asgn(f::FormulaTerm) = asgn(f.rhs)
asgn(t) = mapreduce(((i,t), ) -> i*ones(StatsModels.width(t)),
                    append!,
                    enumerate(StatsModels.vectorize(t)),
                    init=Int[])

so I have:

 Proj.asgn(f)
6-element Vector{Int64}:
 1
 1
 1
 1
 1
 1

and

julia> StatsModels.asgn(f)
6-element Vector{Int64}:
 2
 2
 3
 3
 3
 4

Is any thoughts how to fix it?