How to remove specific collumn from ModelFrame (StatsModels.jl)?

PharmCat · November 28, 2023, 8:12pm

Hello!

I have ModelFrame from data:

mf = ModelFrame(@formula(y ~ 1 + x*y), df))

where ‘x’ and ‘y’ - categorical. But I have rank-deficient modelmatrix(mf) and want to drop rank-deficient columns, I can identify them, but have no idea how to rebuild ModelFrame with right coefnames ets? Any idea how to do that?

Dan · November 28, 2023, 8:59pm

Hi!
Sharing from experience: If you give a runnable example (MWE) and some sort of indication of desired output for that example - the answers are so much quicker.

PharmCat · December 5, 2023, 11:58am

Hi!
I need to make ModelFrame that give me full-rank design matrix.

For example I have data:

using StatsModels, LinearAlgebra

var = rand(10)
fe1 = ["a","a","a","a","b","b","b","b","b","b"] 
fe2 = ["1","1","2","2","1","1","2","2","2","2"]
df = DataFrame(var = var, fe1 = fe1, fe2 = fe2)

f   = @formula(var ~ fe1 & fe2)
s   = schema(f, df)
as  = apply_schema(f, s, LinearModel)
mf  = ModelFrame(as, s, df, LinearModel)
mm  = modelmatrix(mf) # < This is rank-deficient
cn  = coefnames(mf)

# I can find all linear independent columns:
qrd = qr(mm)
cols = findall(x-> abs(x) > 1e-8, diag(qrd.R))

# and i can get good  design matrix for modelling and coefnames vector
fr_mm = mm[:, cols]
fr_cn = cn[cols]

# do something with mm  (ModelFrame)
# ???

fr_mm  == modelmatrix(mf) # should be true

Then I want to modify mf (ModelFrame) and remove linear dependent columns (5) to get good modelmatrix (mm = modelmatrix(mf)) any time further, how to do that?? - this is main question.

Dan · December 5, 2023, 12:56pm

I think the columns in the QR factorization are permuted from the factorized matrix, and so indexing the mm matrix with cols doesn’t give anything useful.

The problem with finding a set of independent columns is that there are many sets of independent columns and it isn’t clear which one to choose.

PharmCat · December 5, 2023, 2:17pm

You can find collinear columns with QR decomposition:“The Behavior of the QR-Factorization Algorithm with Column Pivoting” by Engler (1997)).

In this example I can improve:

qrd = qr(mm' * mm)
cols = findall(x-> abs(x) > 1e-8, diag(qrd.R))

My question not about how to find collinear columns (I think it can be done not only with QR, with SVD or spectral decomposition), I ask about how to drop certain column from ModelFrame?

nalimilan · December 5, 2023, 9:39pm

ModelFrame is deprecated:

The ModelFrame and ModelMatrix types can still be used to do this transformation, but this is only to preserve some backwards compatibility. Package authors who would like to include support for fitting models from a @formula are strongly encouraged to directly use schema, apply_schema, and modelcols to handle the table-to-matrix transformations they need.

Why do you need it? Instead you could call modelcols(as, df) and coefnames(as). This is what GLM.jl does on git master for example.

PharmCat · December 5, 2023, 11:36pm

Hi! Thank you! So… Question can be changed: what can i do with FormulaTerm to get full-rank matrix further:


var = rand(10)
fe1 = ["a","a","a","a","b","b","b","b","b","b"] 
fe2 = ["1","1","2","2","1","1","2","2","2","2"]
df = DataFrame(var = var, fe1 = fe1, fe2 = fe2)

f     = @formula(var ~ fe1 & fe2)
s     = schema(f, df)
as    = apply_schema(f, s, LinearModel)
Y,mm  = modelcols(as, df) 
rn,cn = coefnames(as)

qrd = qr(mm' * mm)
cols = findall(x-> abs(x) > 1e-8, diag(qrd.R))

fr_mm = mm[:, cols]
fr_cn = cn[cols]

# do something with as  (FormulaTerm)
# ???
# some code

Y,mm  = modelcols(as, df)
fr_mm  == mm  # true

P.S. Suppose, I support ModelFrame because I use assign from ModelMatrix

mf     = ModelFrame(f, sch, data, MetidaModel)
mm    = ModelMatrix(mf)
# then somewhere:
mm.assign

nalimilan · December 6, 2023, 8:45am

Ah so what you want is to be able to compute the equivalent of mm.assign? Then I think you can do the same as this (t being as.rhs):

github.com

JuliaStats/StatsModels.jl/blob/df862a601947de3912d190cb2f1d9e1d751ad4da/src/modelframe.jl#L219-L222


      
          asgn(t) = mapreduce(((i,t), ) -> i*ones(width(t)),
                              append!,
                              enumerate(vectorize(t)),
                              init=Int[])

And then you can drop the entries corresponding to the columns you dropped.

But I don’t think you can nor need to modify the formula itself.

PharmCat · December 6, 2023, 11:18am

Hi! Thank you very much for advice about assign, but it is not main question) I still ask about

how to modify FormulaTerm to get full-rank design matrix

.

So why I need exactly that:

I am making package for statistical models and it can works only with full-rank design matrices (I try to made tweaks to avoid problems, but it not works or works very unstable. and only one appropriate way - to use full-rank matrix).

So… I use StatModels and now store ModelFrame in model object for compatibility with some packages like Effects.jl
Now I take design matrix (X) with StatsModels from ModelFrame (and change that how you advice) and work with it, but if I need full-rank design matrix - I should modify it, and after that it no more corresponds to object I saved (ModelFrame or FormulaTerm) and if other package try to get coefficients from FormulaTerm it would not match coefficients estimated from model.

So, for correct working I need to modify FormulaTerm to make it structure corresponsable to real coefficients and covariance.
Other way - to drop support of StatsModels, but in this way I lost ability to use Effects.jl
Now I think that working around StatsModels is still appropriate way for package developing… thats why I try to get way to modify FormulaTerm to get full-rank design matrix

nalimilan · December 10, 2023, 4:39pm

OK. Unfortunately that doesn’t seem easy, in particular because it implies the interaction between multiple terms rather than a single term. @dave.f.kleinschmidt has an idea.

A possible hack would be to fit the model using the modified matrix with collinear columns dropped, but still return the full coefficients and covariance matrix with NaN for omitted columns.

PharmCat · December 12, 2023, 3:36pm

Thank you! I’ll try to come up with something.

PharmCat · December 24, 2023, 2:32pm

Hi!

I add this to my Proj:

asgn(f::FormulaTerm) = asgn(f.rhs)
asgn(t) = mapreduce(((i,t), ) -> i*ones(StatsModels.width(t)),
                    append!,
                    enumerate(StatsModels.vectorize(t)),
                    init=Int[])

so I have:

 Proj.asgn(f)
6-element Vector{Int64}:
 1
 1
 1
 1
 1
 1

and

julia> StatsModels.asgn(f)
6-element Vector{Int64}:
 2
 2
 3
 3
 3
 4

Is any thoughts how to fix it?

Topic		Replies	Views
Get columns of DataFrames.jl ModelMatrix object Statistics	14	2854	March 24, 2017
PSA: breaking changes in StatsModels v0.6.0 (Terms 2.0: Son of Terms) Statistics	6	1592	July 2, 2019
Should model matrix for nested factors be full-rank? Statistics	15	587	August 1, 2022
Diagnosing "Fixed-effects matrix is rank-deficient" Statistics	2	1036	June 24, 2021
Add column names to a GLM.LinearModel estimated in matrix form Statistics question , glm	11	1149	July 1, 2020

How to remove specific collumn from ModelFrame (StatsModels.jl)?

Related topics