How to obtain numeric arrays from a linear mixed model formula?

Hello. I have a data frame (“dat”, with variables x1, y, and id) and a formula (@formula(y ~ 1 + x1 + (1 | id))). I want to obtain the following numeric arrays:

  • y: response vector
  • X: design matrix for fixed effect, with two columns: intercept and x1
  • Z: design matrix for random effects: here just intercept
  • id: cluster identifier

I’m aware that the StatsModels.jl package allows me to parse a fixed effect model with functions such as “schema, apply_schema, modelcols”.

My question is how do I do it with mixed effect models? Thank you.

I think the easiest way to obtain these arrays is to create the LinearMixedModel object and extract the properties

julia> using MixedModels, RCall

julia> m1 = LinearMixedModel(@formula(height ~ 1 + age + (1|Subject)), rcopy(R"nlme::Oxboys"));

julia> propertynames(m1)
(:formula, :sqrtwts, :A, :L, :optsum, :θ, :theta, :β, :beta, :λ, :lambda, :stderror, :σ, :sigma, :b, :u, :lowerbd, :X, :y, :rePCA, :reterms, :feterms, :objective, :pvalues)

julia> typeof(m1.y)   # y as a vector
Array{Float64,1}

julia> typeof(m1.X)  # X as a two-column matrix
Array{Float64,2}

julia> typeof(Matrix(first(m1.reterms))) # Z as a dense matrix (it is not actually used in this form)
Array{Float64,2}

You can obtain the grouping factor, Subject in this case, for the random effects term by taking apart the CategoricalTerm in the first of the reterms. (This is the only reterm in this case but the design allows for multiple random effects terms.) If you just want the indices into the levels you can get them explicitly as the refs property.

julia> propertynames(first(m1.reterms))
(:trm, :refs, :cnames, :z, :wtz, :λ, :inds, :adjA, :scratch)

julia> typeof(first(m1.reterms).refs)
Array{Int32,1}

julia> typeof(first(m1.reterms).trm)
StatsModels.CategoricalTerm{DummyCoding,String,25}

To get the levels you must take apart the trm itself.

julia> propertynames(first(m1.reterms).trm)
(:sym, :contrasts)

julia> propertynames(first(m1.reterms).trm.contrasts)
(:matrix, :termnames, :levels, :contrasts, :invindex)

I realize I may not be answering the question that you intended. If you want to find out how the random-effects terms are parsed then you should look at https://github.com/JuliaStats/MixedModels.jl/blob/master/src/randomeffectsterm.jl

1 Like

Thank you Prof. Bates. I tried your solution and it works great, so I think this is the way to go.

@dmbates solution will work, but since you asked how to do it using “standard” statsmodels functions, the trick is to specify that you want to use a MixedModel (or LinearMixedModel or GeneralizedMixedModel) context (the third argument) when you apply_schema:

julia> using MixedModels, StatsModels
[ Info: Precompiling MixedModels [ff71e718-51f3-5ec2-a782-8ffcbfa3c316]

julia> dat = (y = rand(100), x1 = rand(100), id = repeat([:a, :b, :c, :d], 25));

julia> f = @formula(y ~ 1 + x1 + (1|id))
FormulaTerm
Response:
  y(unknown)
Predictors:
  1
  x1(unknown)
  (id)->1 | id

julia> f_sch = apply_schema(f, schema(dat), MixedModel)
FormulaTerm
Response:
  y(continuous)
Predictors:
  1
  x1(continuous)
  (1 | id)

julia> y, (X, Z) = modelcols(f_sch, dat)

(The last step uses tuple destructuring, sicne modelcols of a formula term returns a tuple of the left-hand and right-hand sides, and the right-hand size is itself a tuple the fixed effects matrix and one or more random effects matrices…)

Just to prove that this gives you what you were looking for:

julia> y == dat.y
true

julia> X == [ones(100) dat.x1]
true

julia> size(Z)
(100, 4)

julia> Z[:,1] == (dat.id .== :a)
true

julia> Z[:,2] == (dat.id .== :b)
true

julia> Z.refs[1:6]
6-element Array{Int32,1}:
 1
 2
 3
 4
 1
 2

julia> Z.levels
4-element Array{Symbol,1}:
 :a
 :b
 :c
 :d

julia> dat.id == Z.levels[Z.refs]
true

Note that Z is a special ReMat type that’s internal to MixedModels. It’s an AbstractArray but uses a special, compact representation of the Z matrix. If you want to get the full matrix you can do something like

julia> Matrix(Z)
100×4 Array{Float64,2}:
 1.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0
 0.0  0.0  1.0  0.0
 0.0  0.0  0.0  1.0
 1.0  0.0  0.0  0.0
 ⋮
 1.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0
 0.0  0.0  1.0  0.0
 0.0  0.0  0.0  1.0

which creates a dense matrix. If you want a sparse matrix, you can do:

julia> sparse(Z)
100×4 LinearAlgebra.Adjoint{Float64,SparseArrays.SparseMatrixCSC{Float64,Int32}}:
 1.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0
 0.0  0.0  1.0  0.0
 0.0  0.0  0.0  1.0
 1.0  0.0  0.0  0.0
 ⋮
 1.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0
 0.0  0.0  1.0  0.0
 0.0  0.0  0.0  1.0

(It’s an Adjoint because ReMat also stores the adjoint of the sparse Z internally…)

1 Like

Also as a follow up, this is the first thing that MixedModels does internally when it’s creating a LinearMixedModel. See these lines in the source code.

2 Likes

Thank you Dave. This is great to know!

1 Like

Hi Dave,

The same code you used above returns an error here, for f_sch = apply_schema(f, schema(dat), MixedModel).
It reads “UndefVarError: schema not defined”. Using Julia 1.4.1, and MixedModels 1.1.6 and StatsModels 0.5.0.

Do you have any clue?

Do you have the line with

using MixedModels, StatsModels

in your example? The schema function is in the StatsModels namespace and is not re-exported by MixedModels. I suppose we could re-export it from MixedModels if this was important.

1 Like

Doug’s right, you’ll have to have using StatsModels somewhere. As for whether to re-export schema, I suppose it might be useful but at that point might as well just re-export everything. That seems like overkill though…

Yes I have the line “using MixedModels, StatsModels”

You say you are using MixedModels v1.1.6 which could be the problem. The current release version is 2.3.0

Ahh… I did not think of it coz I installed it just yesterday. Now I have it updated to the current release, and it works. Thanks for the help!

1 Like