GLMM - Cholesky factorization failed

laut · July 29, 2020, 10:44pm

I’m not sure what I’m doing wrong here to fit a Generalized Linear Mixed Model (@dmbates):

using CSV, DataFrames, Statsbase, MixedModels
orange = DataFrame(CSV.read(download(“https://raw.githubusercontent.com/timothyslau/misc/master/orange.csv”)))
categorical!(orange,:Tree)
orange.age_z = zscore(orange[:,:age])
fit(GeneralizedLinearMixedModel, @formula(circumference ~ 1 + age_z + (1 | Tree)), orange, Poisson(), LogLink())

But I’m getting an error:

PosDefException: matrix is not positive definite; Cholesky factorization failed.

But this runs just fune:

fit(LinearMixedModel, @formula(circumference ~ 1 + age + (1 | Tree)), orange)

laut · July 29, 2020, 10:48pm

Update: If I add the kwarg: fast=true it works.

Tamas_Papp · July 30, 2020, 9:16am

Generally it is best to check if there is an existing issue, if not, report it.

palday · July 30, 2020, 12:31pm

Thanks for raising an issue and posting the data as well. (I don’t check Discourse very often, but I do get GitHub notifications for issues.) To complete the cross-referencing, this is the issue on GitHub.

I’ll answer a little bit here because I don’t think this is a bug, but rather a data/model issue. The difference between the fast and slow fits for GLMM is whether both the random effects and the fixed effects are optimized. For LMM, we only need to optimize for the random effects and get the fixed effects estimates more or less for free. (I’m oversimplifying a bit here.) For GLMM, that leads to a less accurate fit because the fixed effects estimates are conditional on the random effects. (The link function basically makes it so that you get conditional instead of marginal estimates; there has been some discussion on this in various R fora, especially surrounding GLMMadaptive which can give you the marginal estimates as well as the conditional estimates.) So you have optimize over the fixed and random effects, but this greatly increases the size of the parameter space and thus makes for a slower fit. When fast=true, only the random effects are optimized, potentially greatly reducing the size of the parameter space at the cost of accuracy.

For many models, there is little difference between the two fits, but there are some pathological cases. We do have a few examples of models where the slow fit is unstable in some sense, but the fast fit is fine. To the best of my knowledge, all current examples of this pathology are Poisson family, but I don’t know why this is the case. The other pathology I have seen – noticeably different estimates – does occur in binomial models.

For now, I would check to see whether the fast fit does a good job capturing your data, by e.g. comparing fitted vs. observed. If that’s the case, then the fast fit is probably fine.

palday · July 30, 2020, 7:31pm

Somewhat related: why are you using the Poisson family for a continuous measure like circumference? Usually you see it for discrete things like event counts. Do you really expect the variance to be the same as mean (which is what the Poisson distribution does)? You can have a Normal() family GLMM with non-identity link if it’s the link you’re after (or you can consider transforming the data directly).

laut · July 30, 2020, 8:49pm

I was just trying out the GLM methods, a Gamma with Identity link is probably the right method in practice.

palday · July 30, 2020, 11:47pm

I tried out your sample data with a linear model and the fit was very good, but I don’t know if these were fake or real data and what the underlying theory says about their distribution.

I mention this mostly because we are aware of some issues with Gamma (and InverseGaussian) fits. It’s something I’m working on, but progress is slow at the moment because of more pressing concerns in my day job.

laut · July 31, 2020, 4:26pm

I tried out your sample data with a linear model and the fit was very good, but I don’t know if these were fake or real data and what the underlying theory says about their distribution.

It’s real data from the base datasets package in R:
datasets::Orange
https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/Orange.html

Properly it’s a non-linear model in theory. Ben Bolker uses it for his lme4::nlmer model example:

But a linear model gets you 90% of the way there, while being predictably off on the sigmoidal growth tails.

It’s something I’m working on, but progress is slow at the moment because of more pressing concerns in my day job.

No problem, I’m thankful for your work/efforts.

At some point, I have some functions/documentation from R for calculating the intraclass correlation coefficient ICC for these different model distributions that I think would be worth building into the MixedModels package. I’ll reach out on Github, I’ve been trying to learn Julia well enough to translate them, but maybe when you are less busy, working together, we can speed that process up

palday · August 5, 2020, 4:14pm

For the ICC stuff: waay cool.

Two big points on that:

I can’t look at your implementation without you relicensing it in a way that we can use, so BSD or MIT. (Or you can give me a one-off license to do what I want with it.) Because we’re going to follow the norm in the Julia community and use MIT license for the relevant package (see next point), which also means that I can hack on this during working hours when relevant for my day job.
We probably won’t include such things in MixedModels.jl itself, which we are trying to keep lean and focused on the core functionality. Because of JIT, the Julia community is much more focused on reducing unnecessary dependencies and playing extra code in a separate package (hence the prevalence of pairs of packages like StatsBase and Statistics). That said, we’ve been discussing creating a MixedModelsExtras package for such things (potentially also for things like marginal and conditional R^2, where are popular requests but which we don’t really want to strongly endorse). We also want to add some plotting functionality, but we haven’t decided which plotting packages to support (the notion of recipes in plotting is bringing us closer to not having to worry about that) and whether that should be part of the MixedModelsExtras package or get its own MixedModelsPlots package (because all the plotting libraries are fairly heavy dependencies).

laut · August 5, 2020, 5:59pm

MIT license switch complete.
That makes sense, anything to make the compile faster.
my email is timothyslau@gmail.com if you want to chat off thread.

Topic		Replies	Views
GLMM ERROR: UndefVarError: emηᵢ not defined Statistics question	0	255	June 13, 2022
GLM.jl LogisticRegression errors: matrix is not positive definite; Cholesky factorization failed Statistics question , glm	14	5002	June 9, 2022
How to check if GLM will be able to run OLS beforehand? General Usage question , glm	8	505	November 1, 2020
Fastest way for GLMM Performance question , package	6	238	October 7, 2024
GLM is slow on large datasets. Using OnlineStats for regressions? MixedModels? Performance glm	25	5089	November 26, 2018

GLMM - Cholesky factorization failed

Related topics