Fastest way to fit a GLMM

Hi,

I am conducting a single-cell eQTL analysis in Julia. I have several covariates (genotype PCs, sex, age) and a random effect for sample ID, since the single-cell data is quite structured. The response (expression counts) follows a Poisson distribution. I am using the MixedModels.jl package to fit the model with the formula:

@formula(exp ~ snp + gPC_1 + gPC_2 + gPC_3 + gPC_4 + gPC_5 + ePC_3 + ePC_4 + ePC_5 + sex + (1 | sampleid))

However, fitting takes over 5 seconds per model, and I have 4,000 SNPs for 10,000 genes. Is there a more efficient way to fit these models? Can I use a GPU or any other method to improve performance?
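Roughly, the fit looks like this (a minimal sketch; df stands in for my per-cell data frame, everything else is as in the formula above):

    using DataFrames, MixedModels

    # one row per cell: expression counts, SNP dosage, PCs, sex, and sample ID
    form = @formula(exp ~ snp + gPC_1 + gPC_2 + gPC_3 + gPC_4 + gPC_5 +
                          ePC_3 + ePC_4 + ePC_5 + sex + (1 | sampleid))

    # Poisson GLMM with a random intercept per sample; this is the fit that takes ~5 s
    m = fit(MixedModel, form, df, Poisson())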

Additionally, I have encountered NaN values in the model fitting for some associations due to optimization failures and convergence issues. Do you have any suggestions for improvement?

Thank you!

Have you tried the fast=true option? It’s slightly less accurate, but can be substantially faster.
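With your formula that would be roughly (form and df as in your post):

    # fast=true optimizes the fixed effects inside the PIRLS iterations instead of in the
    # outer nonlinear optimization, which is where the small accuracy trade-off comes from
    m = fit(MixedModel, form, df, Poisson(); fast=true)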

How many observations do you have per model? Which predictors are categorical and how many levels do they have? How many levels do you have for sampleid?

Additionally, I have encountered NaN values in the model fitting for some associations due to optimization failures and convergence issues. Do you have any suggestions for improvement?

It depends. It would really help if you share the data (even privately) so that we can see what’s going on.

  1. Yes, I did apply fast=true.
  2. I am running the model separately for more than one cell type. For 68 sample IDs (the levels of the random effect), I have a couple of hundred cells for the rarest cell type, but the largest cell cluster consists of a couple of thousand cells. Is having 68 levels too many?

68 levels is not many.

Can you let me know

  • how many observations in total go into a model, i.e. the result of nobs(model)?
  • how many fixed-effects coefficients are being estimated? If you have categorical predictors, this number can differ from the number of terms in the formula. (See the snippet below.)
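Both are easy to read off a fitted model, e.g. (sketch, with m a fitted model):

    nobs(m)            # total number of observations used in the fit
    length(fixef(m))   # number of fixed-effects coefficients actually estimated
    coefnames(m)       # their names (categorical terms expand into multiple coefficients)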

For the largest cell type there are approximately 5,000 observations (values of “exp” in my formula); all fixed-effect terms are continuous except for sex, which I coded as 0 and 1.

I already used @threads with 50 CPUs, but I need to improve the performance further.

Can you share a code snippet showing how you’re using @threads? Are you fitting a bunch of models in a loop? If so, I would recommend doing something more careful with Distributed so that BLAS and Julia threads aren’t competing with each other; if you’re just putting @threads in front of fit, that won’t have any impact. A rough sketch is below.
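The worker count, the fit_one helper, and the gene_cols/snp_cols lists here are placeholders rather than anything from your setup, but the pattern would be roughly:

    using Distributed
    addprocs(8)                           # placeholder worker count; tune to your machine

    @everywhere using DataFrames, LinearAlgebra, MixedModels
    @everywhere BLAS.set_num_threads(1)   # one BLAS thread per worker so they don't compete

    # hypothetical helper: fit one Poisson GLMM for a given gene/SNP column pair
    @everywhere function fit_one(df, gene::Symbol, snp::Symbol)
        dat = transform(df, gene => identity => :exp, snp => identity => :snp)
        form = @formula(exp ~ snp + gPC_1 + gPC_2 + gPC_3 + gPC_4 + gPC_5 +
                              ePC_3 + ePC_4 + ePC_5 + sex + (1 | sampleid))
        return fit(MixedModel, form, dat, Poisson(); fast=true)
    end

    # df, gene_cols, and snp_cols are assumed to exist on the driver; the CachingPool
    # sends the captured df to each worker once instead of once per task
    pool = CachingPool(workers())
    jobs = [(g, s) for g in gene_cols for s in snp_cols]
    results = pmap(p -> fit_one(df, p[1], p[2]), pool, jobs)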