Fastest way for GLMM

Hi,

I am conducting a single-cell eQTL analysis using Julia. I have several covariates (genotype PCs, sex, age) and a random effect represented by sample ID, as the single-cell data is quite structured. The model should follow a Poisson distribution. I used the MixedModels.jl package to fit the model with the formula:

@formula(exp ~ snp + gPC_1 + gPC_2 + gPC_3 + gPC_4 + gPC_5 + ePC_3 + ePC_4 + ePC_5 + sex + (1 | sampleid))

However, the fitting process takes over 5 seconds, and I have 4,000 SNPs for 10,000 genes. Is there a more efficient way to fit the model? Can I utilize a GPU or any other methods to improve performance?

Additionally, I have encountered NaN values in the model fitting for some associations due to optimization failures and convergence issues. Do you have any suggestions for improvement?

Thank you!

Have you tried the fast=true option? It’s slightly less accurate, but can be substantially faster.
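In MixedModels.jl, `fast=true` is a keyword argument to `fit` for GLMMs (it skips optimizing the fixed effects inside PIRLS). A minimal sketch, assuming a data frame `df` with the relevant columns (the names here are placeholders):

```julia
using MixedModels, DataFrames

# `df` is a hypothetical data frame with columns exp, snp, sex, sampleid.
f = @formula(exp ~ snp + sex + (1 | sampleid))
m = fit(MixedModel, f, df, Poisson(); fast = true, progress = false)
```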

How many observations do you have per model? Which predictors are categorical and how many levels do they have? How many levels do you have for sampleid?

Additionally, I have encountered NaN values in the model fitting for some associations due to optimization failures and convergence issues. Do you have any suggestions for improvement?

It depends. It would really help if you share the data (even privately) so that we can see what’s going on.

  1. Yes, I did apply fast=true.
  2. I am running the regression for more than one cell type. For the 68 sample IDs (levels of the random effect), I have a couple of hundred cells for the rare cell types, but the largest cell cluster consists of a couple of thousand cells. Are 68 levels too many?

68 levels for the random effect is not many.

Can you let me know

  • how many observations total go into a model? i.e. the result of nobs(model)
  • how many fixed effects coefficients are being estimated? if you have categorical predictors, then this number can differ from the number of terms in the formula.
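Both quantities can be read directly off a fitted model; a sketch assuming `m` is an already-fitted `MixedModel`:

```julia
using MixedModels

# Assuming `m` is a fitted model:
nobs(m)          # total number of observations in the model
length(coef(m))  # number of fixed-effects coefficients actually estimated
coefnames(m)     # names, including expanded contrasts for categorical terms
```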

For the largest cell type, which has approximately 5,000 observations (“exp” in my formula), all fixed effect terms are continuous except for sex, which I coded as 0 and 1.

I already used @threads with 50 CPUs, but I need to improve the performance further.

Can you share a code snippet showing how you’re using @threads? Are you fitting a bunch of models in a loop? Then I would recommend doing something more subtle with Distributed so that BLAS and Julia threads aren’t competing with each other. If you’re just throwing @threads in front of fit, then that won’t have any impact.
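One common pattern is to spread independent model fits over worker processes with `pmap`, pinning BLAS to a single thread on each worker so the linear algebra doesn’t oversubscribe the CPUs. A sketch under those assumptions (`jobs` and `fit_one` are hypothetical names; `fit_one` would fit the models for one gene/SNP/cell-type combination):

```julia
using Distributed
addprocs(8)  # placeholder worker count; match it to your hardware

@everywhere begin
    using MixedModels, LinearAlgebra
    BLAS.set_num_threads(1)  # one BLAS thread per worker process
end

# Hypothetical: `jobs` is a vector of (gene, snp, celltype) tuples and
# `fit_one` fits the null and alternative models for one combination.
results = pmap(fit_one, jobs)
```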

Thanks for suggesting Distributed, but I don’t fully understand how to use it instead of threads.

My script looked like this:

@threads for gene_index in gene_list
	@threads for snpIndex in 1:numSnp
		@threads for celltypeIndex in 1:length(celltypeList)
			m2 = fit(MixedModel, eval(Meta.parse(formula_alt_perSnp)), assocDf_celltype, Poisson(), fast = true, verbose = false, progress = false)
			m1 = fit(MixedModel, eval(Meta.parse(formula_null_perSnp)), assocDf_celltype, Poisson(), fast = true, verbose = false, progress = false)
			anova_result = AnovaMixedModels.anova(m1,m2)
		end
	end
end

Could you please give some advice on this?