GLM is slow on large datasets. Using OnlineStats for regressions? MixedModels?

I’m not a numerical analyst, but won’t the condition number of (X’ * X) \ (X’ * Y) be the square of that of X \ Y?

It is, but it’s usually not a problem in statistical applications. If X is almost rank deficient, then you’ll have a lot of uncertainty about the parameter estimates. The amplified noise from the squared condition number will be nothing compared to the statistical uncertainty. In some applications you expect a perfect fit, and in that case it can matter whether you use QR or Cholesky.
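A quick way to see the effect in Julia (the near-collinear X below is contrived purely for illustration):

    using LinearAlgebra

    n, p = 1_000, 5
    X = randn(n, p)
    X[:, p] = X[:, 1] .+ 1e-8 .* randn(n)     # make two columns nearly collinear
    y = X * ones(p) .+ 0.1 .* randn(n)

    cond(X)                # condition number of X
    cond(X' * X)           # roughly cond(X)^2

    beta_qr   = X \ y                   # solved via pivoted QR under the hood
    beta_norm = (X' * X) \ (X' * y)     # normal equations

With this much collinearity the two solutions can differ noticeably, but both differences are small next to the sampling uncertainty in the estimates.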


Econometrics 101 is a must-have!

Thanks for the tip.

Question: in some fields (“chemometrics”, etc.), it is common to use PCR (or PLS). PCR essentially chooses the number of latent variables based on the condition number. SVD is convenient because one can play around with \sigma_{i-1}/\sigma_i and find a suitable cut. With a “rank-revealing” QR factorization, where the diagonal elements of R are sorted in descending magnitude, a similar idea could be used. But the standard QR factorization doesn’t sort the diagonal elements of R, right?
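To make the idea concrete, a rough sketch of the kind of cut I have in mind; the function name and the ratio threshold are just placeholders, not recommendations:

    using LinearAlgebra

    # Pick the number of latent variables from the SVD spectrum by looking
    # for the first large drop between consecutive singular values.
    function n_latent(X; tol = 1e3)
        s = svdvals(X)
        for i in 2:length(s)
            s[i-1] / s[i] > tol && return i - 1   # big gap: truncate here
        end
        return length(s)                          # no sharp drop: keep all
    end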


That is, of course, a valid point :slight_smile:

This article shows you the equations but doesn’t discuss the proper numerical methods (such as SVD, QR, Cholesky, or sparse matrix methods…) for solving them.

That is correct. You can use a column-pivoted QR factorization for this;

invperm(F.jpvt)

gives you the sorted column ranking.
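A minimal sketch of how that fits together, with X as placeholder data:

    using LinearAlgebra

    X = randn(100, 6)                # placeholder data
    F = qr(X, ColumnNorm())          # column-pivoted ("rank-revealing") QR

    abs.(diag(F.R))                  # nonincreasing, so consecutive ratios give a cut
    F.jpvt                           # original column indices in pivot order
    invperm(F.jpvt)                  # pivot rank of each original column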