GLM is slow on large datasets. Using OnlineStats for regressions? MixedModels?


#21

I'm not a numerical analyst, but won't the condition number of `(X' * X) \ (X' * y)` be the square of that of `X \ y`?

It is, but it's usually not a problem in statistical applications. If X is almost rank deficient, then you'll have a lot of uncertainty about the parameter estimates, and the noise amplified by the squared condition number will be nothing compared to that statistical uncertainty. In some applications you expect a perfect fit, and in that case it can matter whether you use QR or Cholesky.
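A quick numerical sketch of both points, assuming only the standard-library LinearAlgebra (the matrix and tolerances are made up for illustration):

```julia
using LinearAlgebra

# Two nearly collinear columns -> ill-conditioned X.
X = [1.0 1.0; 1.0 1.0001; 1.0 1.0002]
y = [2.0, 2.0001, 2.0002]

# The normal equations square the condition number of X:
cond(X' * X)          # roughly cond(X)^2

# QR-based least squares (what X \ y does for rectangular X):
beta_qr = X \ y

# Cholesky solve of the normal equations:
beta_chol = cholesky(Symmetric(X' * X)) \ (X' * y)
```

Here the two routes agree to many digits; as described above, the squared conditioning of the Cholesky route only starts to bite when you expect an essentially perfect fit on near-rank-deficient data.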


#22

Econometrics 101 is a must-have!


#23

Thanks for the tip.

Question: in some applications ("chemometrics", etc.), it is common to use PCR (or PLS). PCR essentially chooses the number of latent variables based on the condition number. SVD is convenient because one can play around with \sigma_{i-1}/\sigma_i and find a suitable cut. With a "rank revealing" QR factorization, where the diagonal elements of R come out sorted in descending order, a similar idea could be used. But the standard QR factorization doesn't sort the diagonal elements of R, right?
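For reference, the \sigma_{i-1}/\sigma_i cut is easy to sketch with a plain SVD; this is a toy example with made-up singular values and an illustrative threshold, not PCR from any particular package:

```julia
using LinearAlgebra

# Toy matrix with known singular values: four "signal" components
# followed by two numerically negligible ones.
X = Matrix(Diagonal([10.0, 5.0, 2.0, 1.0, 1e-5, 1e-7]))
s = svdvals(X)                    # returned in descending order

# Keep components up to the first large gap sigma_{i-1}/sigma_i.
ratio_cut = 1e3                   # illustrative threshold
ratios = s[1:end-1] ./ s[2:end]   # ratios[i] = s[i] / s[i+1]
gap = findfirst(>(ratio_cut), ratios)
k = gap === nothing ? length(s) : gap
# k is the number of latent variables to retain (4 here).
```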


#24

That is, of course, a valid point :slight_smile: .


#25

This article shows you the equations but doesn't discuss the proper numerical methods (such as SVD, QR, Cholesky or sparse matrix methods…) to solve them.


#26

That is correct. You can use a pivoted QR factorization for this:

invperm(F.jpvt)

gives you the sorted column ranking.
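To illustrate, here is a sketch using the column-pivoted QR from the standard LinearAlgebra library (the `ColumnNorm()` pivoting argument assumes Julia ≥ 1.7; the test matrix is made up):

```julia
using LinearAlgebra

# Columns with very different scales, so pivoting has something to do.
X = randn(100, 5) * Diagonal([1.0, 1e-1, 1e-3, 10.0, 1e-6])

F = qr(X, ColumnNorm())    # pivoted QR: X[:, F.jpvt] ≈ Matrix(F.Q) * F.R

# With column pivoting, |diag(R)| comes out in descending order,
# so ratio-based cuts like sigma_{i-1}/sigma_i can be applied to it.
d = abs.(diag(F.R))

# invperm(F.jpvt) maps each original column to its rank position;
# e.g. ranking[4] == 1 means column 4 has the largest pivoted norm.
ranking = invperm(F.jpvt)
```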