In this notebook I describe evaluating the rank of a model matrix X by forming a pivoted Cholesky factorization of X'X. I use this form because the X'X matrix will be evaluated whether or not the rank is needed.
I would appreciate if those who know the numerical linear algebra involved could tell me if my discussion of the tolerance is reasonable.
As explained in the documentation for dpstrf, this value [tol] should be negated
to be used as a relative tolerance. Otherwise it is used as an absolute
tolerance which usually is not what you want.
This is not quite right. If tol is less than zero, dpstrf uses the recommended default value n*eps()*maximum(diag(A)), where A is n \times n, (which is relative, but not scaled by tol). Experiments showed that this yields stable factorization and accurate rank estimates for tested matrices where the condition number of the positive-definite projection is less than about 1/(n*eps()). If the condition is worse, both factors and rank estimate are questionable IIUC.
Thus your suggestion of \sqrt \epsilon is perhaps too conservative.
Usually one cares more about identifying linear dependent variables in the model matrix rather than just the rank. This way, if there are linearly dependent variables, these can be dropped such that one obtains a model matrix that is full rank. If the approach to solving the problem is through inversion of the normal matrix, I believe the best method would be.
Compute the QR decomposition of X.
The diagonal elements of R identify linearly independent variables as non-zero elements and linearly dependent as those with value zero.
Then one can compute the Cholesky decomposition of the normal matrix… For calculating the Gram matrix one can use R^{\top}R.
The QR decomposition is a preferred approach to singular-values decomposition (which rank uses) and has can be tailored to have similar tolerances.
Edit:
Actually I was thinking from a perspective of using iterative solvers. If you are using the Cholesky decomposition for solving the problem, then it is better not to do the QR decomposition as it is \mathcal{O}\left(m n^{2}\right) rather than Cholesky which is \mathcal{O}\left(n^{3}\right). The R in the QR decomposition is the same as the upper triangular of the Cholesky factor from the Gram matrix. You can identify which columns are linearly independent checking which diagonal elements of the L are non-zero.
As an aside, I still need to verify if the R approach from the QR decomposition can be linked to the singular-values decomposition for calculating the effective degrees of freedom, (i.e., \text{Tr}\left[ X \left(X^{\top}X + \Gamma\right)^{-1} X^{\top}\right])
This approach can be somewhat brittle (the estimates change discontinuously as a function of the data). In practice, using a vague but proper prior or hierarchical/multilevel models is more robust.
I use it mostly for instrumental variable estimators such as 2SLS or after transformations such as fixed effects. In those cases, it makes sense for the user to specify a model which for some estimations the model matrix has to be trimmed to a subset which is full rank.
@Ralph_Smith Thank you for the correction on my reading of the documentation.
The reason that I chose the square root of eps() is because the positive semidefinite matrix is constructed as X'X and it is rank deficiency of X that I want to detect. This also related to the next comment by @Nosferican
@Tamas_Papp I agree that the use of hierarchical/multilevel models or, more generally, mixed-effects models is more robust. This calculation is being performed within the MixedModels package whose purpose is to fit exactly those models.
I have been contacted by users who have some factors they are modeling as fixed effects terms that may also be rank deficient.