Convert a (numerically) non-PD matrix into a PD matrix


I am trying to fit a linear regression model that predicts the overall delay of flights from distance, carrier, and some other predictors. I decided to use SweepOperator.jl because of the large number of observations (more than 40 million).
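
For context, the sweep operator only ever needs the p × p Gram matrix, which can be accumulated block by block so the 40-million-row design matrix never has to fit in memory at once. A minimal NumPy sketch of that accumulation (`accumulate_gram` and `chunks` are illustrative names, not part of any library):

```python
import numpy as np

def accumulate_gram(chunks, p):
    """Accumulate G = X'X and v = X'y over streamed (X_block, y_block)
    pairs, so only the p-by-p Gram matrix is held in memory."""
    G = np.zeros((p, p))
    v = np.zeros(p)
    for Xb, yb in chunks:
        G += Xb.T @ Xb   # each block contributes its own Gram matrix
        v += Xb.T @ yb
    return G, v
```

The normal equations `G @ beta = v` then give the OLS fit, which is exactly what sweeping the augmented Gram matrix computes.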

However, the Gram matrix I formed is not positive definite (a snapshot of its eigenvalues is shown below), presumably for numerical reasons. Applying the sweep operator to this matrix produces many NaN entries in the estimated \hat{\beta}.
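
This behaviour is easy to reproduce: a Gram matrix is positive semidefinite in exact arithmetic, but collinear or near-collinear predictors (for example, a full set of carrier dummies plus an intercept) push its smallest eigenvalues to roughly zero, where rounding error can make them come out negative. A toy NumPy illustration:

```python
import numpy as np

# X'X is PSD in exact arithmetic, but with an exactly collinear column
# its smallest computed eigenvalue lands at (or slightly below) zero.
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 3))
X = np.column_stack([X, X[:, 0] + X[:, 1]])  # force exact collinearity
G = X.T @ X
eigs = np.linalg.eigvalsh(G)
print(eigs.min(), eigs.max())  # min is ~0, possibly a tiny negative number
```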

I do not know what I can do to make it positive definite (with some guarantee).

What I Have Done

I chose the first 1 million observations and formed a Gram matrix, which is also non-PD.

I add an identity matrix (scaled somehow) to it, which makes it PD. Since the entries of the Gram matrix are extremely large, this seems to solve the problem: when I compared the results given by sweep!() and sklearn.linear_model.LinearRegression(), they were essentially the same.
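
One common way to make the "add an identity" step less ad hoc is to scale the jitter to the matrix itself and increase it only until a Cholesky factorization succeeds. A NumPy sketch of that heuristic (`make_pd` and its defaults are my own choices, not a SweepOperator.jl function):

```python
import numpy as np

def make_pd(G, start=1e-10, factor=10.0, max_tries=30):
    """Add jitter tau * mean(diag(G)) * I for increasing tau until
    Cholesky succeeds. Scaling the jitter by the diagonal matters
    precisely because the Gram entries here are huge."""
    scale = np.mean(np.diag(G))
    I = np.eye(G.shape[0])
    tau = start
    for _ in range(max_tries):
        Gj = G + tau * scale * I
        try:
            np.linalg.cholesky(Gj)  # succeeds iff Gj is numerically PD
            return Gj
        except np.linalg.LinAlgError:
            tau *= factor
    raise np.linalg.LinAlgError("could not make the matrix PD")
```

The jitter that first makes Cholesky pass is usually tiny relative to the scale of the coefficients, which is presumably why the sweep!() and sklearn results matched so closely.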

However, I am not sure whether this is a generally acceptable approach, or whether there is any rationale behind it.

AFAIK there are specific algorithms for online updating of OLS that are fairly robust, e.g.

```
@article{gentleman_as75,
  title={Algorithm AS 75: Basic procedures for large, sparse or weighted linear least squares problems},
  author={Gentleman, W Morven},
  journal={Journal of the Royal Statistical Society. Series C (Applied Statistics)}
}
```
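
For reference, the core updating idea in that family of algorithms is to rotate each new observation row into an upper-triangular factor R with Givens rotations, never forming X'X at all. A plain-NumPy sketch of that idea (this is not the published AS 75 routine, and it omits that algorithm's weighting and square-root-free refinements):

```python
import numpy as np

def include_row(R, qtb, x, y):
    """Fold one observation (x, y) into the upper-triangular factor R
    and the transformed response qtb using Givens rotations."""
    x = np.asarray(x, dtype=float).copy()
    y = float(y)
    for i in range(len(x)):
        if x[i] == 0.0:
            continue
        r = np.hypot(R[i, i], x[i])
        c, s = R[i, i] / r, x[i] / r
        Ri = R[i, i:].copy()
        R[i, i:] = c * Ri + s * x[i:]   # rotated row stays in R
        x[i:] = -s * Ri + c * x[i:]     # x[i] is annihilated
        qtb[i], y = c * qtb[i] + s * y, -s * qtb[i] + c * y
    return R, qtb

def solve_beta(R, qtb):
    # back-solve R beta = Q'b (R is upper triangular)
    return np.linalg.solve(R, qtb)
```

After all rows have been included, solving R β = Q'b gives the OLS coefficients. Because the update works on R directly (whose condition number is the square root of that of X'X), it is far less fragile than sweeping a badly conditioned Gram matrix.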

In fact, what you have done essentially recreates ridge regression, although it is usually written as adding lambda * identity to X'X, with the parameter lambda chosen by something like cross-validation. This biases the estimated coefficients toward zero, but it often gives better out-of-sample predictions, especially for high-dimensional problems.
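
To make the connection concrete, adding a scaled identity to the Gram matrix before solving is exactly the closed-form ridge estimate. A small NumPy sketch (`ridge_beta` is a hypothetical helper, not a library function):

```python
import numpy as np

def ridge_beta(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam*I) beta = X'y.
    With lam = 0 this reduces to OLS via the normal equations; the
    'identity trick' from the question is this with a small lam."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

Increasing `lam` shrinks the coefficient vector toward zero, which is the downward bias mentioned above; for a tiny `lam` relative to the (huge) Gram entries, the fit is numerically indistinguishable from OLS, which explains the agreement with sklearn.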