Convert a (numerically) non-pd matrix into pd matrix

Mr.Robot · May 10, 2019, 4:44am

Problem

I am trying to compute a linear regression model that predicts overall delay of flights with respect to distance, carrier and some other predictors. I decided to use SweepOperator.jl because large number of observations (more than 40 million).

However, when I formed the Gram matrix, it is not positive definite (the snapshot of its eigenvalues are shown below), which is the case because of the numerical reasons. Applying sweep operator to this matrix will result in a lot of NaN in the estimated \hat{\beta}.

I do not know what I can do to make it positive definite to that it is pd (with some guarantee).

What I Have Done

I choose first 1 million observations and form a Gram matrix, which is also non-pd.

I (somehow) add an identity matrix to it and (somehow) make it pd. Since the entries of Gram matrix is extremely large, this seems to solve the problem. In fact, when I compared the results given by sweep!() and sklearn.linear_model.LinearRegression(), they are essentially the same.

However, I am not sure if this is a generally acceptable way and whether there is any rationale behind this.

Tamas_Papp · May 10, 2019, 7:57am

AFAIK there are specific algorithms for online updating of OLS that are fairly robust, eg

@article{gentleman1974algorithm,
  title={Algorithm as 75: Basic procedures for large, sparse or weighted linear least problems},
  author={Gentleman, W Morven},
  journal={Journal of the Royal Statistical Society. Series C (Applied Statistics)},
  volume=23,
  number=3,
  pages={448--454},
  year=1974,
  publisher={JSTOR}
}

pistacliffcho · May 10, 2019, 8:49am

In fact, what you have done is essentially recreated ridge regression, although it’s usually presented as lambda * identity matrix, with the parameter lambda usually chosen by something like cross-validation. This creates a downward bias in the estimated coefficients but often provides better out of sample estimates, especially for high dimensional problems.

Topic		Replies	Views
[ANN] LinearRegressionKit Package Announcements package , regression , linear-regression	9	2668	November 22, 2022
What can I do in case Gram matrix is not positive definite? General Usage question	4	1063	October 6, 2020
Near positive definite (NearPD) of a symmetric matrix in Julia Performance	9	2826	October 15, 2024
PosDefException, increasing loss of precision General Usage	4	711	March 13, 2019
Numerical inaccuracies when constructing correlation matrices General Usage linearalgebra	6	417	August 2, 2022

Convert a (numerically) non-pd matrix into pd matrix

Problem

What I Have Done

Related topics