Problem
I am trying to fit a linear regression model that predicts the overall delay of flights from distance, carrier, and some other predictors. I decided to use SweepOperator.jl because of the large number of observations (more than 40 million).
However, the Gram matrix I formed is not positive definite (a snapshot of its eigenvalues is shown below), which I attribute to numerical error. Applying the sweep operator to this matrix produces many NaN values in the estimated \hat{\beta}.
I do not know how to make it positive definite (pd), ideally with some guarantee.
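For concreteness, here is a minimal sketch of the check described above, written in Python with NumPy on synthetic data rather than the flight data. It assumes the Gram matrix is the augmented cross-product of [X y], the usual input to a sweep operator, accumulated over row chunks so the full data set never has to be in memory at once. The helper names `accumulate_gram` and `is_positive_definite` are hypothetical and are not part of SweepOperator.jl.

```python
import numpy as np

def accumulate_gram(chunks):
    """Sum A.T @ A over row chunks, where A = [X y] for each chunk."""
    gram = None
    for X_chunk, y_chunk in chunks:
        A = np.column_stack([X_chunk, y_chunk])
        G = A.T @ A
        gram = G if gram is None else gram + G
    return gram

def is_positive_definite(M):
    """Return True if a Cholesky factorization of M succeeds."""
    try:
        np.linalg.cholesky(M)
        return True
    except np.linalg.LinAlgError:
        return False

# Synthetic stand-in for the real data: intercept plus three predictors.
rng = np.random.default_rng(0)
n, p = 10_000, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

# Feed the data in chunks of 1000 rows.
chunks = [(X[i:i + 1000], y[i:i + 1000]) for i in range(0, n, 1000)]
gram = accumulate_gram(chunks)

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eigvals = np.linalg.eigvalsh(gram)
min_eig = eigvals[0]
pd_ok = is_positive_definite(gram)
```

On well-conditioned synthetic data like this, the smallest eigenvalue is positive and the Cholesky factorization succeeds. On my real data, the corresponding check fails.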
What I Have Done
I took the first 1 million observations and formed a Gram matrix from them, which is also not pd.
I then added an identity matrix to it, which (somehow) made it pd. Since the entries of the Gram matrix are extremely large, the added identity is a tiny perturbation, and this seems to solve the problem. In fact, when I compared the results given by sweep!() and sklearn.linear_model.LinearRegression(), they were essentially the same.
However, I am not sure whether this is a generally accepted approach or whether there is any rationale behind it.
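To make the comparison concrete, here is a minimal sketch of the identity-shift step on synthetic data, not the flight data. As assumptions for illustration, `np.linalg.solve` on the normal equations stands in for sweep!(), and `np.linalg.lstsq` stands in for sklearn.linear_model.LinearRegression(). The column scales are made up to mimic predictors with large values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Synthetic predictors on a large scale, so the entries of X'X dwarf 1.
X = np.column_stack([
    np.ones(n),                      # intercept
    rng.uniform(100, 3000, size=n),  # a distance-like column (made up)
    rng.normal(50, 10, size=n),      # another made-up predictor
])
beta_true = np.array([5.0, 0.01, -0.2])
y = X @ beta_true + rng.normal(size=n)

XtX = X.T @ X
Xty = X.T @ y

# Reference fit: ordinary least squares computed directly from X.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Shifted fit: add the identity to the Gram matrix, then solve.
shifted = XtX + np.eye(X.shape[1])
beta_shift = np.linalg.solve(shifted, Xty)

# Adding the identity raises every eigenvalue of X'X by exactly 1.
min_eig_before = np.linalg.eigvalsh(XtX)[0]
min_eig_after = np.linalg.eigvalsh(shifted)[0]
max_abs_diff = np.max(np.abs(beta_ls - beta_shift))
```

In this sketch the two coefficient vectors agree closely, which matches what I observed. Algebraically, solving (X'X + I) \beta = X'y is the same system that ridge regression with penalty 1 solves, but I do not know whether that is the right way to justify the step here.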