As a long-time R user, I was always under the impression that SVD is superior to an eigenvalue decomposition of the correlation matrix (see the documentation for prcomp and princomp) in terms of numerical accuracy and computational speed. The implementation of PCA in MultivariateStats.jl uses SVD if the number of dimensions is smaller than the number of observations.
What are the computational trade-offs between SVD and an eigenvalue decomposition of the covariance matrix?
I am well aware of the relation between the two. My question is about the computational aspects, i.e. numerical accuracy and computation time.
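For reference, both routes can be forced explicitly in MultivariateStats.jl. A minimal sketch, assuming the `method` and `maxoutdim` keyword arguments as described in the package documentation (observations stored as columns):

```julia
using MultivariateStats, Random

Random.seed!(1)
X = randn(5, 1_000)   # d = 5 variables, n = 1000 observations (one per column)

# Force each backend explicitly; :auto lets the package pick based on d and n.
M_svd = fit(PCA, X; method=:svd, maxoutdim=3)
M_cov = fit(PCA, X; method=:cov, maxoutdim=3)

# For well-conditioned data like this, the two backends should agree closely.
principalvars(M_svd) ≈ principalvars(M_cov)
```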
In R:
?prcomp
says:
…
Details:
The calculation is done by a singular value decomposition of the
(centered and possibly scaled) data matrix, not by using ‘eigen’
on the covariance matrix. This is generally the preferred method
for numerical accuracy.
…
SVD also gives the projections of the training data (the scores) for free. So I was always under the impression that SVD is superior, except for very large data, where it is easy to construct the covariance matrix incrementally (sketched below).
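A sketch of that incremental construction (a hypothetical helper, not from any package): the covariance matrix can be accumulated one batch at a time, in a single pass and with memory independent of n. Note that this textbook one-pass formula is itself prone to cancellation; Welford-style updates are more robust.

```julia
using LinearAlgebra

# Hypothetical one-pass accumulator: `batches` iterates over d×m blocks of columns.
function streaming_cov(batches, d)
    S = zeros(d, d)   # running sum of x*xᵀ
    m = zeros(d)      # running sum of x
    n = 0
    for B in batches
        S .+= B * B'
        m .+= vec(sum(B; dims=2))
        n += size(B, 2)
    end
    μ = m ./ n
    # Σ(x-μ)(x-μ)ᵀ = Σ x*xᵀ - n*μ*μᵀ; divide by n-1 for the unbiased estimate.
    (S .- n .* (μ * μ')) ./ (n - 1)
end

C = streaming_cov((randn(3, 100) for _ in 1:50), 3)  # ≈ I for standard normal data
```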
In exact arithmetic (no rounding errors, etc.), the SVD of A is equivalent to computing the eigenvalues and eigenvectors of AᵀA. However, computing the “covariance” matrix AᵀA squares the condition number, i.e. it doubles the number of digits that you lose to roundoff errors. Of course, if you start with 15 digits and your matrix has a condition number of 10², so that you are only losing about 2 digits, then squaring the condition number to 10⁴ still leaves you with 10+ digits. But if your matrix is badly conditioned, then you could have a big problem, especially if you have inaccurate data to start with. That’s why SVD routines typically work by bidiagonalization or similar techniques that avoid forming AᵀA explicitly.
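A classic way to see this loss is the Läuchli matrix, whose exact singular values are √(2+ε²) and ε. A minimal sketch:

```julia
using LinearAlgebra

ε = 1e-8                       # small enough that 1 + ε² rounds to 1 in Float64
A = [1.0 1.0; ε 0.0; 0.0 ε]    # Läuchli matrix: exact singular values √(2+ε²) and ε

σ_svd = svdvals(A)                      # [≈√2, ≈1e-8] via bidiagonalization
λ = eigvals(Symmetric(A'A))             # A'A rounds to [1 1; 1 1], eigenvalues 0 and 2
σ_eig = sqrt.(max.(reverse(λ), 0))      # [≈√2, 0] — the small singular value is destroyed
```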
However, for PCA my understanding is that there is a saving grace, because in PCA you are usually only interested in the few largest singular values. In particular (covered in, e.g., Lecture 31 of Trefethen and Bau, Numerical Linear Algebra), computing the k-th singular value σₖ via AᵀA increases the error (compared to bidiagonalization-based SVD) by a factor of ≈ σ₁/σₖ, where σ₁ is the largest singular value. (Roughly: a backward-stable eigensolver perturbs the eigenvalues λₖ = σₖ² of AᵀA by ≈ ε‖AᵀA‖ = ε σ₁², and since δσₖ = δλₖ/(2σₖ), the error in σₖ is ≈ ε σ₁²/σₖ, versus ≈ ε σ₁ for a backward-stable SVD.) Consequently, as long as you are only interested in singular values and vectors close in magnitude to σ₁, as is typical in PCA, working with the covariance matrix is not so bad.
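A small experiment consistent with that claim (a sketch with synthetic singular values spanning eight orders of magnitude; the construction and variable names are mine):

```julia
using LinearAlgebra, Random

Random.seed!(0)
U = Matrix(qr(randn(200, 20)).Q)      # 200×20, orthonormal columns
V = Matrix(qr(randn(20, 20)).Q)       # 20×20, orthogonal
σ = 10.0 .^ range(0, -8; length=20)   # prescribed singular values: 1 down to 1e-8
A = U * Diagonal(σ) * V'

σ_svd = svdvals(A)                                        # descending
σ_eig = sqrt.(max.(reverse(eigvals(Symmetric(A'A))), 0))  # descending, clamped at 0

# The leading singular values agree to near machine precision either way,
# while the AᵀA route loses roughly a factor σ₁/σₖ more accuracy in σₖ,
# so the smallest values lose far more digits than the largest.
[σ abs.(σ_svd .- σ) ./ σ abs.(σ_eig .- σ) ./ σ]
```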