[ANN] FastRandPCA - Fast Randomized PCA for Sparse Data

After struggling to find a working Julia implementation of PCA for sparse matrices, I found one in Matlab (frPCA_sparse) and translated it to Julia: FastRandPCA.jl. On my test it works pretty fast: 186s for a 500 x 10_000_000 matrix.
The implementation requires the data to be rectangular (i.e., one dimension significantly smaller than the other), and the results are stochastic (as you can guess from the name).

This is my first public package, so any feedback is appreciated :slight_smile:

4 Likes

Thanks for your efforts. Here are a few comments:

In general, you seem to be using very restrictive type declarations. e.g.

function eigSVD(A::Union{SparseMatrixCSC{T}, Matrix{T}}, k::Int=-1)::SVDRes where {T<:Real}

could just be

function eigSVD(A::AbstractMatrix{<:Real}, k::Integer=-1)

so that it accepts any matrix type, or even <:Number, since it’s not clear why the numbers have to be real. And return-type declarations are rarely needed.

And local type declarations like d::Vector{Float64} = sqrt.(λ) are usually a bad idea. (Why force the result to be Float64, or a Vector?)

Such type declarations have no impact on performance, since Julia automatically infers the types anyway for type-stable code. See also the section on argument-type declarations in the Julia manual.
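To illustrate the point, here is a hypothetical helper (`colnorms` is not from the package) with a relaxed signature: no `Union`, no return-type annotation, and it still works on both dense and sparse inputs because Julia infers concrete types at each call site.

```julia
using LinearAlgebra, SparseArrays

# One generic method covers any matrix type; Julia specializes it per call.
colnorms(A::AbstractMatrix{<:Number}) = [norm(c) for c in eachcol(A)]

dense_norms = colnorms(rand(3, 2))          # dense Matrix{Float64}
sparse_norms = colnorms(sprand(3, 2, 0.5))  # SparseMatrixCSC, no extra method needed
```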

This seems like an odd type to use for a return value:

SVDRes = NamedTuple{(:projection, :values, :vectors), Tuple{Matrix{Float64}, Vector{Float64}, Matrix{Float64}}}

Why not declare an actual named struct? Or better yet, just use LinearAlgebra.SVD(U, σ, V) so that you match the return type of the LinearAlgebra.svd function?
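For instance, wrapping the factors in the standard `SVD` type (a toy sketch with identity factors, not the package's actual code) gives the same fields as `LinearAlgebra.svd` and still supports destructuring:

```julia
using LinearAlgebra

# Toy factors: identity matrices and a descending spectrum.
U, σ, V = Matrix(1.0I, 3, 3), [3.0, 2.0, 1.0], Matrix(1.0I, 3, 3)
F = SVD(U, σ, Matrix(V'))   # SVD stores U, S, and Vt

F.U                          # same field names as LinearAlgebra.svd results
U2, S2, V2 = F               # destructuring works, like a NamedTuple
```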

if size(A, 1) > size(A, 2)
    A = A'
    trans = true
end

This makes the function non-typestable. Better to do something like:

if size(A, 1) > size(A, 2)
    F = pca(A')
    return SVD(F.Vt', F.values, F.U') # transpose
end

Note also that eigSVD, since it computes the SVD via the eigenvalues of A'*A, squares the condition number of A. This is often a bad idea. It’s better to call svd(A) directly in the dense case, or to use an iterative algorithm designed specifically for the SVD, such as svdsolve in KrylovKit.jl.
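A toy illustration of the squaring effect, with a diagonal matrix chosen so the condition numbers are easy to read off:

```julia
using LinearAlgebra

# Singular values 1 and 1e-6, so cond(A) = 1e6.
A = [1.0 0.0; 0.0 1e-6]

c1 = cond(A)       # ≈ 1.0e6
c2 = cond(A' * A)  # ≈ 1.0e12: the small singular value is pushed toward eps
```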

3 Likes

Thank you so much for the detailed feedback!

that it accepts any matrix type, or even <:Number, since it’s not clear why the numbers have to be real. And return-type declarations are rarely needed.

Somehow I thought that using abstract types in parameters would lead to type instability. But it looks like @code_warntype is fine with it, so I’ll change it to AbstractMatrix{<:Number}. Though I don’t think it will work with complex numbers, to be honest.

Why not declare an actual named struct? Or better yet, just use LinearAlgebra.SVD(U, σ, V) so that you match the return type of the LinearAlgebra.svd function?

SVD is a great idea, thank you! I’ll totally use it in the next release. As for the struct, I preferred a NamedTuple because it can be unpacked. Though I see that LinearAlgebra.SVD supports unpacking as well.

This makes the function non-typestable. Better to do something like:

Hm, I don’t see why it’s the case, and @code_warntype also doesn’t highlight any problems. Anyway, I agree that your solution is more elegant, so I updated the code :slight_smile:

Note also that eigSVD, since it computes the SVD via the eigenvalues of A'*A, squares the condition number of A. This is often a bad idea. It’s better to call svd(A) directly in the dense case, or to use an iterative algorithm designed specifically for the SVD, such as svdsolve in KrylovKit.jl.

Well, the code is designed for sparse matrices, and I’d say it’s just a side effect that it also works for dense ones. I think it’s up to the user to use another solver if they see fit. And I particularly don’t want to include any heavy dependencies, such as KrylovKit.

Even for sparse matrices, A^T A squares the condition number. I’m not sure why you consider KrylovKit a “heavy dependency” and not IterativeSolvers.jl?

Also, since you are computing the SVD of A^T Q, which is presumably a small matrix even if A is large and sparse (it had better be, since it is dense because your Q is dense!), wouldn’t you just use the dense SVD on this?
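To make the size argument concrete, here is a sketch under the thread’s assumptions (the dimensions and sparsity are made up): A is large and sparse, Q is a thin dense sketch, so A'Q is a small dense matrix and a plain dense svd on it is cheap.

```julia
using LinearAlgebra, SparseArrays

A = sprand(10_000, 200, 0.01)   # large, sparse (hypothetical sizes)
Q = randn(10_000, 20)           # thin dense sketch
B = A' * Q                      # 200×20 dense matrix
F = svd(B)                      # dense SVD is trivial at this size
```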

How does your function compare to calling KrylovKit.svdsolve directly on A?

Note also that your use of randn means that you will automatically promote to Float64. Better to use randn(eltype(A), n, k+s) so that it uses the type of A. (For example, if A is single-precision then you want to use single precision. In general, the way to write type-generic code in Julia is to use the types of the arguments to determine the types used for intermediate computations and results.)
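A minimal sketch of the difference (variable names are illustrative):

```julia
A = rand(Float32, 100, 10)

# randn(10, 4) would silently promote everything downstream to Float64;
# passing eltype(A) keeps the whole computation in the input's precision.
Ω = randn(eltype(A), size(A, 2), 4)
Y = A * Ω   # stays Float32
```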

Also, since you are computing the SVD of A^T Q, which is presumably a small matrix even if A is large and sparse, wouldn’t you just use the dense SVD on this?

The whole idea came from the RSVDPACK paper, chapter 3.4. The claim is that for very large matrices, eigen(A' * A) is much faster than svd(A). And this is easy to check:

A = randn(100000, 1000);
@time svd(A);
@time eigen(A' * A);

20.607551 seconds (12 allocations: 1.527 GiB, 0.16% gc time)
2.657308 seconds (17 allocations: 30.877 MiB)

And it gets worse for larger matrices.

As for the conditioning problem that’s what they write:

A disadvantage of using this method is that the matrix BB^∗ has essentially the square of the condition number of B. As a result, very small singular values of A near machine precision may not be properly resolved. This is an issue only if A is expected to have very small singular values amongst σ_1, . . . , σ_k.

That’s a bit concerning, but in the end, svd scales really badly with matrix size, and we need to run it on A' * Q, which has few columns but as many rows as A. Given that I personally need the method to work with matrices of >10M rows, using the normal svd is not an option…


Even for sparse matrices, A^T A squares the condition number. I’m not sure why you consider KrylovKit a “heavy dependency” and not IterativeSolvers.jl?

IterativeSolvers just doesn’t have any real dependencies, and KrylovKit has more of them, including GPUArraysCore. Though it’s a minor thing.

How does your function compare to calling KrylovKit.svdsolve directly on A?

On the synthetic example A = randn(100000, 1000), for the top 50 values, eigSVD returns the same eigenvalues as svdsolve or eigsolve(A' * A, ...), but eigSVD takes 2.5s, eigsolve 1.7s, while svdsolve takes 74s. Surprisingly, svdsolve is much slower than the full svd.
However, in these tests I observed that IterativeSolvers.lobpcg sometimes fails, while eigsolve behaves much more stably. So I guess I’ll switch to KrylovKit.eigsolve in the end.


Note also that your use of randn means that you will automatically promote to Float64. Better to use randn(eltype(A), n, k+s) so that it uses the type of A.

While I see how it helps consistency, I’m not sure it would be correct in this case. Multiplying large matrices involves summing many terms, which can overflow. In the example below, the values in A are in the range 0–1000 and Q is standard normal, yet A' * Q consists entirely of Inf. And if we use low-precision floats to save memory, type promotion makes total sense to me: A is large, so it’s Float16, but Q and A' * Q are much smaller, so we can afford to make them Float64. Same logic as with sparsity.

A = rand(Float16, 100000, 200) .* 1000;
Q = randn(eltype(A), size(A, 1), 10);
(A' * Q)[1:2, 1:5]
2×5 Matrix{Float16}:
 Inf  Inf  Inf  -Inf   Inf
 Inf  Inf  Inf  -Inf  -Inf

That’s not surprising. It’s basically the same reason that the normal equations A^T A x = A^T b are the fastest way to solve least-squares problems, yet people use QR or SVD instead, to avoid squaring the condition number. See e.g. Efficient way of doing linear regression - #33 by stevengj and [ANN] LinearRegressionKit - #10 by stevengj

But, as you say, if one is computing only a few of the biggest singular values, then maybe this isn’t as big of an issue? Especially if your accuracy needs are low.

That shouldn’t be surprising — iterative methods are mostly advantageous for sparse problems (or other problems with fast matrix–vector multiply), not for dense matrices.

(Note that you should also be careful to compare the accuracies when comparing randomized SVD to something like svdsolve, since otherwise you may be comparing apples and oranges.)

It’s reasonable to promote to a higher precision in some cases (especially Float16), but what if the user gives you quad-precision data? Then you are downgrading the precision. (And if the data is complex it is completely wrong.)

For example, if you wanted to promote to at least single precision, you could do promote_type(Float32, eltype(A)).
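That is, promote_type picks the wider of the two types, so Float16 inputs get upgraded while Float64, BigFloat, or complex inputs are left alone:

```julia
# promote_type returns the smallest type both arguments can be converted to.
T16 = promote_type(Float32, Float16)    # Float32: low precision is upgraded
T64 = promote_type(Float32, Float64)    # Float64: higher precision is kept
Tbig = promote_type(Float32, BigFloat)  # BigFloat: quad-like precision is kept
Tc = promote_type(Float32, ComplexF64)  # ComplexF64: complexness is preserved

Ω = randn(T16, 2, 2)                    # sketch matrix in the promoted type
```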

(Float16 is both slow and has limited range (< 10^5). Float32, however, is fast and pretty useful for computation (it doesn’t overflow until 3.4e38), and it’s a shame not to allow it.)

1 Like