SVD on a matrix that does not fit into memory

I am looking for a function that performs an SVD on a matrix that does not fit into memory. I am interested in using just one computer, although an answer for a cluster may be useful to other users.

The Python dask package is able to do this in two ways:

  • divide and conquer (for tall-and-skinny matrices)
  • compressed SVD (for all kinds of matrices)

So far, I have seen the YAXArrays.jl library, but it does not implement SVD. It seems to be able to perform operations on matrices that do not fit into memory, but those operations seem to be limited to applying a function to every element of the matrix.

I have also tested the Mmap.jl library with HDF5 and applied svd() from LinearAlgebra.jl, but this seems to create a copy of the matrix in RAM internally. Even svd!() can internally allocate matrices in RAM.

I have read a bit about Dagger.jl, but it does not seem to implement this either.

So, can I achieve my purpose by combining existing Julia libraries, or does this have to be implemented? In the latter case, where should I make the request?


This is not the full SVD, but rather a rank-k approximation given by the k largest singular values.

If your desired k is small enough that k vectors do fit into memory (even though the whole matrix does not), then you could use e.g. KrylovKit.svdsolve or TSVD.tsvd, given only a function to multiply your large matrix by a vector.
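For concreteness, a minimal sketch of the call shape on a small in-memory stand-in matrix (here the matrix is passed directly; passing a function instead is shown further down the thread, and keyword names may differ between package versions):

using LinearAlgebra, KrylovKit, TSVD
A = randn(1000, 200)                      # stand-in for the large matrix
vals, lvecs, rvecs, info = svdsolve(A, 5) # 5 largest singular triplets (KrylovKit)
U, s, V = tsvd(A, 5)                      # 5 largest singular triplets (TSVD)
s ≈ svdvals(A)[1:5]                       # sanity check against the dense SVD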

To multiply by a matrix that does not fit into memory, you need an out-of-core algorithm. I’m not sure if any are implemented in Julia already, but for matrix–vector products they would be pretty easy to implement yourself: just read the matrix into memory in chunks (e.g. chunks of rows), multiply each chunk by your vector, and then assemble the resulting pieces of the output.
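A minimal sketch of such an out-of-core product, assuming a hypothetical helper read_rows(rows) that loads the requested rows of the m×n matrix from disk into a dense block:

# Out-of-core A*x, reading the matrix in chunks of rows.
function chunked_matvec(read_rows, m, x; chunk_rows=10_000)
    y = similar(x, m)                      # output vector of length m
    for i in 1:chunk_rows:m
        rows = i:min(i + chunk_rows - 1, m)
        y[rows] = read_rows(rows) * x      # each row chunk yields a slice of y
    end
    return y
end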

If you have a cluster, then you can use a distributed-memory linear-algebra package like Elemental.jl, or use an iterative package like KrylovKit or TSVD with DistributedArrays.jl.

(Nowadays, I think most people trying to solve this sort of problem would use a cluster or other parallel system rather than using out-of-core algorithms with filesystem storage.)


Indeed. Please see the Elemental.jl README; there are several SVD examples: both a full distributed SVD and a truncated SVD via TSVD.jl, with Elemental used for the linear algebra. (I haven’t run the examples in a long time, so I hope they still work.)


If your matrix is tall, and A' * A fits in memory, then you could compute the eigenvalue decomposition of that instead. This sacrifices some precision, but it’s very easy to do. The matrix A should be stored row-wise, so that it is efficient to load a few rows at a time into memory. Since A' * A is symmetric, you can use LinearAlgebra.BLAS.syrk! to compute only half of it.
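As a rough sketch of this approach (assuming Float64 data, an n small enough that the n×n Gram matrix A' * A fits in RAM, and a hypothetical helper load_rows(rows) that reads the requested rows of A from disk into a dense block):

using LinearAlgebra

function gram_svd(load_rows, m, n, k; chunk_rows=10_000)
    G = zeros(n, n)                             # accumulates A' * A (upper triangle)
    for i in 1:chunk_rows:m
        Ai = load_rows(i:min(i + chunk_rows - 1, m))
        BLAS.syrk!('U', 'T', 1.0, Ai, 1.0, G)   # G += Ai' * Ai
    end
    λ, V = eigen(Symmetric(G, :U))              # eigenvalues in ascending order
    idx = n:-1:n-k+1                            # indices of the k largest
    σ = sqrt.(max.(λ[idx], 0))                  # singular values
    Vk = V[:, idx]                              # right singular vectors
    U = zeros(m, k)                             # left singular vectors, U = A * Vk * Σ⁻¹
    for i in 1:chunk_rows:m
        rows = i:min(i + chunk_rows - 1, m)
        U[rows, :] = load_rows(rows) * Vk
    end
    U ./= σ'                                    # scale each column by 1/σ
    return U, σ, Vk
end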


Note that this method gives correct singular values, but not necessarily correct singular vectors. Example: if A is any matrix with orthonormal columns, then the method just returns eig(I).

Nope. Orthonormal eigenvectors of A^T A (for nonzero eigenvalues) are exactly the right singular vectors \hat{V} of the compact SVD. Once you have \hat{V}, the left singular vectors are simply \hat{U} = A\hat{V}\hat{\Sigma}^{-1}.

In your example, if A has orthonormal columns, the correct singular vectors are indeed simply \hat{V} = I (with all singular values equal to 1), and \hat{U} = A.
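A quick in-memory check of this with a random tall matrix:

using LinearAlgebra
A = randn(100, 8)
λ, V = eigen(Symmetric(A' * A))   # eigenvalues in ascending order
σ = sqrt.(reverse(λ))             # singular values, descending
V = V[:, end:-1:1]                # reorder the eigenvectors to match
U = (A * V) ./ σ'                 # U = A * V * Σ⁻¹ (the formula above)
σ ≈ svdvals(A)                    # true
norm(U * Diagonal(σ) * V' - A)    # ≈ 0, so (U, σ, V) is a valid SVD of A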


You are correct, my bad. I was mixing this up with the variant where one computes V = eigvecs(A'*A) and U = eigvecs(A*A') separately and the two are not guaranteed to match. The method suggested here instead works correctly. It is still less stable than the standard methods based on orthogonal transformations, but for such a large matrix this is unlikely to matter.

So, if I understand correctly, I can achieve my purpose by using, for example, KrylovKit.svdsolve and passing it a matrix generated by Mmap.

I have tested this with a 3690 × 404640 array of Float32 (about 6 GiB) in a NetCDF file, accessed as a memory-mapped array by means of HDF5.readmmap. I launched the calculation as:

using HDF5, KrylovKit
fid = h5open("nc_file.nc", "r")
array = HDF5.readmmap(fid["nc_var_name"])
# Make lons and lats a single dimension
array = reshape(array, (size(array)[1] * size(array)[2], size(array)[3]))
array = transpose(array) # (time x space). It was (space x time)
vals, lvecs, rvecs, info = KrylovKit.svdsolve(array, 100, krylovdim=200)

It took a while…, but only 1 GiB of RAM was used. So I think this is great!

Python dask is faster, but its memory consumption increases with the number of chunks. So if your matrix is huge, you have to make a lot of chunks so that each of them fits into memory; but the more chunks you have, the more memory is consumed, and the calculation may not be feasible.

So using Julia with KrylovKit is a solution when dask is not feasible.

P.S.: It seems it is not necessary to implement a mul function.

This is not ideal. The default matrix-multiplication algorithm was designed assuming that the array is in RAM (it does some “blocking”, but oriented towards relatively small CPU caches). For arrays on disk, you really want to read the array into memory in a large contiguous chunk, use the built-in BLAS (in-memory) algorithm on this chunk, then read in the next chunk and add in the results.

(You could still implement this by mmapping the array, but then manually selecting chunks of this array with a @view to multiply one by one, adding up the results.)


Well, I am relatively new to Julia, so I have some questions:

  • Do I have to implement the *() function?
  • If I do it, will KrylovKit use it automatically?
  • What algorithm would be appropriate? I have found Strassen’s algorithm, but I do not know whether a more appropriate one exists.
  • Do you think it would be a good idea to implement it over YAXArrays?

  • You don’t have to call it *. Just write a function mul_by_cols(A, x) to do the multiplication by chunks, which you can apply to your mmapped matrix A.

  • No, you pass it to KrylovKit. Instead of passing A, pass x -> mul_by_cols(A, x), to tell it to use your function with your matrix A.

  • Don’t use Strassen. Just do ordinary matrix multiplication by loading in contiguous chunks of the matrix — chunks of columns, since Julia matrices are column-major — multiply them by the corresponding chunks of x, then add up the results.

  • Not sure why you are bringing YAXArrays into this. I would keep it simple — just mmap your matrix from the file to give you a Matrix{Float64} (assuming you are in double precision).

For example, this computes A*x by reading the matrix A in contiguous 1 GiB chunks of columns (2^27 numbers in double precision):

using LinearAlgebra
@views function mul_by_cols(A, x; chunk_size=2^27)
    m, n = size(A)
    nchunk = min(cld(chunk_size, m), n)
    y = A[:, 1:nchunk] * x[1:nchunk]
    j = nchunk + 1
    while j ≤ n
        jend = min(j+nchunk-1, n)
        mul!(y, A[:,j:jend], x[j:jend], 1, 1) # y += A[chunk] * x[chunk]
        j = jend + 1
    end
    return y
end

Wow! Thank you for such a superb job!
I will try it.

I am sorry, but I am unable to find the correct arguments for svdsolve. I have tried a lot of combinations. For example:

vals, lvecs, rvecs, info = KrylovKit.svdsolve(x -> mul_by_cols(array, array[:,begin]), 100, krylovdim=200)
ERROR: LoadError: MethodError: no method matching (::var"#2#3")(::Vector{Float64}, ::Val{true})

vals, lvecs, rvecs, info = KrylovKit.svdsolve(x -> mul_by_cols(array, x), rand(eltype(array), size(array,1)), 100, krylovdim=200)
ERROR: LoadError: MethodError: no method matching (::var"#8#9")(::Vector{Float32}, ::Val{true})
Closest candidates are:
  (::var"#8#9")(::Any) 

vals, lvecs, rvecs, info = KrylovKit.svdsolve(x -> mul_by_cols(array, x), 100, krylovdim=200)
ERROR: LoadError: MethodError: no method matching (::var"#2#3")(::Vector{Float64}, ::Val{true})
Closest candidates are:
  (::var"#2#3")(::Any)

That’s because svdsolve, according to its documentation, requires a function that computes either Ax or A^T x (A^*x for complex A), depending on a second argument.

i.e. you need to change mul_by_cols to:

using LinearAlgebra
@views function mul_by_cols(A, x, ::Val{adj}; chunk_size=2^27) where {adj}
    m, n = size(A)
    nchunk = min(cld(chunk_size, m), n)
    y = similar(x, typeof(zero(eltype(A)) * zero(eltype(x))), adj ? n : m)
    if adj
        mul!(y[1:nchunk], A[:, 1:nchunk]', x)
    else
        mul!(y, A[:, 1:nchunk], x[1:nchunk])
    end
    j = nchunk + 1
    while j ≤ n
        jend = min(j+nchunk-1, n)
        if adj
            mul!(y[j:jend], A[:,j:jend]', x) # y[chunk] = A[chunk]' * x
        else
            mul!(y, A[:,j:jend], x[j:jend], 1, 1) # y += A[chunk] * x[chunk]
        end
        j = jend + 1
    end
    return y
end

Then you can compute the largest 5 singular values with KrylovKit by:

svdsolve((x,adj) -> mul_by_cols(A, x, adj), size(A,1), 5)

It’s always helpful to read the manual to figure out precisely what arguments are expected!


Wow! Thanks again for the job!
I had to review a bit of the theory of adjoints, and am curious about a detail in the implementation:
As I understand it, the adjoint of a column vector is a row vector, so the adjoint of A·x, where A is a matrix and x is a column vector, would be (primes denoting adjoints) x’·A’. So, shouldn’t the following code lines for the adjoint calculation:

mul!(y[1:nchunk], A[:, 1:nchunk]', x) # before the loop
mul!(y[j:jend], A[:,j:jend]', x) # in the loop

be

mul!(y, x[1:nchunk]', A[:, 1:nchunk]') # before the loop
mul!(y, x[j:jend]', A[:,j:jend]', 1, 1) # in the loop

?
And the result of x’·A’ is not an n-element row vector but an m-element row vector, so the line:

y = similar(x, typeof(zero(eltype(A)) * zero(eltype(x))), adj ? n : m)

would also be changed into

y = similar(x, typeof(zero(eltype(A)) * zero(eltype(x))), m)

But, when I make these changes, it produces the following error:

ERROR: LoadError: BoundsError: attempt to access 3690-element Vector{Float32} at index [1:36374]

So I think I am messing something up in the theory.

For the SVD calculation, what you need is both x \mapsto Ax and y \mapsto A^* y operations, where A^* = \overline{A^T} is the adjoint (conjugate-transpose) operation (denoted A' in Julia).

In contrast, x^* A^* = (Ax)^* is just the conjugate-transpose (adjoint) of the vector Ax — there would be no need for KrylovKit to ask you to provide a separate linear operation for this, because it could just conjugate-transpose the result of x \mapsto Ax.

Philosophical note: In a more abstract sense, if you have an inner product \langle u, v \rangle (colloquially, a “dot product” u \cdot v, especially in the real case), then the adjoint of a linear operator A is the linear operation A^* such that \langle y, Ax \rangle = \langle A^* y, x \rangle for any x,y — it “moves” the linear operator A to the “other side” of a dot product. This is the reason why conjugate-transposition is an important operation for matrices, as opposed to some other rearrangement of the elements (like “rotate the matrix by 90°”), and it also allows you to extend adjoints to more general vector spaces equipped with an inner product.
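A small numerical illustration of that defining identity, using the standard Euclidean inner product (Julia’s dot conjugates its first argument):

using LinearAlgebra
A = randn(ComplexF64, 5, 3)
x = randn(ComplexF64, 3)
y = randn(ComplexF64, 5)
dot(y, A * x) ≈ dot(A' * y, x)   # ⟨y, Ax⟩ == ⟨A'y, x⟩: A' moves A to the other side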


Thank you for the theoretical clarification. I should still dive deeper into the theory to understand it better, but your explanation is enough for now.

So now I have a question about Julia itself (this is becoming a very complete thread and I hope it will be useful for many people): what is the benefit of using similar when we provide the element type and the size? Is it not just the same as

Vector{typeof(zero(eltype(array)) * zero(eltype(x)))}(undef, adj ? n : m)

?

If the user has a custom array type, then similar may have a specialized method that returns something different from Array.
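For illustration, a toy sketch (the TrackedVector type here is made up) of how a custom array type can hook into similar so that generic code allocates its own kind of output:

struct TrackedVector{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(v::TrackedVector) = size(v.data)
Base.getindex(v::TrackedVector, i::Int) = v.data[i]
Base.setindex!(v::TrackedVector, x, i::Int) = (v.data[i] = x)
Base.similar(v::TrackedVector, ::Type{T}, dims::Dims{1}) where {T} =
    TrackedVector{T}(Vector{T}(undef, dims[1]))

x = TrackedVector(rand(Float32, 4))
typeof(similar(x, Float64, 3))      # TrackedVector{Float64}
typeof(Vector{Float64}(undef, 3))   # always a plain Vector{Float64}, whatever x is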

Well, so now to the performance of the mul_by_cols function. First, I have to mention that I have an SSD, so operations on memory-mapped arrays are not penalized much.
The computation time when not using the mul_by_cols function, but calling:

@time vals, lvecs, rvecs, info = KrylovKit.svdsolve(array, 100, krylovdim=200)

is less than 2 minutes:

96.338268 seconds (24.16 M allocations: 1.750 GiB, 0.74% gc time, 5.55% compilation time)

and

ConvergenceInfo: 118 converged values after 1 iterations and 400 applications of the linear map

(I requested 100 values.) I could also see that all CPU cores were working, so it is clear that parallelization is used when a matrix, instead of a function, is passed to svdsolve. I have 6 cores with two threads each, but ran julia with the option --threads 4, and 4 threads were working at 100%.

On the other hand, the computation time when using the mul_by_cols is a bit more than 18 minutes:

1094.901849 seconds (16.19 M allocations: 1.193 GiB, 0.02% gc time, 0.59% compilation time)

and

ConvergenceInfo: 8 converged values after 2 iterations and 84 applications of the linear map;

(I requested 5 values.) I could also see that only one CPU core was working, so there is some magic when a matrix, instead of a function, is passed to svdsolve. This magic includes parallelization of the multiplication, but I think there may be other optimizations too, because 2 minutes when requesting 100 values versus 18 minutes when requesting 5 seems too large a difference to be explained by parallelization alone.

Interesting. Internally mul_by_cols should(?) be using BLAS multiplication, which should be multi-threaded. But maybe if you pass the raw mmapped matrix (so that it just calls mul! directly) then it gets greater benefit from parallelization since it parallelizes the whole matrix–vector product and not just individual chunks?
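One thing that may be worth checking (just a guess): whether the BLAS thread count matches the number of Julia threads, since the two are configured separately:

using LinearAlgebra
BLAS.get_num_threads()    # threads currently used by the underlying BLAS
BLAS.set_num_threads(4)   # e.g. match the 4 Julia threads used above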