CuArrays with KrylovKit

What is the recommended way of diagonalizing linear operators on CuArrays using KrylovKit? I am targeting the first 100 eigenvalues of a large matrix. Operating directly on CuArrays leads to very large allocations on the GPU. I do this in the following way:

    function gpu_wrapper!(y_d::CuArray{Tcalc}, x_d::CuArray{Tcalc})
        CUDA.@sync begin
            CUDA.@cuda threads=num_threads blocks=num_blocks two_body_operator_action_gpu!(y_d, x_d, creation_ops_d, creation_op_phase_helpers_d, annhilation_ops_d, annhilation_phase_helpers_d, coulomb_matrix_elements_d, lll_basis_d)
        end
        return
    end
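(Here num_threads and num_blocks are set elsewhere; a typical choice would be something along the lines of the sketch below, with the exact numbers being placeholders rather than the values I actually use.)

    # Placeholder launch configuration; the real values are chosen elsewhere.
    num_threads = 256
    num_blocks  = cld(size(lll_basis, 1), num_threads)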

I wrap this function using LinearMaps and pass it to KrylovKit's eigsolve with an initial guess provided as a CuArray (I set krylovdim to 110). The memory usage on the GPU went beyond about 36 GB before I shut the process down. However, if I do something like the following instead:

    function gpu_wrapper!(y::Vector{Tcalc}, x::Vector{Tcalc})
        x_d = CuArray{Tcalc}(undef, size(lll_basis,1))
        y_d = CuArray{Tcalc}(undef, size(lll_basis,1))
        copy!(x_d, x)
        CUDA.@sync begin
            CUDA.@cuda threads=num_threads blocks=num_blocks two_body_operator_action_gpu!(y_d, x_d, creation_ops_d, creation_op_phase_helpers_d, annhilation_ops_d, annhilation_phase_helpers_d, coulomb_matrix_elements_d, lll_basis_d)
        end
        copy!(y, y_d)
        return
    end

With this version, the memory usage on the GPU is fairly contained: about 2.5-3.5 GB.

The only problem is, as you might expect, that this implementation is slower than the first one: for the exact operator I am considering, the time spent copying the array to the device is about a quarter of the total time.
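For reference, the wrapping described above looks roughly like the sketch below; the keyword arguments are approximate rather than my literal code.

    using CUDA, LinearMaps, KrylovKit

    N = size(lll_basis, 1)

    # ismutating = true tells LinearMaps that gpu_wrapper!(y, x) writes its result into y
    H = LinearMap{Tcalc}(gpu_wrapper!, N; ismutating = true, issymmetric = true)

    x0 = randn(Tcalc, N)
    x0 ./= sqrt(x0' * x0)
    # for the all-GPU version the starting vector is a CuArray instead: x0 = CuArray(x0)

    vals, vecs, info = KrylovKit.eigsolve(H, x0, 100, :SR; issymmetric = true, krylovdim = 110)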

What is the size/length of the vectors involved? And what is Tcalc: single or double precision, real or complex?

The dimension of the vectors is 7142568 ~ 10^7. Tcalc is Float32.

I think that this memory usage can then be expected: the memory size of 110 vectors of length 7142568 with Float32 entries is 3142729920 bytes, i.e. about 3 GB.
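As a quick sanity check of that estimate:

    110 * 7142568 * sizeof(Float32)   # krylovdim * vector length * bytes per Float32 = 3142729920 bytes ≈ 2.9 GiB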

Note that KrylovKit stores the full Krylov subspace, even with the Lanczos algorithm (actually, Lanczos is merely the name of the orthogonalisation process used to build the Krylov subspace; the eigenvalue algorithm itself is better described as Krylov-Schur). There is some basic support for building a Krylov subspace in which the older vectors are discarded, but this is not really used in the eigsolve routine.

That reply was too quick; I now see that you say 36 GB, not 3 GB. Is that the total amount of allocations being performed, or really the memory in use at a given point in the process?

That's the total (I read it off nvidia-smi).

> I think that this memory usage can then be expected: the memory size of 110 vectors of length 7142568 with Float32 entries is 3142729920 bytes, i.e. about 3 GB.

This was my estimate as well.

I checked the memory usage of KrylovKit's eigsolve alone, using CUDA.@time, for a vector of similar dimension (7040196), with 30 Krylov subspace vectors (the default) and solving for one eigenvalue: it was about 8 GB. With 5 Krylov subspace vectors it was about 20 GB.

Also, the memory usage by Julia remains at 20 GB even after eigsolve has returned. I run the program inside a let block:

    let
        num_particles = 14
        Lztotal = 0//1
        Q = (num_particles//2 - 2)//2 + (num_particles - 1)
        num_eigvals = 1

        ham = generate_hamiltonian(num_particles, Lztotal, Q)
        x0 = randn(Tcalc, size(ham, 1))
        x0 ./= sqrt(x0'*x0)
        x0_d = CuArray(x0)
        # KrylovKit.eigsolve(ham, x0_d, num_eigvals, :SR; tol=1e-8, issymmetric=true, krylovdim=max(30, num_eigvals + 10), maxiter=1000)
        CUDA.@profile CUDA.@time KrylovKit.eigsolve(ham, x0_d, num_eigvals, :SR; tol=1e-8, issymmetric=true, krylovdim=5, maxiter=1000)
    end

This is the output from CUDA.@time for 5 Krylov subspace vectors:

    233.890395 seconds (46.51 k CPU allocations: 2.661 MiB) (743 GPU allocations: 19.486 GiB, 0.00% memmgmt time)


Maybe there’s something in my code? I am pasting it here just for reference:

    function two_body_operator_action_gpu!(y, x, creation_ops, creation_op_phase_helpers, annhilation_ops, annhilation_phase_helpers, two_body_op_matrix_elements, lll_basis)
        # Grid-stride loop: each thread handles basis states index, index+stride, index+2*stride, ...
        index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        stride = gridDim().x * blockDim().x

        i = index
        while i <= length(y)
            @inbounds y[i] = zero(Tcalc)
            @inbounds fock_state = lll_basis[i]
            op_iter = 1
            while op_iter <= length(annhilation_ops)
                # Remove the annihilated orbitals, then check that the annihilation operators act
                # on occupied orbitals and the creation operators on unoccupied ones.
                @inbounds annhilated_state = ~annhilation_ops[op_iter] & fock_state
                @inbounds if isequal(annhilation_ops[op_iter], annhilation_ops[op_iter] & fock_state) && isequal(creation_ops[op_iter], creation_ops[op_iter] & (~annhilated_state))
                    # Locate the resulting Fock state in the sorted basis and accumulate the
                    # matrix element with the fermionic sign.
                    @inbounds y[i] += x[searchsortedfirst(lll_basis, annhilated_state | creation_ops[op_iter])] *
                        two_body_op_matrix_elements[op_iter] *
                        (-1)^(count_ones(annhilation_phase_helpers[op_iter] & fock_state) +
                              count_ones(creation_op_phase_helpers[op_iter] & annhilated_state))
                end
                op_iter += 1
            end
            i += stride
        end
        return
    end

lll_basis is of the same dimension (and type UInt32) as the vectors passed to eigsolve. The other vectors are orders of magnitude smaller.
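(The _d-suffixed arrays used in the kernel launch are just device-side copies of the corresponding host arrays, e.g. something like the following; the exact construction in my code may differ.)

    lll_basis_d = CuArray(lll_basis)   # UInt32 basis uploaded to the GPU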

I am still confused about what your numbers mean. For example, the total allocation reported by @time (definitely by Base.@time, and I assume also by CUDA.@time) is the total memory that was requested over the whole duration of that block.

One particular aspect of KrylovKit is that (currently) it allocates new vectors for the Krylov subspace over the different iterations, rather than recycling the old ones. This is also clear from the fact that the function implementing the linear operator has to be given as a function with a single argument (the input) that returns a new output vector. So it is actually this function that allocates the new vector; the subsequent orthonormalisation of this vector against the current basis of the Krylov subspace happens in place and should not cause new allocations.
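Schematically, using the wrapper from above (the name apply_op is just for illustration, not something defined in KrylovKit):

    # Single-argument form applied by eigsolve: every application allocates a fresh output.
    function apply_op(x_d::CuArray{Tcalc})
        y_d = similar(x_d)        # <- the per-application GPU allocation
        gpu_wrapper!(y_d, x_d)    # the in-place kernel launch from the wrapper above
        return y_d
    end

    # KrylovKit.eigsolve(apply_op, x0_d, num_eigvals, :SR; issymmetric = true, krylovdim = 5)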

But this means that the total amount of allocated memory scales roughly as “memory of a single vector * krylovdim * numiter” (not exactly, since not the whole Krylov subspace is rebuilt in every iteration).
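In fact the CUDA.@time output above is consistent with this, if most of the 743 GPU allocations are full-length Float32 vectors:

    743 * 7040196 * sizeof(Float32) / 2^30   # ≈ 19.49 GiB, close to the reported 19.486 GiB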

However, the vectors from previous iterations that are discarded should be cleaned up by the GC, so the memory required at any given point in time should not exceed “memory of a single vector * krylovdim”. How high the instantaneous memory usage of Julia actually gets then depends on how aggressively the GC cleans up those old vectors.
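If the number reported by nvidia-smi stays high after eigsolve returns, one thing you could try (just a suggestion, not something eigsolve does for you) is to force a collection and ask CUDA.jl to release its cached device memory:

    GC.gc()          # finalize unreachable CuArrays
    CUDA.reclaim()   # return cached device memory from CUDA.jl's pool to the driver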

I am not sure if this answers your question?

PS: The possibility to recycle the vectors (which requires the linear operator to be specified as a two-argument function whose first argument is the output vector into which the result is written) is on the to-do list. I hope to get to it somewhere in the near future.


That does answer my question.