What is the recommended way of diagonalizing linear operators on CuArrays using KrylovKit? I am targeting the first 100 eigenvalues of a large matrix - Operating directly on CuArrays leads to a very large allocation on the GPU. I do this the following way:
    function gpu_wrapper!(y_d::CuArray{Tcalc}, x_d::CuArray{Tcalc})
        CUDA.@sync begin
            CUDA.@cuda threads=num_threads blocks=num_blocks two_body_operator_action_gpu!(y_d, x_d, creation_ops_d, creation_op_phase_helpers_d, annhilation_ops_d, annhilation_phase_helpers_d, coulomb_matrix_elements_d, lll_basis_d)
        end
        return
    end
I wrap this function using LinearMaps and pass it to Krylovkit’s eigensolve with an initial guess provided as a CuArray. (I set the krylodim to 110) - The memory usage on the GPU went beyond about 36GB before I shut the process down. However, if I do something like below:
    function gpu_wrapper!(y::Vector{Tcalc}, x::Vector{Tcalc})
        x_d = CuArray{Tcalc}(undef, size(lll_basis,1))
        y_d = CuArray{Tcalc}(undef, size(lll_basis,1))
        copy!(x_d, x)
        CUDA.@sync begin
            CUDA.@cuda threads=num_threads blocks=num_blocks two_body_operator_action_gpu!(y_d, x_d, creation_ops_d, creation_op_phase_helpers_d, annhilation_ops_d, annhilation_phase_helpers_d, coulomb_matrix_elements_d, lll_basis_d)
        end
        copy!(y, y_d)
        return
    end
The memory usage on the GPU is fairly contained ~ its about 2.5-3.5 GB.
The only problem is, as you may expect, this implementation is slower than the earlier one - for the exact operator I am considering, the time spent copying the array to the device is about 1/4 the total time.