CUDA.jl - Memory Efficient Operations, Manipulations and Calculations on Large Sparse Arrays

Hello all,

I formerly created this topic, my problem is continuing until this day:

The problem is related to GPU RAM and its size. The global system is a square matrix, consisting of Float64 values. However, it is mostly sparse, at least %90 of the values are zero values. If I increase the simulation size, the degree of freedom of the system increases. The degree of freedom determines the global system matrix size. At some point, I instantly get a memory error. The GPU tells me that it is not possible to allocate more memory, it is already using %100 of the memory, which is a reasonable error. I am trying to solve this memory error. Currently, I am limited to a certain simulation size due to memory constraints.

I wanted to use a sparse array type. However, atomic addition to sparse arrays is not supported, I already created a topic about this. My question is that how can I overcome this issue? I could not find a way to solve this problem. CUDA.jl provided me with over 7000x speed-up with the program, the only missing part is related to the simulation size.

  1. I do not want to store all the values, my global system is mostly sparse. Which array type can I use? Is there another library compatible with CUDA threads to store giant matrices, etc.?
  2. Recording the data to a TXT file is a solution but as far as I understand this does not seem possible. Is @cuprintln(filename, “text”) supported like in the original println function? Also, this would disrupt the speed-up process but I can sacrifice some speed to be able to run larger simulations. Is there a way to output thread solutions to HDD or SSD to construct sparse arrays?
  3. Or any other solution that I cannot think of.

7000x speed-up is not a crucial thing, I already had a reasonable speed-up. The remaining problem is about GPU hardware, storage and simulation size. It is more important for me to run larger simulations than having a lightning-speed simulation. Running any size simulation with 1000x speed-up is much more valuable than restricted 7000x speed-up. Is there an intelligent way to store this kind of giant matrix minimally on GPU RAM or any other computer hardware? Thank you so much.

The problem isn’t that atomic addition isn’t supported, it’s that you wrote a kernel which expects to be able to index a sparse array directly. That’s not how sparse arrays work, they need to be ‘iterated’, where (depending on the sparse format) each thread will iterate a set of indices it will then process.

I’ve recently implemented broadcast for sparse arrays, see CUDA.jl/broadcast.jl at master · JuliaGPU/CUDA.jl · GitHub. Either you can try re-use the broadcast abstraction, or write your own kernel that processes the sparse array the way you want. In that case, you can probably reuse some of the existing infrastructure (like the CSR/CSC iterators), although they will probably need to be generalized a bit.

Sorry for the off-topic, but would it be possible to implement broadcast for CuDeviceArrays? It will allow to call from CUDA kernels external functions with broadcasting written for usual arrays.

Sorry I could not understand how I should use the iterators or broadcasting on sparse arrays.
How should I modify this code for it to work? I specifically need this type of code because it resembles the original assembly kernel although the original code is much more complex with dynamic parallelism.

# Import related libraries
using CUDA
using StaticArrays

# Assembly kernel
function gpu_arbitrary_assemblage(global_mat, indices_coor, indices_contribution)
    tidx = (blockIdx().x - 1) * blockDim().x + threadIdx().x #calculate thread ID number
    if tidx <= 10
        CUDA.@atomic global_mat[indices_coor[1, tidx], indices_coor[2, tidx]] += indices_contribution[tidx] # perform atomic addition to many array locations
    return nothing

global_mat = CUSPARSE.CuSparseMatrixCSC(CUDA.fill(0.0f0, (6, 6)))

# the indices, information for add which value to "where".
indices_coor = SMatrix{2, 10, Int32}([3 3 1 1 4 4 4 2 5 5;
                                      5 5 2 2 4 4 6 3 2 2])

indices_contribution = cu([0.6, 0.6, 0.7, 0.7, 0.8, 0.8, 1, 1, 0.9, 0.9]) #random float numbers to be added on global array

# display the initial global array

# check the type of the array

#cuda kernel
@cuda threads = (32, 1, 1) gpu_arbitrary_assemblage(global_mat, indices_coor, indices_contribution);

#show the result

Is it something possible? Thank you.

Not easily, that would be hard (e.g. can’t allocate an output container) and deceptive (because it would iterate on a single thread). Use StaticArrays if you want to broadcast in a kernel.

Sorry, I don’t have the time to adapt your code. Either way, there’s no trivial solution, because you can’t index sparse arrays without paying a huge performance penalty (the same applies on the CPU). Look into expressing this as a broadcast operation (which you call from the host), or look at how broadcast on sparse arrays is implemented and use those iterators from within a custom kernel.

Okay, I thought it was a work of few lines of code. If there is not a trivial solution, I guess I have to change the solution perspective from root. Thank you.