CUDA.jl - Variable Sized Local Arrays Inside CUDA Kernel

Hello all,

I am trying to parallelize a computational mechanics code and have run into a problem I cannot solve. I want to allocate a local array for every thread, but the sizes of these arrays differ from thread to thread. For example, thread #1 may work with an array of size (4, 16) while thread #16 works with (4, 26). Here the "4" is not arbitrary: it is the square of 2, the dimension of the problem, and it is also a variable. Below is a simple code example that demonstrates this hypothetical situation. The kernel may be wrong and it does not work, but it conveys the message.

clearconsole()

# Import related libraries
using CUDA
using StaticArrays


# Random array kernel
function gpu_arbitrary_assemblage(sum_vector, sizes_of_arrays)
    tidx = (blockIdx().x - 1) * blockDim().x + threadIdx().x #calculate thread ID number
    PD = 2 # this value is an acronym for problem dimension. It can only be 2 or 3.
    if tidx <= 10   

        # Get how many columns to allocate. In the real program this information
        # comes from actual data, i.e. from size(matrix, 2).
        # I don't know which integer type to use here; I am looking for suggestions.
        columns = UInt32(sizes_of_arrays[tidx])

        
        # this is the line that errors (see the jl_f_apply_type error below)
        random_array_for_thread = MMatrix{PD*PD, columns, Float32}(undef)

        #=
        It really does not matter how I initiate the variable sized arrays
        I could use something like this (please read below paragraph)
        random_array_for_thread = @MMatrix zeros(PD*PD, columns)
        =#
        
        ############################################
        #     some operations on arrays etc.
        #     calculations with long lines
        ############################################

        #=
        This operation is totally arbitrary. For example, we sum all the values
        inside the array and write the result into a result vector.
        =#
        sum_vector[tidx] = sum(random_array_for_thread) 
    end
    return nothing
end

# random sizes, these values are column values.
sizes_of_arrays = cu([6, 8, 5, 4, 7, 2, 5, 4, 9, 8, 7])

# this is our result vector
sum_vector = CUDA.fill(0.0f0, (10))

display(sum_vector)
@cuda threads = (32, 1, 1) gpu_arbitrary_assemblage(sum_vector, sizes_of_arrays)
display(sum_vector)

If I try to allocate thread-local memory whose size depends on a variable, I get an error: "Reason: unsupported call to an unknown function (call to jl_f_apply_type)".
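(For anyone hitting the same error: my understanding is that the dimensions of an `MMatrix` are type parameters, so constructing one from a runtime value like `columns` means building a new type at run time. That is the `jl_f_apply_type` call, which the GPU compiler cannot handle. A minimal CPU-side sketch of the distinction:)

```julia
using StaticArrays

# The size of an MMatrix is part of its type, so it must be a compile-time constant:
a = MMatrix{4, 6, Float32}(undef)        # OK: 4 and 6 are literal constants

cols = 6                                 # a runtime value
# b = MMatrix{4, cols, Float32}(undef)   # this builds a new type at run time
#   (the jl_f_apply_type call that the GPU compiler rejects)
```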

Actually, I learnt to use MMatrix and MVector from the StaticArrays library on this forum. It is important for me that the arrays are mutable. If I wrote this code in CUDA Python, I would use cuda.local.array((4, 9), float32) to create the arrays; however, variable-sized arrays are not possible in CUDA Python either. I hope this is not the case in CUDA.jl. I am also looking for a better way to create arrays, maybe something like CUDA.CuLocalArray(), but that does not exist.

My question is: is it possible to create variable-sized (dynamically sized) arrays for every thread? Because they are all different from each other, I cannot pass one fixed size as a kernel parameter; that would be incredibly inefficient in my program. I need something specific to each particular thread, not the same thing for all threads. If anybody can help, I would be so glad. Thank you.

You will need to create the variable-sized arrays prior to launching your kernel, and pass them in as a parameter.
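One common way to do this (a sketch only, untested here since it needs a GPU; the flat-buffer layout and offset arithmetic are my own choices, not something from your code) is to pack all per-thread matrices into a single flat `CuArray`, together with an array of per-thread column counts and element offsets, and index into the thread's slice inside the kernel:

```julia
using CUDA

PD = 2                                                     # problem dimension
sizes_of_arrays = [6, 8, 5, 4, 7, 2, 5, 4, 9, 8]           # columns per thread
# starting element offset of each thread's block in the flat buffer
offsets = cumsum(vcat(0, sizes_of_arrays[1:end-1])) .* PD^2

d_sizes   = cu(sizes_of_arrays)
d_offsets = cu(offsets)
d_storage = CUDA.zeros(Float32, sum(sizes_of_arrays) * PD^2)  # one big buffer
sum_vector = CUDA.fill(0.0f0, 10)

function gpu_arbitrary_assemblage(sum_vector, storage, sizes, offsets, PD)
    tidx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if tidx <= length(sum_vector)
        cols = sizes[tidx]
        off  = offsets[tidx]
        acc  = 0.0f0
        # fill and reduce "this thread's" slice of the shared buffer
        for j in 1:(PD * PD * cols)
            storage[off + j] = Float32(j)   # arbitrary work, stands in for the real computation
            acc += storage[off + j]
        end
        sum_vector[tidx] = acc
    end
    return nothing
end

@cuda threads=32 gpu_arbitrary_assemblage(sum_vector, d_storage, d_sizes, d_offsets, PD)
```

Each thread still sees only its own (PD², cols) region, so there is no wasted padding, at the cost of one host-side `cumsum` to compute the offsets before launch.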

Thank you so much. If it is the only way, I will try to construct my code accordingly.