Multi-Threading with GPU

I'm attempting to create a program that gets the maximum number of threads available on the system GPU (CUDA) and runs a function per thread.
First the program gets the total number of available threads and adds a slight buffer zone of about 10%. Then a matrix (an image) is opened and walked through with the for loop shown.

for y in 1:height, x in 1:width
    # limit the number of threads
    # to the largest value the GPU can maintain
    @cuda threads = process(x,y)
end

What's the best method to avoid hitting the thread cap, and how do I query what that cap is? The device in question has compute capability 7.0, so threads per multiprocessor should be 2048.

In addition, these images are fairly large, which is where I was running into issues. The largest so far is 7000x7000 pixels.

If there is a better way to utilize the GPU while doing these small processes on each pixel, let me know. A solution with less CPU utilization and more GPU would be ideal.

Currently the program exits with a ‘killed’ error.

It's not totally clear from the description, but if your operation just works independently on each pixel in the image, you can do:

broadcast(cu(img), 1:height, (1:width)') do val, y, x
    # do anything with val, x, and y
    # or even any index arithmetic like img[x,y+1] etc.
    # in which case just make sure img is a const global or inside a function
end

and the threads are chosen for you automatically.
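For a concrete version of the broadcast approach, here is a minimal sketch. It assumes a small random Float32 matrix as a stand-in for a grayscale image, and the checkerboard darkening is a made-up per-pixel operation chosen only so that val, x, and y are all used.

```julia
using CUDA

# Stand-in for a grayscale image (assumption: Float32 intensities)
img = rand(Float32, 4, 6)
height, width = size(img)

# A single broadcast compiles to one GPU kernel; CUDA.jl picks the
# thread and block counts itself, so there is no cap to manage by hand.
out = broadcast(cu(img), 1:height, (1:width)') do val, y, x
    # hypothetical per-pixel operation: darken a checkerboard pattern
    iseven(x + y) ? val * 0.5f0 : val
end

result = Array(out)  # copy the result back to CPU memory
```

The row range 1:height broadcasts down the first dimension and the transposed column range (1:width)' along the second, so each call of the do-block sees one pixel value together with its y and x indices.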

The limit also depends on your kernel’s register and shared memory usage. It’s advised to use the occupancy API; search here for similar questions. If you want to continue down this path, you can query the device limit using attribute(dev, DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK), and the actual limit by compiling your kernel with @cuda launch=false and introspecting it using CUDA.maxthreads.
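Put together, those calls might look like the sketch below. The kernel invert_kernel! and the random 512x512 test array are hypothetical placeholders, not the original poster's code; the point is only to show where the device limit, the per-kernel limit, and the occupancy-based launch configuration come from.

```julia
using CUDA

# Hypothetical per-pixel kernel: each GPU thread handles one element.
function invert_kernel!(out, img)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(img)
        @inbounds out[i] = 1f0 - img[i]
    end
    return nothing
end

img = CUDA.rand(Float32, 512, 512)  # stand-in image already on the GPU
out = similar(img)

# Hardware limit on threads in a single block for this device
dev = CUDA.device()
block_limit = CUDA.attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)

# Compile without launching, then ask what this particular kernel allows
kernel = @cuda launch=false invert_kernel!(out, img)
kernel_limit = CUDA.maxthreads(kernel)

# Occupancy API: let the driver suggest a block size, then use
# enough blocks to cover every element
config = launch_configuration(kernel.fun)
threads = min(length(img), config.threads)
blocks = cld(length(img), threads)

CUDA.@sync kernel(out, img; threads, blocks)
```

Note that the cap applies to threads per block, not to the whole launch. A 7000x7000 image is covered by launching more blocks with the same kernel, so there is no need to loop over pixels on the CPU.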

Made some edits, but here is where I am at now.
I'm trying to find a way to run this stably without hitting the memory cap. I think I need some sort of wait function in place, but every version of that I have tried has failed.

using CUDA, Images, FileIO

# 2560 cores and 40 SMs (Nvidia RTX 2070S)
maximum_threads = dev_thread*30 # total of 40 blocks, only going to use 30

function graphical_processing(path_image::String)
    # loads a PNG, JPEG, etc.
    img = load(path_image)
end

function read_image(img)

    (height, width) = size(img)

    CUDA.@sync begin # should I add threads=dev_thread, blocks=value?
        for y in 1:height, x in 1:width
            # potentially an if statement with a wait could go here, so that
            # if all threads are in use the system doesn't get overloaded
            try
                CUDA.@async begin
                    read_pixel(img[y, x])
                end
            catch error
                # This error has never been reached; instead a "killed" message is posted
            end
        end
    end
end

function read_pixel(pixel)
    # Currently this does nothing with R, G, B, but later it will use them
    # to copy the values over to a new image.
    r = bitstring(red(pixel))
    g = bitstring(green(pixel))
    b = bitstring(blue(pixel))
    return nothing
end