I am attempting to create a program that gets the maximum number of threads available on the system's GPU (CUDA) and runs a function per thread.
First, the program gets the total number of available threads and adds a slight buffer zone of about 10%. Then a matrix (an image) is opened and traversed with the for loop shown.
for y in 1:height, x in 1:width
    # limit the number of threads
    # to the largest value the GPU can maintain
    @cuda threads = process(x,y)
end
What's the best method to avoid hitting the thread cap, and to query what that cap is? The device in question has a compute capability of 7.0, so threads per multiprocessor should be 2048.
In addition, these images are fairly large, which is where I was running into issues. The largest so far is 7000x7000 pixels.
If there is a better way to utilize the GPU while doing these small processes on each pixel, let me know. A solution with less CPU utilization and more GPU would be ideal.
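For a sense of scale, here is a rough sketch of the launch arithmetic for the largest image, assuming one thread per pixel. The value of 1024 threads per block is an assumption for illustration (it is the usual per-block limit on recent CUDA devices); the real value should be queried from the device.

```julia
# Assumed per-block limit; query the actual device attribute on real hardware.
threads_per_block = 1024
height, width = 7000, 7000

pixels = height * width
blocks = cld(pixels, threads_per_block)   # ceiling division: enough blocks to cover every pixel
spare = blocks * threads_per_block - pixels

println("pixels = ", pixels)
println("blocks = ", blocks)
println("unused threads in the last block = ", spare)
```

The point of the sketch is that the per-block limit does not cap the total number of threads in a launch: the pixel count is spread over many blocks, and threads that fall past the end of the image simply do nothing.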
It's not totally clear from the description, but if your operation just works independently on each pixel in the image, you can do:
broadcast(cu(img), 1:height, (1:width)') do val, y, x
    # do anything with val, x, and y
    # or even any index arithmetic like img[x,y+1] etc.
    # in which case just make sure img is a const global or inside a function
end
The limit also depends on your kernel’s register and shared memory usage. It’s advised to use the occupancy API; search here for similar questions. If you want to continue down this path, you can query the device limit using attribute(dev, DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK), and the actual limit for your kernel by compiling it with @cuda launch=false and introspecting the result with CUDA.maxthreads.
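As a minimal sketch of the occupancy-API route, something along these lines should work. The names pixel_kernel! and process_on_gpu are hypothetical, the kernel body is a placeholder that only copies values, and the image is assumed to have already been converted to a Float32 matrix; none of that comes from the original post.

```julia
using CUDA

# Hypothetical per-pixel kernel: each thread handles one element.
# The body is a placeholder copy; real per-pixel work would go here.
function pixel_kernel!(dst, src)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(src)              # guard threads past the end of the image
        @inbounds dst[i] = src[i]
    end
    return nothing
end

# Assumes `img` is already a plain Float32 matrix (not an RGB image type).
function process_on_gpu(img::AbstractMatrix{Float32})
    src = CuArray(img)               # one host-to-device copy for the whole image
    dst = similar(src)

    # Compile without launching, then ask the occupancy API for a block size.
    kernel = @cuda launch=false pixel_kernel!(dst, src)
    config = launch_configuration(kernel.fun)
    threads = min(length(src), config.threads)
    blocks = cld(length(src), threads)

    # A single launch covers every pixel; wait for it to finish.
    CUDA.@sync kernel(dst, src; threads, blocks)
    return Array(dst)                # copy the result back to the host
end
```

With this shape there is one kernel launch for the whole image rather than one task per pixel, so the CPU side only does the upload, the launch, and the download. A call would look like process_on_gpu(rand(Float32, 7000, 7000)).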
I made some edits, but here is where I am now.
I'm trying to find a way to run this stably without hitting the memory cap. I think I need some sort of wait function in place, but every version of that I have tried has failed.
using CUDA, Images, FileIO

# 2560 cores and 40 SMs (Nvidia RTX 2070S)
dev_thread = CUDA.attribute(CUDA.device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)
maximum_threads = dev_thread*30 # total of 40 blocks, only going to use 30

function graphical_processing(path_image::String)
    # loads a PNG, JPEG, etc.
    img = load(path_image)
    read_image(img)
end

function read_image(img)
    (height, width) = size(img)
    CUDA.@sync begin # should I add threads=dev_thread, blocks=value?
        for y in 1:height, x in 1:width
            # potentially an if statement with a wait could go here, so that if all threads are in use the system doesn't get overloaded
            try
                CUDA.@async begin
                    read_pixel(img[y, x])
                    CUDA.reclaim()
                end
            catch error
                # This branch has never been reached; instead a "killed" message is printed
                println(CUDA.memory_status())
                println(error)
            end
        end
    end
end

function read_pixel(pixel)
    # Currently this does nothing with R, G, B, but later it will copy the values over to a new image.
    r = bitstring(red(pixel))
    g = bitstring(green(pixel))
    b = bitstring(blue(pixel))
    return nothing
end