I have a small GPU program (finite element assembly of a heat problem) written using a combination of CUDA.jl and KernelAbstractions.jl. The main kernel will be called repeatedly, so I want to preallocate the necessary buffers. Moving all data to the GPU on every kernel call is not an option: there is a significant amount of data to move around, and I want these repeated calls to be as fast as possible. To allocate buffers of the correct size, in a first attempt I have defined a helper struct
```julia
struct TaskDescriptor
    device
    num_workers
end
```
where `num_workers` is currently the number of GPU threads (i.e. the launch uses `@cuda threads=num_workers f(...)`) and `device` is the KernelAbstractions device. Assume for now that the kernel simply fills `num_workers` square matrices and is called repeatedly; I have not yet included blocks, to get things working in a first step. My questions are:
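To make the setup concrete, here is a minimal sketch of what I mean (using the `CUDADevice`-style KernelAbstractions API; `fill_matrices!` and the 3D buffer layout are just placeholders for my actual assembly code):

```julia
using CUDA, KernelAbstractions

# Placeholder kernel: worker i fills its own n×n matrix slice.
@kernel function fill_matrices!(buffer)
    i = @index(Global, Linear)
    n = size(buffer, 1)
    for r in 1:n, c in 1:n
        buffer[r, c, i] = r + c  # dummy payload instead of the element assembly
    end
end

# Preallocate once ...
num_workers = 256
n = 8
buffer = CUDA.zeros(Float32, n, n, num_workers)
kernel! = fill_matrices!(CUDADevice(), 64)  # workgroup size chosen ad hoc

# ... then launch repeatedly without moving data to or from the device.
for step in 1:100
    wait(kernel!(buffer; ndrange = num_workers))
end
```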
- For this repeated kernel call, what is the recommended way to find a good number of threads and blocks in advance (and hence the stride for grid-stride loops)?
- Are there any design recommendations or examples for the setup logic of the GPU buffers? Let me elaborate on a specific example. Say I have a struct with a CPU constructor `MyCache(X)`, which is used in all tutorials to get the CPU caches up and running. Right now I add a dispatch `MyCache(TaskDescriptor(CUDADevice(), num_workers), X)`, which no longer returns a `MyCache` but a `MyGPUCache`. I do not really like that design, though, because a constructor that returns a different type feels confusing. I really want to get this design right, because once I introduce the pattern, downstream users will very likely pick it up and start doing similar things for other caches.
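Regarding the first question: in plain CUDA.jl I would query the occupancy API and derive the grid-stride parameters from the suggested configuration, roughly as below (the kernel is a dummy stand-in for my assembly kernel); I am unsure whether this is also the recommended approach for a kernel that is relaunched many times with fixed buffers:

```julia
using CUDA

# Dummy grid-stride kernel standing in for the assembly kernel.
function square!(y, x)
    stride = gridDim().x * blockDim().x
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    while i <= length(x)
        @inbounds y[i] = x[i]^2
        i += stride
    end
    return nothing
end

x = CUDA.rand(10_000)
y = similar(x)

# Compile without launching, then ask the occupancy API for a configuration.
kernel = @cuda launch=false square!(y, x)
config = launch_configuration(kernel.fun)
threads = min(length(x), config.threads)
blocks = cld(length(x), threads)
kernel(y, x; threads, blocks)
```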
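In code, the pattern from the second question currently looks roughly like this (all names are illustrative sketches of my setup, not from a real package):

```julia
using CUDA

struct TaskDescriptor{D}
    device::D
    num_workers::Int
end

# CPU cache; MyCache(X) comes from the default constructor.
struct MyCache{T}
    data::Matrix{T}
end

# GPU cache: one n×n slice per worker.
struct MyGPUCache{T}
    data::CuArray{T, 3}
end

# The dispatch I am unsure about: same constructor name, different return type.
function MyCache(task::TaskDescriptor, X)
    n = size(X, 1)
    return MyGPUCache(CUDA.zeros(eltype(X), n, n, task.num_workers))
end
```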