Memory Scratch Allocation Strategy Recommendations

I have a small GPU program (finite element assembly of a heat problem) written with a combination of CUDA and KernelAbstractions. The main kernel will be called repeatedly, so I want to preallocate the necessary buffers. Moving all data to the GPU on every kernel call is not an option: there is a significant amount of data to move around, and I want these repeated calls to be as fast as possible. To allocate buffers of the correct size, my first attempt defines a helper struct

```julia
struct TaskDescriptor
    device
    num_workers
end
```

where num_workers is currently the number of GPU threads (i.e. the launch is @cuda threads=num_workers f(...)) and device is the KernelAbstractions device. Let us assume that the kernel simply fills num_workers square matrices and is called repeatedly. To get things working in a first step, I have not yet included blocks in the launch configuration. My questions are:
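For context, here is a minimal sketch of what such a fill kernel could look like in plain CUDA.jl, assuming the num_workers square matrices are stored as one 3-D array (the names fill_matrices!, A, and val are illustrative, not from my actual code):

```julia
using CUDA

# Hypothetical kernel: worker i fills the i-th n×n matrix (slice A[:, :, i])
# with a constant. The grid-stride loop makes the kernel correct for any
# launch configuration, not just threads == num_workers.
function fill_matrices!(A, val)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    while i <= size(A, 3)
        for c in axes(A, 2), r in axes(A, 1)
            @inbounds A[r, c, i] = val
        end
        i += stride
    end
    return nothing
end

# Preallocated scratch buffer: num_workers matrices of size n×n.
n, num_workers = 4, 1024
A = CUDA.zeros(Float32, n, n, num_workers)

# Repeated calls reuse the same buffer; no allocation per launch.
@cuda threads=256 blocks=cld(num_workers, 256) fill_matrices!(A, 1f0)
```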

  1. For this repeated kernel call, what is the recommended way to find a good number of threads and blocks in advance (and hence the stride for grid-stride loops)?
  2. Are there any design recommendations or examples for the setup logic of the GPU buffers? Let me elaborate on a specific example. Say I have a struct with a CPU constructor MyCache(X), which is used in all the tutorials to get the CPU caches up and running. Right now I add a dispatch MyCache(TaskDescriptor(CudaDevice(), num_workers), X) which no longer returns a MyCache but a MyGPUCache. I do not really like that design, though, because a constructor that returns a different type is confusing. I really want to get this design right, because once I introduce the pattern, downstream users will very likely pick it up and do similar things for other caches.
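Regarding question 2, two common alternatives to the TaskDescriptor wrapper are dispatching on the KernelAbstractions backend directly, or building the cache on the CPU and converting it with Adapt.jl. A sketch, assuming MyCache holds a single scratch array (the field names and sizes are illustrative):

```julia
using CUDA, KernelAbstractions, Adapt

# Hypothetical cache from the tutorials: one n×n scratch matrix per worker,
# stored as an n×n×num_workers array. The array type carries the device
# information, so the same struct works on CPU and GPU.
struct MyCache{M}
    scratch::M
end
MyCache(X) = MyCache(zeros(eltype(X), size(X, 1), size(X, 1), Threads.nthreads()))

# Alternative A: dispatch on the backend; KernelAbstractions.zeros allocates
# on whatever device the backend represents (CPU(), CUDABackend(), ...).
function MyCache(backend::KernelAbstractions.Backend, X; num_workers)
    scratch = KernelAbstractions.zeros(backend, eltype(X),
                                       size(X, 1), size(X, 1), num_workers)
    return MyCache(scratch)
end

# Alternative B: construct on the CPU, then convert the whole struct.
# Downstream caches only need to be adapt-able, not define GPU constructors.
Adapt.@adapt_structure MyCache
X = rand(Float32, 8, 8)
gpu_cache = adapt(CuArray, MyCache(X))
```

Either way the constructor returns a MyCache whose type parameter reflects the device, which avoids the confusing MyCache-returns-MyGPUCache situation.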

See Performance Tips | AMDGPU.jl — the same advice applies to CUDA.jl, so you can replace the AMD-specific calls with their CUDA equivalents.
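For question 1 specifically, CUDA.jl also exposes an occupancy API: compile the kernel without launching it, ask for a suggested configuration, and cache the result alongside the preallocated buffers. A minimal sketch (kernel and buffer names are illustrative):

```julia
using CUDA

# Hypothetical grid-stride kernel; the point here is the occupancy query.
function fill!(A, val)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    while i <= length(A)
        @inbounds A[i] = val
        i += stride
    end
    return nothing
end

num_workers = 1024
A = CUDA.zeros(Float32, num_workers)

# Compile once without launching, then query an occupancy-based suggestion.
# Do this at setup time and reuse threads/blocks for every repeated call.
kernel  = @cuda launch=false fill!(A, 1f0)
config  = launch_configuration(kernel.fun)
threads = min(num_workers, config.threads)
blocks  = cld(num_workers, threads)
kernel(A, 1f0; threads, blocks)
```

Because the kernel uses a grid-stride loop, the suggested configuration stays valid even if num_workers changes between calls.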