Memory Scratch Allocation Strategy Recommendations

I have a small GPU program (finite element assembly of a heat problem) written with a combination of CUDA and KernelAbstractions. The main kernel will be called repeatedly, so I want to preallocate the necessary buffers. Moving all data to the GPU on every kernel call is not an option: there is a significant amount of data to move around, and I want these repeated calls to be as fast as possible. To allocate buffers of the correct size, my first attempt defines a helper struct

```julia
struct TaskDescriptor
    device
    num_workers
end
```

where num_workers is currently the number of GPU threads (i.e. the launch is @cuda threads=num_workers f(...)) and device is the KernelAbstractions device. Let us assume that the kernel simply fills num_workers square matrices and is called repeatedly. To get things working in a first step, I have not yet included blocks in the launch configuration. My questions are:
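For context, here is a minimal sketch of what such a fill kernel could look like in plain CUDA.jl, assuming the num_workers square matrices are stored as one 3-D array (the names fill_matrices!, A, and val are illustrative, not from my actual code):

```julia
using CUDA

# Hypothetical kernel: worker i fills the i-th n×n matrix (slice A[:, :, i])
# with a constant. The grid-stride loop makes the kernel correct for any
# launch configuration, not just threads == num_workers.
function fill_matrices!(A, val)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    while i <= size(A, 3)
        for c in axes(A, 2), r in axes(A, 1)
            @inbounds A[r, c, i] = val
        end
        i += stride
    end
    return nothing
end

# Preallocated scratch buffer: num_workers matrices of size n×n.
n, num_workers = 4, 1024
A = CUDA.zeros(Float32, n, n, num_workers)

# Repeated calls reuse the same buffer; no allocation per launch.
@cuda threads=256 blocks=cld(num_workers, 256) fill_matrices!(A, 1f0)
```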

  1. For this repeated kernel call, what is the recommended way to find a good number of threads and blocks in advance (and hence the stride for grid-stride loops)?
  2. Are there any design recommendations or examples for the setup logic of the GPU buffers? Let me elaborate on a specific example. Say I have a struct with a CPU constructor MyCache(X), which is used in all the tutorials to get the CPU caches up and running. Right now I add a dispatch MyCache(TaskDescriptor(CudaDevice(), num_workers), X) which no longer returns a MyCache but a MyGPUCache. I do not really like that design, though, because a constructor that returns a different type is confusing. I really want to get this design right, because once I introduce the pattern, downstream users will very likely pick it up and do similar things for other caches.
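Regarding question 2, two common alternatives to the TaskDescriptor wrapper are dispatching on the KernelAbstractions backend directly, or building the cache on the CPU and converting it with Adapt.jl. A sketch, assuming MyCache holds a single scratch array (the field names and sizes are illustrative):

```julia
using CUDA, KernelAbstractions, Adapt

# Hypothetical cache from the tutorials: one n×n scratch matrix per worker,
# stored as an n×n×num_workers array. The array type carries the device
# information, so the same struct works on CPU and GPU.
struct MyCache{M}
    scratch::M
end
MyCache(X) = MyCache(zeros(eltype(X), size(X, 1), size(X, 1), Threads.nthreads()))

# Alternative A: dispatch on the backend; KernelAbstractions.zeros allocates
# on whatever device the backend represents (CPU(), CUDABackend(), ...).
function MyCache(backend::KernelAbstractions.Backend, X; num_workers)
    scratch = KernelAbstractions.zeros(backend, eltype(X),
                                       size(X, 1), size(X, 1), num_workers)
    return MyCache(scratch)
end

# Alternative B: construct on the CPU, then convert the whole struct.
# Downstream caches only need to be adapt-able, not define GPU constructors.
Adapt.@adapt_structure MyCache
X = rand(Float32, 8, 8)
gpu_cache = adapt(CuArray, MyCache(X))
```

Either way the constructor returns a MyCache whose type parameter reflects the device, which avoids the confusing MyCache-returns-MyGPUCache situation.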

See Performance Tips | AMDGPU.jl — the same advice applies to CUDA.jl, so you can replace the AMD-specific calls with their CUDA equivalents.
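For question 1 specifically, CUDA.jl also exposes an occupancy API: compile the kernel without launching it, ask for a suggested configuration, and cache the result alongside the preallocated buffers. A minimal sketch (kernel and buffer names are illustrative):

```julia
using CUDA

# Hypothetical grid-stride kernel; the point here is the occupancy query.
function fill!(A, val)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    while i <= length(A)
        @inbounds A[i] = val
        i += stride
    end
    return nothing
end

num_workers = 1024
A = CUDA.zeros(Float32, num_workers)

# Compile once without launching, then query an occupancy-based suggestion.
# Do this at setup time and reuse threads/blocks for every repeated call.
kernel  = @cuda launch=false fill!(A, 1f0)
config  = launch_configuration(kernel.fun)
threads = min(num_workers, config.threads)
blocks  = cld(num_workers, threads)
kernel(A, 1f0; threads, blocks)
```

Because the kernel uses a grid-stride loop, the suggested configuration stays valid even if num_workers changes between calls.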