I’m somewhat new to writing GPU code, but I am working on a fairly complicated kernel function with several steps that use shared memory to speed up the computations.
To be efficient, I reuse the preallocated shared memory in the different parts of the code. In some parts, however, it would be beneficial for the shared memory to be bound to a single large array, e.g.
# case 1
shared = @cuDynamicSharedMem(T, 2 * size)
In other parts it would be convenient if the shared memory were partitioned into separate variables, e.g.
# case 2
shA = @cuDynamicSharedMem(T, size)
shB = @cuDynamicSharedMem(T, size, sizeof(shA))
So I had the following idea: allocate as in case 1, but then create some auxiliary variables which point into that memory. My thought was to do something like this:
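struct Partition{N,T}
    S::T    # underlying shared-memory array
    i::Int  # zero-based index of this partition
    # N comes first so that Partition{size}(shared, 0) fixes N and infers T
    Partition{N}(S::T, i::Int) where {N,T} = new{N,T}(S, i)
end
# element i of partition P sits at offset P.i * N in the underlying array
Base.getindex(P::Partition{N}, i::Int) where {N} = P.S[P.i * N + i]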
and use it like:
# e.g. size = 2^10
shared = @cuDynamicSharedMem(T, 2 * size)
shA = Partition{size}(shared, 0)
shB = Partition{size}(shared, 1)
# shA[i] == shared[i] and shB[i] == shared[size + i]
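The intended index mapping can be checked on the CPU without a GPU. This is only a minimal sketch, assuming a plain `Vector` stands in for the shared buffer. The `where` clauses, the `{N,T}` parameter order, the `setindex!` method, and the name `n` are additions for illustration, and nothing here says anything about how it behaves inside a kernel.

```julia
# CPU-only sketch: a plain Vector stands in for the dynamic shared memory.
struct Partition{N,T}
    S::T    # backing array
    i::Int  # zero-based partition index
    Partition{N}(S::T, i::Int) where {N,T} = new{N,T}(S, i)
end

# Partition P covers backing indices P.i * N + 1 through (P.i + 1) * N.
Base.getindex(P::Partition{N}, i::Int) where {N} = P.S[P.i * N + i]
# Assumption: writes should also go through to the backing array.
Base.setindex!(P::Partition{N}, v, i::Int) where {N} = (P.S[P.i * N + i] = v)

n = 4                        # hypothetical partition length
shared = collect(1.0:2n)     # stand-in for the shared buffer, length 2n
shA = Partition{n}(shared, 0)
shB = Partition{n}(shared, 1)

shB[2] = -1.0                # writes through to shared[n + 2]
```

Here `shA[3]` reads `shared[3]` and `shB[1]` reads `shared[n + 1]`, which is the mapping I am after.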
My questions are now:
- Is this idea reasonable?
- Is there a better way to do something like this, perhaps something built-in?
- Is there some performance aspect I’m neglecting where I might be stepping on my own toes by doing this?