Suppose, I have a multidimensional array and I want to perform some operation along one of its axes. A sample generic code can look like this:
s = (1, 2, 3, 4, 5) # shape of the array
dim = 3 # dimension, along which I want to perform the operation
A = zeros(s)
spre = s[1:dim-1]
sdim = s[dim]
spost = s[dim+1:end]
Cpre = CartesianIndices(spre)
Cpost = CartesianIndices(spost)
for Ipost in Cpost
for Ipre in Cpre
for i=1:sdim
A[Ipre, i, Ipost] = i
end
end
end
Next, I would like to launch these for
loops on GPU using CUDA. To do it in a most efficient way I have to merge the two outer loops into one with an index through which I can iterate with strides. For that reason I need to merge the two Cartesian indices Cpre
and Cpost
into one set of indices. On CPU I came to the following solution:
Cprepost = CartesianIndices((spre..., spost...))
for I in Cprepost
Ipre = Tuple(I)[1:dim-1]
Ipost = Tuple(I)[dim:end]
for i=1:sdim
A[Ipre..., i, Ipost...] = i
end
end
By analogy, I would expect that the kernel for GPU will look like
using CUDA
function kernel(A, dim, Cprepost, sdim)
id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = blockDim().x * gridDim().x
for I in Cprepost[id:stride:end]
Ipre = Tuple(I)[1:dim-1]
Ipost = Tuple(I)[dim:end]
for i=1:sdim
A[Ipre..., i, Ipost...] = i
end
end
return nothing
end
Agpu = CUDA.zeros(s)
@cuda threads=prod(spre)*prod(spost) kernel(Agpu, dim, Cprepost, sdim)
However, this code causes “unsupported call through a literal pointer” error.
The only kernel which so far works for me is the following one:
function kernel(A, dim, Cprepost, sdim)
id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = blockDim().x * gridDim().x
dim = 3
for k=id:stride:length(Cprepost)
Ipre = Tuple(Cprepost[k])[1:dim-1]
Ipost = Tuple(Cprepost[k])[dim:end]
for i=1:sdim
A[Ipre..., i, Ipost...] = i
end
end
return nothing
end
@cuda threads=prod(spre)*prod(spost) kernel(Agpu, dim, Cprepost, sdim)
However, here I have to explicitly define the dim
variable within the kernel, because otherwise I obtain the same “unsupported call through a literal pointer” error.
Can you please help me to write the corresponding CUDA kernel.
Thank you.