Hi all,
I am taking my first steps with CUDA.jl and have a question about memory layout. I have a simple immutable struct, for example:
struct Angles
x::SVector{3, Float64}
end
I have operations like
function add!(r::Angles, x::Angles, y::Angles)
for k in 1:3
r.x[k] = x.x[k] + y.x[k]
end
end
return nothing
end
function myexp!(r::Angles, x::Angles)
r.x[1] = sin(x.x[1])
r.x[2] = cos(x.x[2])
r.x[3] = exp(x.x[3])
return nothing
end
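Since Angles (and SVector) are immutable, I am not sure the in-place style above can even work, so I also considered out-of-place versions that return a new Angles instead. Here is a dependency-free sketch of that idea; I swapped the SVector field for a plain NTuple{3,Float64} only so it runs without StaticArrays:

```julia
# Sketch: same idea as the SVector version, but with a plain tuple
# field so the snippet has no package dependencies.
struct Angles
    x::NTuple{3,Float64}
end

# Out-of-place addition: broadcast over the component tuples and
# build a fresh Angles rather than mutating.
add(a::Angles, b::Angles) = Angles(a.x .+ b.x)

# Out-of-place elementwise transform, mirroring myexp! above.
myexp(a::Angles) = Angles((sin(a.x[1]), cos(a.x[2]), exp(a.x[3])))
```

I do not know which style is preferable for GPU kernels, which is part of my question.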
And finally I would like to perform these operations in parallel on the GPU:
r_d = CuArray{Angles}(undef, 1000)
function kernel_operation!(r, x, y)
index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = blockDim().x * gridDim().x
tmp = Angles(SVector(0.0, 0.0, 0.0))
for i = index:stride:length(y)
myexp!(tmp, x[i])
add!(r[i], tmp, y[i])
end
return nothing
end
What worries me (I am new to CUDA.jl) is that this would be very inefficient in, for example, C: you want the dimension of the array you parallelize over to be contiguous in memory, whereas this layout makes the components of each Angles contiguous instead. I would also like to make sure that the routines myexp! and add! are inlined inside the loop.
What are the canonical ways to approach these problems in CUDA.jl? Is there an easy way to create a structure of arrays (SoA) while still being able to define these elementary add! and myexp! routines and have them inlined?
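To make the layout question concrete, this is the kind of hand-rolled SoA I have in mind, shown here as a plain CPU sketch with one vector per component (all names are placeholders, and the plain loop stands in for the grid-stride loop in the kernel):

```julia
# Structure of arrays: one contiguous vector per component, so that
# parallelizing over the element index i touches contiguous memory.
struct AnglesSoA
    x1::Vector{Float64}
    x2::Vector{Float64}
    x3::Vector{Float64}
end

AnglesSoA(n::Int) = AnglesSoA(zeros(n), zeros(n), zeros(n))

# Fused elementwise body for one index; @inline to encourage
# inlining inside the loop, as asked above.
@inline function myexp_add!(r::AnglesSoA, x::AnglesSoA, y::AnglesSoA, i::Int)
    r.x1[i] = sin(x.x1[i]) + y.x1[i]
    r.x2[i] = cos(x.x2[i]) + y.x2[i]
    r.x3[i] = exp(x.x3[i]) + y.x3[i]
    return nothing
end

# Plain CPU loop standing in for the GPU grid-stride loop.
function run!(r::AnglesSoA, x::AnglesSoA, y::AnglesSoA)
    for i in 1:length(r.x1)
        myexp_add!(r, x, y, i)
    end
    return r
end
```

But writing the fused myexp_add! by hand loses the composable add! / myexp! building blocks, which is exactly what I would like to keep.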
Thanks