I tried executing the following function on the GPU, for arrays of length 10,000.
```julia
function update!(s, c, cedg, θ)
    # Grid-stride loop: each thread starts at its global index and then
    # steps by the total number of threads in the grid
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    @inbounds for l = index:stride:length(s)
        s[l] = cedg[l] * CUDAnative.sin(θ[l])
        c[l] = cedg[l] * CUDAnative.cos(θ[l])
    end
    return nothing  # kernels must not return a value
end
```
However, the performance is similar to that of the same code run on the CPU. Is there something wrong with the way it is written?
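For reference, a typical explicit launch of a grid-stride kernel like this looks roughly as follows; the data setup and the 256-thread configuration are illustrative assumptions, not the original launch code. Note that a bare `@cuda update!(s, c, cedg, θ)` launches a single thread in a single block, so the grid-stride loop would process the entire array on one GPU thread, which can easily produce CPU-like timings:

```julia
using CUDAnative, CuArrays

# Illustrative data matching the sizes mentioned above
N = 10_000
θ, cedg = cu(rand(N)), cu(rand(N))
s, c = cu(zeros(N)), cu(zeros(N))

# Explicit launch configuration: spread the work over many threads
threads = 256
blocks = cld(N, threads)
@cuda threads=threads blocks=blocks update!(s, c, cedg, θ)
```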
@baggepinnen: Thank you, this worked. I was trying to follow the tutorial on GPU programming using CuArrays and was trying to fit my functions to the ones in the example. One clarification: do I not have to specify the number of threads and blocks, or even use @cuda, for it to be executed on the GPU?
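For completeness, I assume the broadcast style from the CuArrays tutorial is what ended up working; a minimal sketch follows, in which the function name `update_broadcast!` is hypothetical:

```julia
using CuArrays, CUDAnative

# Broadcasting over CuArrays compiles to a GPU kernel automatically;
# CuArrays chooses the launch configuration itself, so no @cuda call
# and no manual threads/blocks choice is required.
function update_broadcast!(s, c, cedg, θ)
    s .= cedg .* CUDAnative.sin.(θ)
    c .= cedg .* CUDAnative.cos.(θ)
    return nothing
end
```

Explicit `@cuda threads=... blocks=...` launches are only needed for hand-written kernels such as the original `update!`.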