I tried executing the following kernel on the GPU, for arrays of 10,000 elements.
```julia
function update!(s, c, cedg, θ)
    index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    @inbounds for l = index:stride:length(s)
        s[l] = cedg[l] * CUDAnative.sin(θ[l])
        c[l] = cedg[l] * CUDAnative.cos(θ[l])
    end
end
```
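In case it matters, I launch it roughly like this (a sketch of my setup, not my exact script; the thread/block counts and the `Float32` test data are illustrative):

```julia
using CUDAdrv, CUDAnative, CuArrays

N = 10_000
θ    = CuArray(rand(Float32, N))   # illustrative test data
cedg = CuArray(rand(Float32, N))
s = similar(θ)
c = similar(θ)

# The kernel uses a grid-stride loop, so the launch just needs
# "enough" threads to cover the array; these numbers are illustrative.
threads = 256
blocks  = cld(N, threads)

@cuda threads=threads blocks=blocks update!(s, c, cedg, θ)
CUDAdrv.synchronize()   # wait for the kernel to finish before timing
```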
However, the performance is about the same as the equivalent code running on the CPU. Is there something wrong with the way the kernel is written?