Performance of kernel function

I tried executing the following function on the GPU, for arrays of length 10,000.

However, the performance is similar to that of the same code run on the CPU. Is there something wrong with the way it is written?

using CuArrays, BenchmarkTools   # cu / CuArray and @btime

s     = zeros(10000);
c     = zeros(10000);
cedg  = rand((0,1),10000) .* randn(10000);
θ     = randn(10000);
cs    = cu(s);
cc    = cu(c);
ccedg = cu(cedg);
cθ    = cu(θ);

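# Generic method: explicit scalar loop (body reconstructed to match the broadcast method below)
function update!(s,c,cedg,θ)
  @inbounds for l=eachindex(s)
    s[l] = cedg[l]*sin(θ[l])
    c[l] = cedg[l]*cos(θ[l])
  end
end

# CuArray method: each broadcast line is fused into a single GPU kernel
function update!(s::CuArray,c,cedg,θ)
  s .= cedg.*sin.(θ)
  c .= cedg.*cos.(θ)
end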

@btime update!($s,$c,$cedg,$θ);
@btime update!($cs,$cc,$ccedg,$cθ);

julia> @btime update!($s,$c,$cedg,$θ);
  132.067 μs (0 allocations: 0 bytes)

julia> @btime update!($cs,$cc,$ccedg,$cθ);
  10.649 μs (108 allocations: 4.38 KiB)

The vectors have to be long enough for the GPU to be worth it, though.
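To see where that break-even point lies, something like the following sweep over vector lengths could be used. It is only a sketch under a few assumptions: it uses CUDA.jl (the successor to CuArrays, which keeps the `cu` and `CuArray` names), the helper `compare` is a hypothetical name introduced here, and the sizes are arbitrary. GPU kernel launches are asynchronous, so the GPU timing is wrapped in `CUDA.@sync`; without it, the measurement may mostly reflect launch overhead rather than the computation.

```julia
using BenchmarkTools
using CUDA  # assumption: CUDA.jl instead of the older CuArrays package

# Broadcast version; the same code runs on Array and CuArray arguments.
function update!(s, c, cedg, θ)
    s .= cedg .* sin.(θ)
    c .= cedg .* cos.(θ)
    return nothing
end

# Hypothetical helper: time one update on the CPU and, if a GPU is usable, on the GPU.
function compare(n)
    cedg, θ = randn(Float32, n), randn(Float32, n)
    s, c = zeros(Float32, n), zeros(Float32, n)
    t_cpu = @belapsed update!($s, $c, $cedg, $θ)
    t_gpu = NaN  # stays NaN when no functional GPU is available
    if CUDA.functional()
        cs, cc, ccedg, cθ = cu(s), cu(c), cu(cedg), cu(θ)
        # CUDA.@sync blocks until the GPU work has finished.
        t_gpu = @belapsed CUDA.@sync update!($cs, $cc, $ccedg, $cθ)
    end
    return t_cpu, t_gpu
end

for n in (10^3, 10^4, 10^5, 10^6)
    t_cpu, t_gpu = compare(n)
    println("n = ", n, ": CPU ", round(t_cpu * 1e6, digits=1),
            " μs, GPU ", round(t_gpu * 1e6, digits=1), " μs")
end
```

The timings depend heavily on the hardware, so no expected output is given here.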

@baggepinnen: Thank you, this worked. I was following the tutorial on GPU programming with CuArrays and was trying to write my functions like the ones in its examples. One clarification: do I not have to specify the number of threads and blocks, or even use @cuda, for it to be executed on the GPU?

