Performance of kernel function

I tried executing the following kernel on the GPU, for arrays of length 10,000.

function update!(s, c, cedg, θ)
  # grid-stride loop: each thread handles every stride-th element
  index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
  stride = blockDim().x * gridDim().x
  @inbounds for l = index:stride:length(s)
    s[l] = cedg[l] * CUDAnative.sin(θ[l])
    c[l] = cedg[l] * CUDAnative.cos(θ[l])
  end
  return   # kernels must not return a value
end
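
For completeness, a kernel like this only runs on the GPU when launched through @cuda with a launch configuration; the thread and block counts in the sketch below are illustrative assumptions, not necessarily the exact configuration I used.

using CUDAnative, CuArrays

cs    = cu(zeros(10_000))
cc    = cu(zeros(10_000))
ccedg = cu(randn(10_000))
cθ    = cu(randn(10_000))

threads = 256                           # assumed threads per block
blocks  = cld(length(cs), threads)      # enough blocks to cover the array
@cuda threads=threads blocks=blocks update!(cs, cc, ccedg, cθ)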

However, the performance is similar to the code run on the CPU. Is there something wrong in the way it is written?

Sorry, I had posted a different version. I have now updated it with the one I use.

s     = zeros(10000);
c     = zeros(10000);
cedg  = rand((0,1),10000) .* randn(10000);
θ     = randn(10000);
cs    = cu(s);
cc    = cu(c);
ccedg = cu(cedg);
cθ    = cu(θ);

# generic method: plain loop, runs on the CPU
function update!(s, c, cedg, θ)
  @inbounds for l = eachindex(s)
    s[l] = cedg[l] * sin(θ[l])
    c[l] = cedg[l] * cos(θ[l])
  end
end

# CuArray method: fused broadcasts compile to GPU kernels, no explicit @cuda needed
function update!(s::CuArray, c, cedg, θ)
  s .= cedg .* sin.(θ)
  c .= cedg .* cos.(θ)
end

@btime update!($s,$c,$cedg,$θ);
@btime update!($cs,$cc,$ccedg,$cθ);

julia> @btime update!($s,$c,$cedg,$θ);
  132.067 μs (0 allocations: 0 bytes)

julia> @btime update!($cs,$cc,$ccedg,$cθ);
  10.649 μs (108 allocations: 4.38 KiB)

The vectors have to be long enough for the GPU to be worth it, though; see the sketch below.
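
One rough way to see that size dependence, assuming the two update! methods above are defined, is to time both versions at a few lengths. The compare helper below and the explicit CUDAdrv.synchronize() call (so the GPU timing covers the actual computation rather than just the kernel launch) are a sketch, not part of the original measurements.

using BenchmarkTools, CuArrays, CUDAdrv

# hypothetical helper: compare CPU and GPU timings for length n
function compare(n)
  s, c  = zeros(n), zeros(n)
  cedg  = rand((0, 1), n) .* randn(n)
  θ     = randn(n)
  cs, cc, ccedg, cθ = cu(s), cu(c), cu(cedg), cu(θ)
  t_cpu = @belapsed update!($s, $c, $cedg, $θ)
  t_gpu = @belapsed begin
    update!($cs, $cc, $ccedg, $cθ)
    CUDAdrv.synchronize()   # wait for the GPU kernels to finish
  end
  (n = n, cpu = t_cpu, gpu = t_gpu)
end

compare(1_000)     # at small sizes the launch overhead can dominate
compare(100_000)   # longer vectors amortize the overhead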

@baggepinnen: Thank you, this worked. I was trying to follow the tutorial on GPU programming with CuArrays and was trying to fit my functions to the ones in the example. One clarification: do I not have to specify the number of threads and blocks, or even use @cuda, for it to be executed on the GPU?
