I tried executing the following function on the GPU, for arrays of length 10,000.
index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = blockDim().x * gridDim().x
@inbounds for l = index:stride:length(s)   # grid-stride loop
    s[l] = cedg[l] * sin(θ[l])             # body as in update! below
    c[l] = cedg[l] * cos(θ[l])
end
However, the performance is similar to that of the CPU version. Is there something wrong with the way it is written?
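(For reference: a grid-stride loop like the one above has to live inside a kernel function and be launched explicitly. A minimal sketch, assuming the current CUDA.jl API; the wrapper name update_kernel! and the thread/block counts are illustrative:)

using CUDA

function update_kernel!(s, c, cedg, θ)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    @inbounds for l = index:stride:length(s)
        s[l] = cedg[l] * sin(θ[l])
        c[l] = cedg[l] * cos(θ[l])
    end
    return nothing                     # kernels must not return a value
end

threads = 256                          # illustrative launch configuration
blocks = cld(length(cs), threads)      # enough blocks to cover the array
CUDA.@sync @cuda threads=threads blocks=blocks update_kernel!(cs, cc, ccedg, cθ)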
Sorry, I had posted a different version. I have now updated it with the one I use.
using CUDA   # provides cu (the thread itself used the older CuArrays package)
s = zeros(10000);
c = zeros(10000);
cedg = rand((0,1),10000) .* randn(10000);   # Gaussian values, randomly zeroed by the 0/1 mask
θ = randn(10000);
# device copies; note that cu converts to Float32 by default
cs = cu(s);
cc = cu(c);
ccedg = cu(cedg);
cθ = cu(θ);
function update!(s, c, cedg, θ)
    # broadcasting replaces the explicit `@inbounds for l = eachindex(s)` loop
    s .= cedg .* sin.(θ)
    c .= cedg .* cos.(θ)
    return nothing
end
julia> @btime update!($s,$c,$cedg,$θ);
132.067 μs (0 allocations: 0 bytes)
julia> @btime update!($cs,$cc,$ccedg,$cθ);
10.649 μs (108 allocations: 4.38 KiB)
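(One caveat when timing GPU code: kernel launches are asynchronous, so a benchmark like the one above can measure mostly launch overhead. Synchronizing before the timer stops gives the full execution time; a minimal sketch, assuming the current CUDA.jl API:)

using CUDA, BenchmarkTools
# synchronize so the measurement includes kernel execution, not just the launch
@btime CUDA.@sync update!($cs, $cc, $ccedg, $cθ);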
The vectors have to be long enough for the GPU to be worth it, though.
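(One way to find that crossover is to benchmark both versions over a range of sizes; a sketch under the same setup, with illustrative sizes:)

using CUDA, BenchmarkTools
for n in (1_000, 10_000, 100_000, 1_000_000)
    s, c = zeros(n), zeros(n)
    cedg, θ = rand((0,1), n) .* randn(n), randn(n)
    cs, cc, ccedg, cθ = cu(s), cu(c), cu(cedg), cu(θ)
    t_cpu = @belapsed update!($s, $c, $cedg, $θ)
    t_gpu = @belapsed CUDA.@sync update!($cs, $cc, $ccedg, $cθ)
    println("n = $n: CPU $(t_cpu)s, GPU $(t_gpu)s")
end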
@baggepinnen: Thank you, this worked. I was trying to follow the tutorial on GPU programming using CuArrays and was trying to fit my functions to the ones in the example. One clarification: do I not have to specify the number of threads and blocks, or even use @cuda, for it to be executed on the GPU?
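(For what it's worth: fused broadcasts over GPU arrays compile and launch kernels automatically, so @cuda with explicit thread/block counts is only needed for hand-written kernels like the one at the top of the thread. A minimal illustration, assuming the current CUDA.jl API:)

using CUDA

cedg, θ = cu(randn(1000)), cu(randn(1000))
s = similar(θ)
# broadcasting on CuArrays generates and launches the kernel itself;
# no @cuda call or thread/block configuration is required
s .= cedg .* sin.(θ)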