I am trying to see whether there is any speed benefit to using Nvidia's math `pow` function (`CUDAnative.pow`) over my own square kernel.
Here is the code.
```julia
using CuArrays, CUDAnative, CUDAdrv

function square(a, ndrange)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= ndrange
        a[i] = a[i] * a[i]
    end
    return
end

function nv_pow(a, ndrange)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= ndrange
        a[i] = CUDAnative.pow(a[i], Int32(2))
    end
    return
end

dims = (1000, 1000)
a = rand(Float64, dims)
# display(a)
d_a = CuArray(a)
println("size of CuArray d_a : $(sizeof(d_a))")

ndrange = prod(dims)
threads = 32
blocks = max(Int(ceil(ndrange / threads)), 1)
println("blocks is $blocks")

my_time = CUDAdrv.@elapsed @cuda blocks=blocks threads=threads square(d_a, ndrange)
result = Array(d_a)

d_a2 = CuArray(a)
nv_time = CUDAdrv.@elapsed @cuda blocks=blocks threads=threads nv_pow(d_a2, ndrange)

println("my_time = $my_time, nv_time = $nv_time")
```
What puzzles me is that no matter how big I make the dimensions, 10x10 or 5000x5000, `CUDAdrv.@elapsed` always reports roughly the same timings: `my_time` is always around 0.0505 s and `nv_time` is always around 0.1279 s (on my machine).
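Could the first launch be including kernel compilation in the measurement? If so, I assume something like the following would warm each kernel up first and only time a later launch. This is just an untested sketch reusing the variables defined above; `d_warm` is a throwaway array I made up, and I am assuming `CUDAdrv.synchronize()` is the right call to wait for the warm-up launch:

```julia
# Untested sketch: the first @cuda call also compiles the kernel, so launch
# once as a warm-up, wait for it, and then measure a fresh launch.
d_warm = CuArray(a)
@cuda blocks=blocks threads=threads square(d_warm, ndrange)   # warm-up / compilation
CUDAdrv.synchronize()                                          # wait for the warm-up to finish

d_a = CuArray(a)                                               # fresh input for the timed run
my_time = CUDAdrv.@elapsed @cuda blocks=blocks threads=threads square(d_a, ndrange)
```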
The second puzzle is that `my_time` is faster than `nv_time`. I would think that even if Nvidia's `pow` does no special optimization, it should at least be as fast as my kernel.
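In case it matters, I assume one could also compare what the two kernels actually compile to. Here is an untested sketch using what I believe are CUDAnative's reflection macros (`@device_code_ptx` is not in my code above, so please correct me if that is not the right tool):

```julia
# Untested sketch: dump the generated PTX for both kernels to compare what
# CUDAnative.pow(x, Int32(2)) lowers to versus a plain multiply.
CUDAnative.@device_code_ptx @cuda blocks=blocks threads=threads square(d_a, ndrange)
CUDAnative.@device_code_ptx @cuda blocks=blocks threads=threads nv_pow(d_a2, ndrange)
```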
Am I timing the actual kernel running time correctly?