# Timing square function in CUDA

I am trying to compare whether there is any speed benefit to using Nvidia's `pow` function over my own squaring kernel.

Here is the code.

```julia
using CuArrays, CUDAnative, CUDAdrv

function square(a, ndrange)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= ndrange
        a[i] = a[i] * a[i]
    end
    return
end

function nv_pow(a, ndrange)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= ndrange
        a[i] = CUDAnative.pow(a[i], Int32(2))
    end
    return
end

dims = (1000, 1000)
a = rand(Float64, dims)
# display(a)
d_a = CuArray(a)

println("size of CuArray d_a : $(sizeof(d_a))")
ndrange = prod(dims)

# launch configuration (reconstructed from the discussion below)
threads = 256
blocks = cld(ndrange, threads)
println("blocks is $blocks")

my_time = CUDAdrv.@elapsed @cuda blocks=blocks threads=threads square(d_a, ndrange)
result = Array(d_a);

d_a2 = CuArray(a)
nv_time = CUDAdrv.@elapsed @cuda blocks=blocks threads=threads nv_pow(d_a2, ndrange)

println("my_time = $my_time, nv_time = $nv_time")
```

What puzzles me is that no matter how big I set the dimensions, 10x10 or 5000x5000, `CUDAdrv.@elapsed` always gives me essentially the same timings.
That is, `my_time` is always around 0.0505 s, and `nv_time` is always around 0.1279 s (on my machine).
The second puzzle is that `my_time` is faster than `nv_time`; I would think that even if Nvidia does not do any optimization, their routine should at least match my kernel.
Am I timing the actual kernel running time correctly?

See GPU randn way slower than rand?
I think you need `@sync` or the equivalent `CUDAdrv.synchronize()`.
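For instance, a host-side timer only gives meaningful numbers if the asynchronous kernel launch is fenced by a synchronization point. A minimal sketch of the idea (reusing the `square` kernel from the question; the launch configuration here is an assumption, and this is untested):

```julia
using CuArrays, CUDAnative, CUDAdrv

d_a = CuArray(rand(Float64, 1000, 1000))
ndrange = length(d_a)

# Kernel launches return immediately; synchronize() blocks the host
# until all queued GPU work has finished, so the timer is meaningful.
t = Base.@elapsed begin
    @cuda blocks=cld(ndrange, 256) threads=256 square(d_a, ndrange)
    CUDAdrv.synchronize()
end
```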

Couple of things:

• You're not warming up the compiler, so your timings include compilation time.
• If you don't understand the timings, run under `nvprof` (together with `CUDAdrv.@profile` and `nvprof --profile-from-start off` to get cleaner profiles), which always shows the raw kernel times. If there's any disparity, as there would have been here, the issue lies with the benchmark driver.
• For benchmarking such a simple operation, why not use CuArrays and BenchmarkTools? It would have avoided the issue altogether:
```julia
julia> A = CuArray{Float32}(undef, 1024, 1024);

julia> using BenchmarkTools, Statistics

julia> a = @benchmark CuArrays.@sync map!(x->x*x, A, A)
BenchmarkTools.Trial:
memory estimate:  3.58 KiB
allocs estimate:  66
--------------
minimum time:     19.625 ms (0.00% GC)
median time:      20.011 ms (0.00% GC)
mean time:        20.227 ms (0.00% GC)
maximum time:     29.341 ms (0.00% GC)
--------------
samples:          248
evals/sample:     1

julia> b = @benchmark CuArrays.@sync map!(x->CUDAnative.pow(x, Int32(2)), A, A);

julia> judge(median(b), median(a))
BenchmarkTools.TrialJudgement:
time:   +3.08% => invariant (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)

julia> c = @benchmark CuArrays.@sync map!(x->CUDAnative.pow(x, 2f0), A, A);

julia> judge(median(c), median(a))
BenchmarkTools.TrialJudgement:
time:   +6.19% => regression (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)

julia> d = @benchmark CuArrays.@sync map!(x->CUDAnative.pow_fast(x, 2f0), A, A);

julia> judge(median(d), median(a))
BenchmarkTools.TrialJudgement:
time:   +8.78% => regression (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)
```
• `I think even if Nvidia does not do any optimization, it should at least be the same as my kernel`: no, `pow` is a general routine intended to be invoked with unknown exponents, so it cannot be expected to match a hand-written `x * x`.
• @y4lu, `I think you need the @sync`: only if you measure from the host, e.g. using BenchmarkTools. The OP uses `CUDAdrv.@elapsed`, which inserts performance events into the command queue and thus measures GPU time directly. I typically recommend BenchmarkTools and `CuArrays.@sync` though, because it avoids issues like including compilation time, as happened in this thread.
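For reference, the `nvprof` workflow suggested above looks roughly like this (a sketch; the script name `bench.jl` and the launch configuration are hypothetical):

```julia
# In bench.jl: run each kernel once to compile it, then profile only
# the region of interest so compilation stays out of the profile.
@cuda blocks=blocks threads=threads square(d_a, ndrange)    # warm-up
@cuda blocks=blocks threads=threads nv_pow(d_a2, ndrange)   # warm-up

CUDAdrv.@profile begin
    @cuda blocks=blocks threads=threads square(d_a, ndrange)
    @cuda blocks=blocks threads=threads nv_pow(d_a2, ndrange)
end

# Then, from the shell, collect only the @profile-wrapped region:
#   nvprof --profile-from-start off julia bench.jl
```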

Hi Tim,

This is really helpful. I tried to insert another line to warm up the device but got the same result as before, so I don't think it is a warm-up issue.

But the nvprof is working. Thank you!

I tried your code with `@benchmark` and `CuArrays.@sync`, and I have several questions:

1. The first line didn't work for me; Julia said it couldn't find the method. I deleted the `undef` (like this: `julia> A = CuArray{Float32}(1024, 1024);`) and it worked. Not sure if it is new syntax; I was using Julia 1.0.2.
Also, `@benchmark` combined with `CuArrays.@sync` times the total execution, that is, it includes data transfer and kernel running time, right?

2. Also, I think some kind of cache-clearing function must be called in between the `@benchmark` lines, because I tried this:

```julia
julia> using BenchmarkTools, CuArrays
julia> A = CuArray{Float32}(1024, 1024);
julia> a = @benchmark CuArrays.@sync map!(x->x*x, A, A)
BenchmarkTools.Trial:
memory estimate:  3.50 KiB
allocs estimate:  64
--------------
minimum time:     3.325 ms (0.00% GC)
median time:      4.028 ms (0.00% GC)
mean time:        4.020 ms (0.00% GC)
maximum time:     4.749 ms (0.00% GC)
--------------
samples:          1243
evals/sample:     1

julia> b = @benchmark CuArrays.@sync map!(x->x*x, A, A)
BenchmarkTools.Trial:
memory estimate:  3.50 KiB
allocs estimate:  64
--------------
minimum time:     5.860 ms (0.00% GC)
median time:      6.275 ms (0.00% GC)
mean time:        6.275 ms (0.00% GC)
maximum time:     7.566 ms (0.00% GC)
--------------
samples:          796
evals/sample:     1

julia> c = @benchmark CuArrays.@sync map!(x->x*x, A, A)
BenchmarkTools.Trial:
memory estimate:  3.50 KiB
allocs estimate:  64
--------------
minimum time:     7.549 ms (0.00% GC)
median time:      7.930 ms (0.00% GC)
mean time:        7.941 ms (0.00% GC)
maximum time:     9.529 ms (0.00% GC)
--------------
samples:          630
evals/sample:     1
```

Note that I called the same function each time, yet the later runs report increasingly slower times.

3. Last question: I don't see the speed difference between `pow` and `pow_fast`. Do you have a different experience?

Yeah I’m working with CuArrays master, where we’ve been moving towards Base Array APIs.

No, it only includes data transfer times if that’s part of the expression you’re `@benchmark`ing. The `@sync` makes sure asynchronous operations, like kernels, are included in the timings. If you want to include memory transfers you should allocate/execute/transfer back.
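As a sketch of that distinction (hypothetical sizes, untested): compare benchmarking just the synchronized kernel against benchmarking the whole allocate/execute/transfer round trip:

```julia
using BenchmarkTools, CuArrays

h = rand(Float32, 1024, 1024)
A = CuArray(h)

# Kernel time only: the upload happened outside the benchmarked expression.
@benchmark CuArrays.@sync map!(x -> x * x, $A, $A)

# End-to-end time: upload, compute, and download are all inside it.
@benchmark begin
    B = CuArray($h)                        # host → device
    CuArrays.@sync map!(x -> x * x, B, B)  # compute
    Array(B)                               # device → host
end
```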

Hard to be sure what's happening there. You can always run that code under `nvprof` too. Maybe the allocator is acting up? After a while the GPU allocation pool gets exhausted, which triggers a GC sweep and/or new actual allocations.
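One way to test that allocator hypothesis (an assumption, not something verified in this thread) is to force a collection between trials and see whether the drift disappears:

```julia
# Between @benchmark runs: a full GC sweep lets freed CuArrays be
# returned to the pool, so the next trial starts from a clean slate.
GC.gc(true)
```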

I’ve definitely seen the `_fast` versions of intrinsics execute, well, faster, but it depends on the application as well as the GPU. YMMV.