Timing square function in CUDA

I am trying to see whether there is any speed benefit to using Nvidia's pow function (CUDAnative.pow) over my own square kernel.

Here is the code.

using CuArrays, CUDAnative, CUDAdrv

function square(a, ndrange)
    # global linear thread index
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= ndrange    # guard against threads past the end of the array
        a[i] = a[i] * a[i]
    end
    return
end

function nv_pow(a, ndrange)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= ndrange
        a[i] = CUDAnative.pow(a[i], Int32(2))   # square via the pow routine
    end
    return
end

dims = (1000,1000)
a = rand(Float64,dims)
# display(a)
d_a = CuArray(a)

println("size of CuArray d_a : $(sizeof(d_a))")
ndrange = prod(dims)
threads=32
blocks = max(Int(ceil(ndrange/threads)), 1)
println("blocks is $blocks")
my_time = CUDAdrv.@elapsed @cuda blocks=blocks threads=threads square(d_a, ndrange);
result = Array(d_a);
d_a2 = CuArray(a)
nv_time = CUDAdrv.@elapsed @cuda blocks=blocks threads=threads nv_pow(d_a2, ndrange);
println("my_time = $my_time, nv_time = $nv_time")

What puzzles me is that no matter how big I make the array, 10x10 or 5000x5000, CUDAdrv.@elapsed always gives me essentially the same timing for each kernel: my_time is always around 0.0505 s, and nv_time is always around 0.1279 s (on my machine).
The second puzzle is that my_time is faster than nv_time; I would think that even if Nvidia did no optimization at all, pow should be at least as fast as my kernel.
Am I timing the actual kernel running time correctly?

See GPU randn way slower than rand?
I think you need @sync or the equivalent CUDAdrv.synchronize().
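For example (untested, reusing the variables from your script), something along these lines:

my_time = CUDAdrv.@elapsed begin
    @cuda blocks=blocks threads=threads square(d_a, ndrange)
    CUDAdrv.synchronize()   # block until the kernel has actually finished
end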

Couple of things:

  • you’re not warming up the compiler, so your timings include compilation time (see the sketch at the end of this list).
  • if you don’t understand the timings, run under nvprof (together with CUDAdrv.@profile and nvprof --profile-from-start off to get cleaner profiles), which always shows the raw kernel times. If there’s any disparity, as there would have been here, the issue lies with the benchmark driver.
  • for benchmarking such a simple operation, why not use CuArrays and BenchmarkTools? It would have avoided the issue altogether:
julia> A = CuArray{Float32}(undef, 1024, 1024);

julia> using BenchmarkTools, Statistics

julia> a = @benchmark CuArrays.@sync map!(x->x*x, A, A)
BenchmarkTools.Trial: 
  memory estimate:  3.58 KiB
  allocs estimate:  66
  --------------
  minimum time:     19.625 ms (0.00% GC)
  median time:      20.011 ms (0.00% GC)
  mean time:        20.227 ms (0.00% GC)
  maximum time:     29.341 ms (0.00% GC)
  --------------
  samples:          248
  evals/sample:     1

julia> b = @benchmark CuArrays.@sync map!(x->CUDAnative.pow(x, Int32(2)), A, A);

julia> judge(median(b), median(a))
BenchmarkTools.TrialJudgement: 
  time:   +3.08% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

julia> c = @benchmark CuArrays.@sync map!(x->CUDAnative.pow(x, 2f0), A, A);

julia> judge(median(c), median(a))
BenchmarkTools.TrialJudgement: 
  time:   +6.19% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

julia> d = @benchmark CuArrays.@sync map!(x->CUDAnative.pow_fast(x, 2f0), A, A);

julia> judge(median(d), median(a))
BenchmarkTools.TrialJudgement: 
  time:   +8.78% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)
  • “I think even if Nvidia does not do any optimization, it should at least be the same as my kernel”: no, pow is intended to be invoked with unknown exponents.
  • @y4lu, “I think you need the @sync”: only if you measure from the host, e.g. using BenchmarkTools. The OP uses CUDAdrv.@elapsed, which inserts performance events in the command queue and thus directly measures GPU time. I typically recommend BenchmarkTools and CuArrays.@sync though, because that avoids issues like including compilation time, as happened in this thread.
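For instance, to take compilation out of the original measurements (a rough, untested sketch reusing the arrays and launch configuration from the first post):

# launch each kernel once so that compilation happens before the timed runs
@cuda blocks=blocks threads=threads square(d_a, ndrange)
@cuda blocks=blocks threads=threads nv_pow(d_a2, ndrange)

# the events recorded by CUDAdrv.@elapsed now only bracket the kernel itself
my_time = CUDAdrv.@elapsed @cuda blocks=blocks threads=threads square(d_a, ndrange)
nv_time = CUDAdrv.@elapsed @cuda blocks=blocks threads=threads nv_pow(d_a2, ndrange)

# when running under `nvprof --profile-from-start off julia script.jl`,
# only the region wrapped in CUDAdrv.@profile shows up in the profile
CUDAdrv.@profile begin
    @cuda blocks=blocks threads=threads square(d_a, ndrange)
    @cuda blocks=blocks threads=threads nv_pow(d_a2, ndrange)
end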

Hi Tim,

This is really helpful. I tried inserting another line to warm up the device, but I got the same result as before, so I don’t think it is a warm-up issue.

But nvprof is working. Thank you!

I tried your code with @benchmark and CuArrays.@sync, and I have several questions:

  1. The first line (A = CuArray{Float32}(undef, 1024, 1024);) didn’t work for me, saying that Julia couldn’t find the method. I deleted the undef (like this: julia> A = CuArray{Float32}(1024, 1024);) and it worked. Not sure if that is new syntax; I was using Julia 1.0.2.
    So @benchmark combined with CuArrays.@sync times the total execution, that is, it includes both data transfer and kernel running time, right?

  2. Also, I think some kind of cache-clearing function must need to be called in between the @benchmark lines, because I tried this:

julia> using BenchmarkTools, CuArrays
julia> A = CuArray{Float32}(1024, 1024);
julia> a = @benchmark CuArrays.@sync map!(x->x*x, A, A)
BenchmarkTools.Trial:
  memory estimate:  3.50 KiB
  allocs estimate:  64
  --------------
  minimum time:     3.325 ms (0.00% GC)
  median time:      4.028 ms (0.00% GC)
  mean time:        4.020 ms (0.00% GC)
  maximum time:     4.749 ms (0.00% GC)
  --------------
  samples:          1243
  evals/sample:     1

julia> b = @benchmark CuArrays.@sync map!(x->x*x, A, A)
BenchmarkTools.Trial:
  memory estimate:  3.50 KiB
  allocs estimate:  64
  --------------
  minimum time:     5.860 ms (0.00% GC)
  median time:      6.275 ms (0.00% GC)
  mean time:        6.275 ms (0.00% GC)
  maximum time:     7.566 ms (0.00% GC)
  --------------
  samples:          796
  evals/sample:     1

julia> c = @benchmark CuArrays.@sync map!(x->x*x, A, A)
BenchmarkTools.Trial:
  memory estimate:  3.50 KiB
  allocs estimate:  64
  --------------
  minimum time:     7.549 ms (0.00% GC)
  median time:      7.930 ms (0.00% GC)
  mean time:        7.941 ms (0.00% GC)
  maximum time:     9.529 ms (0.00% GC)
  --------------
  samples:          630
  evals/sample:     1

Note that I called the same function each time, but the runs that execute later report increasingly longer times.

  3. Last question: I don’t see a speed difference between pow and pow_fast. Do you have a different experience?

Yeah I’m working with CuArrays master, where we’ve been moving towards Base Array APIs.
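Concretely, these two spellings allocate the same uninitialized 1024×1024 buffer; which one is accepted just depends on the CuArrays version you have installed:

A = CuArray{Float32}(undef, 1024, 1024)   # CuArrays master, Base-style constructor
A = CuArray{Float32}(1024, 1024)          # released CuArrays, older positional form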

No, it only includes data-transfer time if the transfer is part of the expression you’re @benchmarking. The @sync makes sure asynchronous operations, like kernels, are included in the timings. If you want to include memory transfers, you should allocate, execute, and transfer back inside the benchmarked expression.
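For example (untested sketch), including the transfers in the timed expression would look something like:

h = rand(Float32, 1024, 1024)

# only the kernel (plus the sync) is timed here
@benchmark CuArrays.@sync map!(x->x*x, A, A)

# here the upload, the kernel, and the download are all part of the timed expression
@benchmark begin
    d = CuArray(h)                       # host -> device transfer
    CuArrays.@sync map!(x->x*x, d, d)    # kernel
    Array(d)                             # device -> host transfer
end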

Hard to be sure what’s happening there; you can always run that code under nvprof too. Maybe the allocator is acting up? After a while the GPU allocation pool will have been exhausted, which triggers a GC sweep and/or new actual allocations.
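One way to test that guess (again, just a sketch; whether it helps depends on your CuArrays version) is to force a garbage collection between the trials and see whether the times stay flat:

a = @benchmark CuArrays.@sync map!(x->x*x, A, A)
GC.gc()   # frees unreferenced CuArrays, giving their buffers back to the pool
b = @benchmark CuArrays.@sync map!(x->x*x, A, A)
GC.gc()
c = @benchmark CuArrays.@sync map!(x->x*x, A, A)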

I’ve definitely seen the _fast versions of intrinsics execute, well, faster, but it depends on the application as well as the GPU. YMMV.
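If you want to check what you’re actually getting on your GPU, CUDAnative’s reflection macros can dump the generated device code; for your original kernel that would be something like (untested sketch):

# prints the PTX generated for the launched kernel, so you can see whether the
# pow call becomes a fast intrinsic or a full libdevice pow routine
CUDAnative.@device_code_ptx @cuda blocks=blocks threads=threads nv_pow(d_a2, ndrange)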