Timing square function in CUDA

cuchxq · December 7, 2018, 10:45pm

I am trying to compare if there is any speed benefit to use Nvidia’s math pow function than my own square kernel.

Here is the code.

using CuArrays, CUDAnative, CUDAdrv

function square(a, ndrange) 
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if(i <  ndrange+1)       
        a[i] = a[i] * a[i]
    end
    return 
end

function nv_pow(a, ndrange)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if(i <  ndrange+1) 
        a[i] = CUDAnative.pow(a[i], Int32(2))
    end
    return
end

dims = (1000,1000)
a = rand(Float64,dims)
# display(a)
d_a = CuArray(a)

println("size of CuArray d_a : $(sizeof(d_a))")
ndrange = prod(dims)
threads=32
blocks = max(Int(ceil(ndrange/threads)), 1)
println("blocks is $blocks")
my_time = CUDAdrv.@elapsed @cuda blocks=blocks threads=threads square(d_a, ndrange);
result = Array(d_a);
d_a2 = CuArray(a)
nv_time = CUDAdrv.@elapsed @cuda blocks=blocks threads=threads nv_pow(d_a2, ndrange);
println("my_time = $my_time, nv_time = $nv_time")

What puzzles me is that no matter how big I set the dimension, 10x10, or 5000x5000, CUDAdrv.@elapsed always gives me the same timing respectively.
That is, my_time is always around 0.0505 s, and nv_time is always around 0.1279s (on my machine).
Also the second puzzle is my_time is faster than nv_time, I think even if Nvidia does not do any optimization, it should at least be the same as my kernel.
Am I timing the actual kernel running time correctly?

y4lu · December 8, 2018, 3:41am

See GPU randn way slower than rand?
I think you need the @sync or equivalent CUDAdrv.synchronize()

maleadt · December 8, 2018, 9:11am

Couple of things:

you’re not warming up the compiler, so your timings include compilation time.
if you don’t understand the timings, run under nvprof (together with CUDAdrv.@profile and nvprof --profile-from-start off to get cleaner profiles) which always shows the raw kernel times. if there’s any disparity, as there would have been here, the issue lies with the benchmark driver
for benchmarking such a simple operation, why not use CuArrays and BenchmarkTools? It would have avoided the issue altogether:

julia> A = CuArray{Float32}(undef, 1024, 1024);

julia> using BenchmarkTools, Statistics

julia> a = @benchmark CuArrays.@sync map!(x->x*x, A, A)
BenchmarkTools.Trial: 
  memory estimate:  3.58 KiB
  allocs estimate:  66
  --------------
  minimum time:     19.625 ms (0.00% GC)
  median time:      20.011 ms (0.00% GC)
  mean time:        20.227 ms (0.00% GC)
  maximum time:     29.341 ms (0.00% GC)
  --------------
  samples:          248
  evals/sample:     1

julia> b = @benchmark CuArrays.@sync map!(x->CUDAnative.pow(x, Int32(2)), A, A);

julia> judge(median(b), median(a))
BenchmarkTools.TrialJudgement: 
  time:   +3.08% => invariant (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

julia> c = @benchmark CuArrays.@sync map!(x->CUDAnative.pow(x, 2f0), A, A);

julia> judge(median(c), median(a))
BenchmarkTools.TrialJudgement: 
  time:   +6.19% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

julia> d = @benchmark CuArrays.@sync map!(x->CUDAnative.pow_fast(x, 2f0), A, A);

julia> judge(median(d), median(a))
BenchmarkTools.TrialJudgement: 
  time:   +8.78% => regression (5.00% tolerance)
  memory: +0.00% => invariant (1.00% tolerance)

I think even if Nvidia does not do any optimization, it should at least be the same as my kernel: no, pow is intended to be invoked with unknown exponents.
@y4lu, I think you need the @sync: only if you measure from the host, eg. using BenchmarkTools. OP uses CUDAdrv.@elapsed which inserts performance events in the command queue, which directly measure GPU times. I typically recommend using BenchmarkTools and CuArrays.@sync though, because it avoids issues like including compilation time as in this thread.

cuchxq · December 11, 2018, 5:31pm

Hi Tim,

This is really helpful. I tried to insert another line to warm up the device but got same result as I did before. So I don’t think it is a warm up issue.

But the nvprof is working. Thank you!

I tried you code to use benchmark and CuArrray.@sync. And I have several questions:

This line didn’t work for me, saying that Julia couldn’t find the function. I deleted the undef (like this Julia> A = CuArray{Float32}(1024, 1024);) and it worked. Not sure if it is a new syntax. I was using Julia 1.0.2.
So the @benchmark combined with CuArrays.@sync would time the total time of execution, that is, it includes data transfer and kernel running time, right?
Also, I think there must be some type of clearing cache function to be called in between the @benchmark lines. Because I tried this

julia> using BenchmarkTools, CuArrays
julia> A = CuArray{Float32}(1024, 1024);
julia> a = @benchmark CuArrays.@sync map!(x->x*x, A, A)
BenchmarkTools.Trial:
  memory estimate:  3.50 KiB
  allocs estimate:  64
  --------------
  minimum time:     3.325 ms (0.00% GC)
  median time:      4.028 ms (0.00% GC)
  mean time:        4.020 ms (0.00% GC)
  maximum time:     4.749 ms (0.00% GC)
  --------------
  samples:          1243
  evals/sample:     1

julia> b = @benchmark CuArrays.@sync map!(x->x*x, A, A)
BenchmarkTools.Trial:
  memory estimate:  3.50 KiB
  allocs estimate:  64
  --------------
  minimum time:     5.860 ms (0.00% GC)
  median time:      6.275 ms (0.00% GC)
  mean time:        6.275 ms (0.00% GC)
  maximum time:     7.566 ms (0.00% GC)
  --------------
  samples:          796
  evals/sample:     1

julia> c = @benchmark CuArrays.@sync map!(x->x*x, A, A)
BenchmarkTools.Trial:
  memory estimate:  3.50 KiB
  allocs estimate:  64
  --------------
  minimum time:     7.549 ms (0.00% GC)
  median time:      7.930 ms (0.00% GC)
  mean time:        7.941 ms (0.00% GC)
  maximum time:     9.529 ms (0.00% GC)
  --------------
  samples:          630
  evals/sample:     1

See that I called the same function, but got increased time for functions that are executed later.

Last question, I don’t see the speed difference between pow and pow_fast. Do you have a different experience?

maleadt · December 11, 2018, 9:19pm

Yeah I’m working with CuArrays master, where we’ve been moving towards Base Array APIs.

No, it only includes data transfer times if that’s part of the expression you’re @benchmarking. The @sync makes sure asynchronous operations, like kernels, are included in the timings. If you want to include memory transfers you should allocate/execute/transfer back.

Hard to be sure what’s happening there. You can always run that code under nvprof too. Maybe the allocator acting up? After a while, the GPU allocation pool will have been exhausted which causes a GC sweep and/or new actual allocations.

I’ve definitely seen the _fast versions of intrinsics execute, well, faster, but it depends on the application as well as the GPU. YMMV.

Topic		Replies	Views
What is the optimal way of updating CuArray? GPU cudanative	7	1507	July 5, 2018
GPU kernel optimization (GPU vs CPU) GPU	3	1509	December 14, 2018
Most efficient way of _waiting_ for GPU results? GPU	20	3041	January 31, 2019
Slow first run inside functions GPU	5	1645	February 4, 2019
GPU randn way slower than rand? Performance gpu , cuda	6	1584	December 3, 2018

Timing square function in CUDA

Related topics