Timing square function in CUDA

maleadt · December 11, 2018, 9:19pm

Yeah, I’m working with CuArrays master, where we’ve been moving towards the Base Array APIs.

No, it only includes data transfer times if that’s part of the expression you’re @benchmarking. The @sync makes sure asynchronous operations, like kernels, are included in the timings. If you want to include memory transfers, the benchmarked expression itself should allocate the GPU array, execute the kernel, and transfer the result back.
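To make the difference concrete, here is a rough sketch of the two kinds of measurement. The array size and the names `a` and `d_a` are just made up for illustration, and it assumes a CuArrays version that provides `CuArrays.@sync`, so treat it as a starting point rather than the exact code from this thread:

```julia
using CuArrays, BenchmarkTools

a   = rand(Float32, 1024, 1024)   # host array; the size is an arbitrary choice
d_a = CuArray(a)                  # uploaded once, outside of the benchmark

# Kernel only: the data already lives on the GPU, and @sync waits
# for the asynchronous kernel to finish before the timer stops.
@benchmark CuArrays.@sync $d_a .^ 2

# Allocate + execute + transfer back: the upload, the kernel and the
# download are all part of the timed expression. Copying the result
# to an Array has to wait for the kernel, so no explicit @sync here.
@benchmark Array(CuArray($a) .^ 2)
```

The `$` interpolation keeps BenchmarkTools from timing global variable lookups. The second number will typically be dominated by the transfers for a kernel this simple, but that depends on the array size and the hardware.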

Hard to be sure what’s happening there. You can always run that code under nvprof too. Maybe the allocator is acting up? After a while, the GPU allocation pool gets exhausted, which triggers a GC sweep and/or new actual allocations.

I’ve definitely seen the _fast versions of intrinsics execute, well, faster, but it depends on the application as well as the GPU. YMMV.
