Timing square function in CUDA

Yeah, I’m working with CuArrays master, where we’ve been moving towards the Base Array APIs.

No, it only includes data transfer times if the transfers are part of the expression you’re @benchmarking. The @sync makes sure asynchronous operations, like kernels, are included in the timings. If you want to include memory transfers, benchmark the whole sequence: allocate on the GPU, execute, and transfer the result back.
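To make the difference concrete, here is a minimal sketch of both measurements. It assumes a working CuArrays + BenchmarkTools setup; the array size and the names `x` and `d_x` are made up for illustration, and `.^ 2` just stands in for whatever square function you are timing.

```julia
using CuArrays, BenchmarkTools

x   = rand(Float32, 1024, 1024)   # host data (size chosen arbitrarily)
d_x = CuArray(x)                  # uploaded once, outside the benchmark

# Kernel only: the data already lives on the GPU, and @sync waits for
# the asynchronous kernel to finish so its run time is actually counted.
@btime CuArrays.@sync $d_x .^ 2;

# Upload + kernel + download: the transfers are part of the benchmarked
# expression, so they show up in the timing. Copying back with Array(...)
# has to wait for the result, so no extra @sync is needed here.
@btime Array(CuArray($x) .^ 2);
```

For small arrays the second number can be dominated by the transfers rather than the kernel, so the two timings answer different questions.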

Hard to be sure what’s happening there. You can always run that code under nvprof too. Maybe the allocator is acting up? After a while the GPU allocation pool gets exhausted, which triggers a GC sweep and/or actual new allocations, and those show up as slower samples.

I’ve definitely seen the _fast versions of intrinsics execute, well, faster, but it depends on the application as well as the GPU. YMMV.
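If you want to check this on your own hardware, something along these lines should do. This is only a sketch: it assumes CUDAnative exposes `sin` and `sin_fast` for `Float32` and that they can be broadcast over a `CuArray`, and `d_x` is again a made-up name. Keep in mind that the fast intrinsics trade accuracy for speed, so compare the results as well as the timings.

```julia
using CuArrays, CUDAnative, BenchmarkTools

d_x = CuArray(rand(Float32, 1024, 1024))   # illustrative input

# Regular libdevice sine.
@btime CuArrays.@sync CUDAnative.sin.($d_x);

# Fast, reduced-accuracy intrinsic (assumed to be available as sin_fast).
@btime CuArrays.@sync CUDAnative.sin_fast.($d_x);
```

For a simple elementwise kernel like this one, memory bandwidth may well be the bottleneck, in which case the two timings can come out nearly identical.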