Yeah, I’m working with CuArrays master, where we’ve been moving towards Base Array APIs.
No, it only includes data transfer times if that’s part of the expression you’re @benchmarking. The @sync makes sure asynchronous operations, like kernels, are included in the timings. If you want to include memory transfers, the benchmarked expression itself should allocate the GPU array, execute, and transfer the result back.
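To make the difference concrete, here is a rough sketch of the two ways of timing. The array size and the sin broadcast are just placeholders I picked for illustration, and I'm assuming the @sync in question is CuArrays.@sync:

```julia
using BenchmarkTools, CuArrays

a = rand(Float32, 2^20)   # host data; the size is arbitrary

# Kernel only: the data already lives on the GPU, and @sync waits
# for the asynchronous kernel to finish before the timer stops.
d_a = CuArray(a)
@benchmark CuArrays.@sync sin.($d_a)

# Allocation + upload + kernel + download: everything happens inside
# the benchmarked expression. Copying back with Array(...) blocks until
# the result is available, so no explicit @sync is used here.
@benchmark Array(sin.(CuArray($a)))
```

The second number will typically be dominated by the transfers for cheap element-wise operations like this one, so it mostly depends on what you actually want to measure.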
Hard to be sure what’s happening there. You can always run that code under nvprof too. Maybe it’s the allocator acting up? After a while the GPU allocation pool gets exhausted, which causes a GC sweep and/or new actual allocations.
I’ve definitely seen the _fast versions of intrinsics execute, well, faster, but it depends on the application as well as the GPU. YMMV.
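If you want to check this on your own hardware, something along these lines should do. It's only a sketch: I'm assuming the CUDAnative.sin / CUDAnative.sin_fast pair as the example intrinsic, and the input size and range are arbitrary:

```julia
using BenchmarkTools, CuArrays, CUDAnative

d_x = CuArray(rand(Float32, 2^20))   # inputs in [0, 1); arbitrary choice

# Regular intrinsic vs. the reduced-accuracy _fast variant.
@benchmark CuArrays.@sync CUDAnative.sin.($d_x)
@benchmark CuArrays.@sync CUDAnative.sin_fast.($d_x)

# The _fast versions trade accuracy for speed, so also look at the
# error against the CPU result before deciding to use them.
max_err = maximum(abs.(Array(CUDAnative.sin_fast.(d_x)) .- sin.(Array(d_x))))
```

For a memory-bound broadcast like this the difference may well disappear in the noise; the gap tends to show up in kernels that do a lot of math per element.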