Timing a square function in CUDA

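See the thread "GPU randn way slower than rand?"
I think you need @sync, or equivalently a call to CUDAdrv.synchronize(), around the code you are timing. GPU calls return as soon as the kernel is launched, so without synchronizing you only measure the launch and not the actual computation.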
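
To make the difference concrete, here is a minimal sketch of both timings. It assumes the current CUDA.jl package (which absorbed CUDAdrv, so the calls are spelled CUDA.@sync and CUDA.synchronize()) and BenchmarkTools. The array x and the function square are hypothetical stand-ins for whatever you are actually timing.

```julia
using CUDA            # assumption: current CUDA.jl rather than the older CUDAdrv/CuArrays split
using BenchmarkTools

x = CUDA.rand(Float32, 1024, 1024)   # hypothetical input array living on the GPU

square(a) = a .^ 2                   # hypothetical function being timed

CUDA.@sync square(x)                 # warm-up call so compilation is not part of the timings

# Without synchronization: the timer can stop before the GPU has finished,
# so this mostly reflects the cost of launching the work.
t_launch = @elapsed square(x)
CUDA.synchronize()                   # drain pending work before the next measurement

# With synchronization: the timer stops only after the GPU has finished.
t_total = @elapsed CUDA.@sync square(x)

println("without sync: ", t_launch, " s")
println("with sync:    ", t_total, " s")

# BenchmarkTools variant; `$x` interpolates the global so it is not benchmarked as an untyped global.
@btime CUDA.@sync square($x)
```

If the two numbers are far apart, the unsynchronized one was probably not measuring the computation at all.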