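See "GPU randn way slower than rand?". I think you need @sync, or equivalently a call to CUDAdrv.synchronize(), around what you are timing. GPU calls return as soon as the kernel is queued, so without a sync the benchmark mostly measures the launch, not the work.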
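Something like the following is what I mean. It is only a rough sketch: it assumes CuArrays, CUDAdrv and BenchmarkTools are installed, that a CUDA-capable GPU is available, and that randn!/rand! from Random work on a CuArray in your CuArrays version. The array size is made up for illustration.

```julia
# Sketch only: assumes CuArrays, CUDAdrv, BenchmarkTools and a CUDA GPU.
using CuArrays, CUDAdrv, BenchmarkTools, Random

# Hypothetical size, chosen just to give the GPU some work.
x = CuArray{Float32}(undef, 4096, 4096)

# No sync: the call returns once the kernel is queued,
# so this mostly times the launch overhead.
@btime randn!($x);

# With the macro: waits for the GPU to finish before the timer stops.
@btime CuArrays.@sync randn!($x);
@btime CuArrays.@sync rand!($x);

# Equivalent form with an explicit synchronize call.
@btime begin
    randn!($x)
    CUDAdrv.synchronize()
end;
```

Compare the synced randn! and rand! numbers, not the unsynced ones. If you are on the newer CUDA.jl package, the macro is spelled CUDA.@sync instead.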