CUDA: MVectors always allocate memory and cause "Out of Memory Error"

Try this simple code:

using CUDAnative
using CUDAdrv
using CuArrays
using StaticArrays  # provides MVector

function kernel(n, arr)
	for i in 1:n
		a = MVector{3,Float32}(undef)
		arr[i] = a[1] + 1.0f0
	end
	return
end

N1 = 1
N2 = 10000
arr1 = CuArray(Array{Float32,1}(undef, N2))
arr2 = CuArray(Array{Float32,1}(undef, N2))
@cuda threads = 10 blocks = 10 kernel(N1, arr1) # works on my machine
@cuda threads = 10 blocks = 10 kernel(N2, arr2) # fails on my machine with an out-of-memory error

It seems that every iteration of the loop allocates a new array for the MVector, which quickly exhausts GPU memory. This is inconsistent with the behavior on the CPU, which doesn't allocate at all:

N2 = 10000
arr = Array{Float32,1}(undef,N2)
@allocated kernel(10000,arr)

Some info:

CuArrays v1.0.2
CUDAdrv v3.0.0
CUDAnative v2.1.0

Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

Currently I have to allocate the MVector before entering the loop and set each element to zero at the beginning of each iteration to reinitialize it manually.
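For reference, a minimal sketch of that workaround, assuming the same setup as the example above (`kernel_workaround` is a hypothetical name, not part of any package):

```julia
using CUDAnative
using CUDAdrv
using CuArrays
using StaticArrays

function kernel_workaround(n, arr)
	# Allocate the MVector once, before the loop.
	a = MVector{3,Float32}(undef)
	for i in 1:n
		# Reinitialize it manually at the start of each iteration.
		for j in 1:3
			a[j] = 0.0f0
		end
		arr[i] = a[j = 1] + 1.0f0
	end
	return
end

arr = CuArray(Array{Float32,1}(undef, 10000))
@cuda threads = 10 blocks = 10 kernel_workaround(10000, arr)
```

Hoisting the allocation out of the loop avoids the per-iteration allocation, at the cost of having to reset the contents by hand.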

Could you try with Julia 1.2?

Yes, this code works fine in Julia 1.2.0-rc1.0. I guess I should move to this new version of Julia!