Program gives different results when run in the REPL and under Nsight Compute

using CUDA

index = CuArray{UInt32}([1])
function ker(index)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    @cuprintln "i: " i
    @cuprintln "index: " index[]
    @cuprintln "old_index: " index[] += 1

    return nothing
end
@cuda threads=1 blocks=1 ker(index)

In the REPL this prints

i: 1
index: 1
old_index: 2

whereas in Nsight Compute it prints

i: 1
index: 1
old_index: 2
i: 1
index: 2
old_index: 3
i: 1
index: 3
old_index: 4
i: 1
index: 4
old_index: 5
i: 1
index: 5
old_index: 6
i: 1
index: 6
old_index: 7
i: 1
index: 7
old_index: 8
i: 1
index: 8
old_index: 9
i: 1
index: 9
old_index: 10
i: 1
index: 10
old_index: 11
i: 1
index: 11
old_index: 12
i: 1
index: 12
old_index: 13
i: 1
index: 13
old_index: 14
i: 1
index: 14
old_index: 15
i: 1
index: 15
old_index: 16
i: 1
index: 16
old_index: 17
i: 1
index: 17
old_index: 18
i: 1
index: 18
old_index: 19
i: 1
index: 19
old_index: 20
i: 1
index: 20
old_index: 21
i: 1
index: 21
old_index: 22
i: 1
index: 22
old_index: 23
i: 1
index: 23
old_index: 24
i: 1
index: 24
old_index: 25
i: 1
index: 25
old_index: 26
i: 1
index: 26
old_index: 27
i: 1
index: 27
old_index: 28
i: 1
index: 28
old_index: 29
i: 1
index: 29
old_index: 30
i: 1
index: 30
old_index: 31
i: 1
index: 31
old_index: 32
i: 1
index: 32
old_index: 33
i: 1
index: 33
old_index: 34
i: 1
index: 34
old_index: 35
i: 1
index: 35
old_index: 36
i: 1
index: 36
old_index: 37
i: 1
index: 37
old_index: 38
i: 1
index: 38
old_index: 39
i: 1
index: 39
old_index: 40
i: 1
index: 40
old_index: 41
i: 1
index: 41
old_index: 42
i: 1
index: 42
old_index: 43
i: 1
index: 43
old_index: 44
i: 1
index: 44
old_index: 45

It seems Nsight Compute reruns the kernel many times, so index keeps incrementing across runs and ends up with the wrong value.
Is there a way to fix this? I’m using index to index into data, and this causes out-of-bounds errors under Nsight Compute. In case it matters, I’m stuck on Nsight Compute version 2019.5.1 because I’m using a GTX 1070.

Doing

index = CuArray{UInt32}([1])
function ker(index)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if i == 1
        index[] = 1
    end
    @cuprintln "i: " i
    @cuprintln "index: " index[]
    @cuprintln "old_index: " index[] += 1

    return nothing
end
@cuda threads=1 blocks=1 ker(index)

seems to work, even with multiple threads.

EDIT: this is not reliable. The i == 1 branch is not guaranteed to execute first, i.e. the thread with i = 2 might run before it and increment index before the reset, so this can give incorrect results.
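
One possible way around the ordering problem, as a rough sketch rather than a confirmed fix: reset the counter on the host before each launch instead of inside the kernel, claim slots with an atomic add, and bounds-check the claimed slot so a counter that has grown too large cannot write out of bounds. The names claim_slots!, out and counter are made up for illustration, and this does not stop Nsight Compute from rerunning the kernel; it only keeps the extra runs from indexing past the end of the data.

```julia
using CUDA

# Hypothetical kernel: each thread claims one slot in `out` through a shared counter.
function claim_slots!(out, counter)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    # atomic_add! returns the value the counter held before the increment,
    # so each thread gets a distinct slot without a read-modify-write race.
    slot = CUDA.atomic_add!(pointer(counter), UInt32(1))
    # Guard the write: if the counter is larger than expected (for example
    # because the kernel was run again without a reset), skip instead of
    # writing past the end of `out`.
    if slot <= length(out)
        @inbounds out[slot] = i
    end
    return nothing
end

out = CUDA.zeros(Int32, 4)
counter = CuArray{UInt32}([1])   # reset on the host before every launch
@cuda threads=4 blocks=1 claim_slots!(out, counter)
synchronize()
```

The order in which threads claim slots is still unspecified, so out ends up holding the thread ids in some arbitrary order. If the slot can instead be computed directly from i, dropping the shared counter altogether would make the kernel give the same result no matter how many times it is run.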