Column-wise reduction on a CUDA.CuArray matrix

As suggested here, I am trying to create a version of this findfirst kernel that operates over dimension 2 of input matrix xs and returns the first match for each vector along dimension 1.

I think a good approach is a column-wise reduction as discussed here.
Note that this is ported from a suggestion offered by one of the mods at the nvidia dev forum, who claims that it launches one thread per column, which I don’t

As a first step, I tried finding the first row of each column of arr > 0.5, with size(arr) constant:

function kernel(arr::CuDeviceArray,result::CuDeviceArray)
    thresh = Float32(0.5)
    nrow = Int32(3)
    ncol = Int32(100)
    idx = threadIdx().x+blocDim().x*blocIdx().x
    if idx < ncol
      res = Int32(-1)
      i = int32(0)
      while i < nrow && res == Int32(-1)
        if arr[i] > thresh
          res = i
        result[idx] = res

Now launch this setting nthreads to the number of rows and nblocks to the number of columns, so that all the columns get processed in parallel.
Get a weird error:

check = CUDA.rand(3,100)
nthreads = size(check)[1]
nblocks = size(check)[2]
rslt = CUDA.fill(Int32(0),size(check)[2])
@cuda threads=nthreads blocks=nblocks kernel(check,rslt)

ERROR: MethodError: no method matching kernel(::CuDeviceArray{Float32,2,CUDA.AS.Global}, ::CuDeviceArray{Int32,1,CUDA.AS.Global})

Obviously the CuArray arguments to kernel() are getting passed as CuDeviceArray with ptr::CUDA.AS.Global and CUDA.jl wanted some other value for the ptr parameter.

I’m now stuck on what the ptr parameter means and how CuArray becomes CuDeviceArray.