Column-wise reduction on a CUDA.CuArray matrix

As suggested here, I am trying to create a version of this findfirst kernel that operates over dimension 2 of input matrix xs and returns the first match for each vector along dimension 1.

I think a good approach is a column-wise reduction as discussed here.
Note that this is ported from a suggestion offered by one of the mods at the nvidia dev forum, who claims that it launches one thread per column, which I don’t

As a first step, I tried finding, for each column of arr, the first row whose value is greater than 0.5, with size(arr) hard-coded:

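function kernel(arr::CuDeviceArray, result::CuDeviceArray)
    thresh = Float32(0.5)
    nrow = Int32(3)
    ncol = Int32(100)
    # global 1-based thread index; each thread handles one column
    idx = threadIdx().x + blockDim().x * (blockIdx().x - Int32(1))
    if idx <= ncol
        res = Int32(-1)
        i = Int32(1)
        # walk down column idx until the first value above thresh
        while i <= nrow && res == Int32(-1)
            if arr[i, idx] > thresh
                res = i
            end
            i += Int32(1)
        end
        result[idx] = res   # stays -1 if no row in the column matched
    end
    return nothing
end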
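
For checking the kernel's output, here is a minimal CPU sketch of what I expect it to compute. The name first_above_cpu is made up for illustration and is not part of CUDA.jl; -1 marks a column with no match, mirroring the sentinel used in the kernel.

```julia
# CPU reference: for each column, the index of the first row whose value exceeds thresh.
# `first_above_cpu` is a hypothetical helper name used only for this sketch.
function first_above_cpu(arr::AbstractMatrix, thresh)
    result = fill(Int32(-1), size(arr, 2))   # -1 means "no match in this column"
    for col in axes(arr, 2)
        row = findfirst(>(thresh), view(arr, :, col))
        if row !== nothing
            result[col] = Int32(row)
        end
    end
    return result
end

sample = Float32[0.1 0.9 0.2;
                 0.7 0.3 0.4;
                 0.8 0.6 0.1]
first_above_cpu(sample, 0.5f0)   # returns Int32[2, 1, -1]
```

Once the kernel runs, comparing Array(rslt) against first_above_cpu(Array(check), 0.5f0) should show whether the GPU version agrees.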

I then launch it with nthreads set to the number of rows and nblocks set to the number of columns, so that all the columns get processed in parallel.
This gives a strange error:


check = CUDA.rand(3, 100)
nthreads = size(check, 1)
nblocks = size(check, 2)
rslt = CUDA.fill(Int32(0), size(check, 2))
@cuda threads=nthreads blocks=nblocks kernel(check, rslt)

ERROR: MethodError: no method matching kernel(::CuDeviceArray{Float32,2,CUDA.AS.Global}, ::CuDeviceArray{Int32,1,CUDA.AS.Global})

Obviously the CuArray arguments to kernel() are getting passed as CuDeviceArray with ptr::CUDA.AS.Global and CUDA.jl wanted some other value for the ptr parameter.

I’m now stuck on what the ptr parameter means and how CuArray becomes CuDeviceArray.
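
One way to look at the conversion directly is sketched below. It rests on my reading of the CUDA.jl documentation, namely that @cuda passes each argument through cudaconvert before the kernel sees it, and that the third type parameter of CuDeviceArray is an address space (CUDA.AS.Global would then mean global device memory). Both points are assumptions on my part, and the exact printed types depend on the CUDA.jl version.

```julia
using CUDA

# Sketch: inspect what a CuArray turns into on its way to a kernel.
# Assumption: `cudaconvert` is the conversion that `@cuda` applies to each argument.
if CUDA.functional()                      # skip on machines without a usable GPU
    host_side = CUDA.rand(3, 100)         # a CuArray{Float32,2,...}
    device_side = cudaconvert(host_side)  # the device-side counterpart
    println(typeof(host_side))
    println(typeof(device_side))          # remaining type parameters vary by CUDA.jl version
    @show device_side isa CuDeviceArray
end
```

If the printed device-side type matches the one in the MethodError, then the conversion itself is working and the problem lies in how the kernel method is being looked up.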