As suggested here, I am trying to create a version of this findfirst kernel that operates over dimension 2 of the input matrix `xs` and returns the first match for each vector along dimension 1.

I think a good approach is a column-wise reduction as discussed here.

Note that this is ported from a suggestion offered by one of the mods at the NVIDIA dev forum, who claims that it launches one thread per column, which I don’t

As a first step, I tried finding the first row in each column of `arr` with a value > 0.5, with `size(arr)` held constant:

```
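function kernel(arr::CuDeviceArray, result::CuDeviceArray)
    thresh = Float32(0.5)
    nrow = Int32(3)
    ncol = Int32(100)
    # 1-based global thread index; each index handles one column
    idx = threadIdx().x + blockDim().x * (blockIdx().x - 1)
    if idx <= ncol
        res = Int32(-1)   # -1 means no match found in this column
        i = Int32(1)
        while i <= nrow && res == Int32(-1)
            if arr[i, idx] > thresh
                res = i
            end
            i += Int32(1)
        end
        result[idx] = res   # write once, after the scan finishes
    end
    return nothing   # CUDA.jl kernels must not return a value
end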
```
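To have something to compare the kernel's `result` against, a plain CPU reference helps. This is only a minimal sketch, assuming 1-based row indices and `-1` as the "no match" marker, as in the kernel above; `first_above_per_column` is a hypothetical helper name, not part of CUDA.jl.

```julia
# CPU reference: for each column, the first row index whose value exceeds
# `thresh`, or -1 when no element in that column matches.
function first_above_per_column(arr::AbstractMatrix, thresh)
    out = fill(Int32(-1), size(arr, 2))
    for col in axes(arr, 2)
        hit = findfirst(>(thresh), view(arr, :, col))
        if hit !== nothing
            out[col] = Int32(hit)
        end
    end
    return out
end

# Small hand-checkable example (3 rows, 3 columns).
demo = Float32[0.1 0.9 0.2;
               0.7 0.3 0.4;
               0.8 0.6 0.1]
expected = first_above_per_column(demo, 0.5f0)
println(expected)
```

Once the kernel runs, `Array(rslt)` should equal `first_above_per_column(Array(check), 0.5f0)`.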

Now launch this setting `nthreads` to the number of rows and `nblocks` to the number of columns, so that all the columns get processed in parallel.
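To see which global indices that launch configuration produces, the index arithmetic can be modelled on the CPU. This is a rough sketch that assumes the usual 1-based formula `threadIdx + blockDim * (blockIdx - 1)`; `global_index` is a hypothetical helper, and no GPU is involved.

```julia
# Plain-Julia model of the launch grid; nothing here touches the GPU.
# Assumes the 1-based global index threadIdx + blockDim * (blockIdx - 1).
global_index(tid, bdim, bid) = tid + bdim * (bid - 1)

# Sizes taken from the launch described above: 3 rows, 100 columns.
nthreads, nblocks, ncol = 3, 100, 100

# Every (block, thread) pair in the grid mapped to its global index.
indices = [global_index(t, nthreads, b) for b in 1:nblocks for t in 1:nthreads]

# Threads whose index corresponds to an actual column.
active = count(<=(ncol), indices)

println("threads launched: ", length(indices))
println("threads with a column to process: ", active)
```

Under that assumption the grid has one thread per matrix element rather than one per column, so a bounds check on the index is what keeps the extra threads idle.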

I get a weird error:

```
check = CUDA.rand(3,100)
nthreads = size(check)[1]
nblocks = size(check)[2]
rslt = CUDA.fill(Int32(0),size(check)[2])
@cuda threads=nthreads blocks=nblocks kernel(check,rslt)
ERROR: MethodError: no method matching kernel(::CuDeviceArray{Float32,2,CUDA.AS.Global}, ::CuDeviceArray{Int32,1,CUDA.AS.Global})
```

Obviously the `CuArray` arguments to `kernel()` are getting passed as `CuDeviceArray` with `ptr::CUDA.AS.Global`, and `CUDA.jl` wanted some other value for the `ptr` parameter.

I’m now stuck on what the `ptr` parameter means and how `CuArray` becomes `CuDeviceArray`.