As suggested here, I am trying to create a version of this findfirst kernel that operates over dimension 2 of input matrix xs
and returns the first match for each vector along dimension 1.
I think a good approach is a column-wise reduction as discussed here.
Note that this is ported from a suggestion offered by one of the mods at the nvidia dev forum, who claims that it launches one thread per column, which I don’t
As a first step, I tried finding the first row of each column of arr with a value > 0.5, with size(arr) held constant:
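function kernel(arr::CuDeviceArray, result::CuDeviceArray)
    thresh = Float32(0.5)
    nrow = Int32(3)
    ncol = Int32(100)
    # 1-based global thread index; one thread handles one column
    idx = threadIdx().x + blockDim().x * (blockIdx().x - 1)
    if idx <= ncol
        res = Int32(-1)
        i = Int32(1)
        # scan down column idx until the first entry above thresh
        while i <= nrow && res == Int32(-1)
            if arr[i, idx] > thresh
                res = i
            end
            i += Int32(1)
        end
        result[idx] = res
    end
    return nothing
end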
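To have something to check the kernel's output against, here is a minimal CPU sketch of what I expect it to compute. The function name `first_match_per_column` and the `demo` matrix are made up for illustration, and it assumes 1-based row indices with -1 meaning "no match in this column".

```julia
# CPU reference (hypothetical helper, not part of CUDA.jl):
# for each column, return the 1-based row index of the first entry
# greater than `thresh`, or -1 if the column has no such entry.
function first_match_per_column(arr::AbstractMatrix, thresh)
    result = fill(Int32(-1), size(arr, 2))
    for (j, col) in enumerate(eachcol(arr))
        i = findfirst(>(thresh), col)   # `nothing` when no entry matches
        if i !== nothing
            result[j] = Int32(i)
        end
    end
    return result
end

# Small hand-checkable input: columns are [0.1, 0.7, 0.8], [0.9, 0.3, 0.6], [0.2, 0.4, 0.1]
demo = Float32[0.1 0.9 0.2;
               0.7 0.3 0.4;
               0.8 0.6 0.1]
first_match_per_column(demo, 0.5f0)   # Int32[2, 1, -1]
```

Once the kernel runs, copying both arrays back to the host with Array(...) and comparing against this function should show whether the GPU result is right.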
Now I launch this, setting nthreads to the number of rows and nblocks to the number of columns, so that all the columns get processed in parallel.
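As a sanity check on that launch configuration, the global index each thread would compute can be enumerated on the CPU. This is only a sketch in plain Julia: it assumes the usual 1-based formula threadIdx().x + blockDim().x * (blockIdx().x - 1), and the variable names are mine.

```julia
# CPU-only sketch: enumerate the 1-based global index for every
# (block, thread) pair of a launch with 3 threads per block and 100 blocks.
nthreads, nblocks = 3, 100
global_idx = [t + nthreads * (b - 1) for b in 1:nblocks for t in 1:nthreads]

total_threads = length(global_idx)           # 300 threads are launched in total
column_threads = count(<=(100), global_idx)  # 100 of them map to a column index
```

So with one thread per column, this configuration launches more threads than there are columns, and the bounds check on idx in the kernel is what keeps the extra ones idle.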
I get a weird error:
check = CUDA.rand(3,100)
nthreads = size(check)[1]
nblocks = size(check)[2]
rslt = CUDA.fill(Int32(0),size(check)[2])
@cuda threads=nthreads blocks=nblocks kernel(check,rslt)
ERROR: MethodError: no method matching kernel(::CuDeviceArray{Float32,2,CUDA.AS.Global}, ::CuDeviceArray{Int32,1,CUDA.AS.Global})
Obviously the CuArray arguments to kernel() are getting passed as CuDeviceArray with ptr::CUDA.AS.Global, and CUDA.jl wanted some other value for the ptr parameter.
I’m now stuck on what the ptr parameter means and how CuArray becomes CuDeviceArray.