I have a kernel that computes a 3D CuArray of Float32s, and I’d like to run argmin along one of its axes, producing a 2D array of indices. For some reason, if I invoke CUDAnative.argmin on the original host array directly, it is extremely fast, but if I run it on the device copy, it is prohibitively slow (slower than simply using the CPU, or than a manual loop inside a GPU kernel). Below is an MWE and the output I get.
So my question is: is this possibly a bug? And if not, why isn’t using the array that is already on the device faster?
```julia
using CUDAdrv, CUDAnative, CuArrays

function main()
    # iterate so JIT warms up
    for iteration in 1:3
        println("\niteration ", iteration)
        input = rand(4, 4, 1000)
        cpu_min = @time argmin(input, dims=3)
        cu_in = CuArray(input)
        dt = CUDAdrv.@elapsed begin
            # this should be uploading and then doing argmin
            gpu_min = CUDAnative.argmin(input, dims=3)
        end
        println(dt)
        @assert gpu_min == cpu_min
        dt = CUDAdrv.@elapsed begin
            # `cu_in` is already on device?
            gpu_min = CUDAnative.argmin(cu_in, dims=3)
        end
        println(dt)
        @assert gpu_min == cpu_min
    end
end

main()
```
```
iteration 1
  0.135498 seconds (19.47 k allocations: 884.571 KiB)
1.824e-6
6.534424

iteration 2
  0.000183 seconds (12 allocations: 1.016 KiB)
2.4e-6
1.3460175

iteration 3
  0.000180 seconds (12 allocations: 1.016 KiB)
2.432e-6
1.3351139
```
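For reference, this is roughly what I mean by "a manual loop inside a GPU kernel" — a hedged sketch, not the exact code I benchmarked. It assumes CUDAnative's `@cuda`/`threadIdx` kernel API and the 4×4×1000 shape from the MWE; `argmin3_kernel!` and `out` are names I made up for illustration:

```julia
# One thread per (i, j) pair scans along the third axis
# and records the index of the minimum.
function argmin3_kernel!(out, input)
    i = threadIdx().x
    j = threadIdx().y
    best = input[i, j, 1]
    best_k = 1
    for k in 2:size(input, 3)
        v = input[i, j, k]
        if v < best
            best = v
            best_k = k
        end
    end
    out[i, j] = best_k
    return nothing
end

# Launch with one 4×4 thread block, matching the MWE's first two dims:
# out = CuArray{Int32}(undef, 4, 4)
# @cuda threads=(4, 4) argmin3_kernel!(out, cu_in)
```

Even this naive per-(i, j) loop finishes far faster than the `CUDAnative.argmin(cu_in, dims=3)` call above, which is what makes the slowdown so surprising.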