Hey there,
I have a kernel which computes a 3D CuArray of Float32s, and I’d like to run argmin along one of its axes, yielding a 2D array of indices. For some reason, if I invoke CUDAnative.argmin on the original host array directly, it is extremely fast, but if I run it on the device copy, it is prohibitively slow (slower than simply using the CPU, or than a manual loop inside a GPU kernel; a sketch of the latter follows below). Below are an MWE and the output I get.
So my question is: could this be a bug? And if not, why isn’t operating on the array that is already on the device faster?
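For reference, the manual GPU loop I mentioned is along these lines (a sketch rather than my exact code; it uses one thread per (i, j) pair, so it assumes the first two dimensions fit in a single block, and `cu_in` is the device array from the MWE that follows):

using CUDAnative, CuArrays

# each thread owns one (i, j) pair, scans dim 3, and records
# the index of the smallest element
function argmin_dim3_kernel(input, out)
    i = threadIdx().x
    j = threadIdx().y
    best = input[i, j, 1]
    best_k = 1
    for k in 2:size(input, 3)
        v = input[i, j, k]
        if v < best
            best = v
            best_k = k
        end
    end
    out[i, j] = best_k
    return nothing
end

out = CuArray{Int}(undef, 4, 4)
@cuda threads=(4, 4) argmin_dim3_kernel(cu_in, out)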
using CUDAdrv, CUDAnative, CuArrays
function main()
    # run a few iterations so the JIT warms up
    for iteration in 1:3
        println("\niteration ", iteration)
        input = rand(4, 4, 1000)
        cpu_min = @time argmin(input, dims=3)

        cu_in = CuArray(input)
        dt = CUDAdrv.@elapsed begin
            # this should upload `input` and then run argmin on the device
            gpu_min = CUDAnative.argmin(input, dims=3)
        end
        println(dt)
        @assert gpu_min == cpu_min

        dt = CUDAdrv.@elapsed begin
            # `cu_in` is already on the device, so no upload here?
            gpu_min = CUDAnative.argmin(cu_in, dims=3)
        end
        println(dt)
        @assert gpu_min == cpu_min
    end
end
main()
Output:
iteration 1
0.135498 seconds (19.47 k allocations: 884.571 KiB)
1.824e-6
6.534424
iteration 2
0.000183 seconds (12 allocations: 1.016 KiB)
2.4e-6
1.3460175
iteration 3
0.000180 seconds (12 allocations: 1.016 KiB)
2.432e-6
1.3351139
Thanks,
Alex