CuArray/CUDAnative argmin paradoxical performance

Hey there,
I have a kernel which computes some 3D CuArray of float32s, and I’d like to run argmin on one of the axes, resulting in a 2D set of indices. For some reason, if I invoke CUDAnative.argmin on the original host array directly, it is extremely fast, but if I run it on the device copy, it is prohibitively slow (slower than simply using the CPU, or a manual loop inside a GPU kernel). Below is an MWE and the output I get.

So my question is, is this possibly a bug? If not, why isn’t using the array already on the device faster?

using CUDAdrv, CUDAnative, CuArrays

function main()
	# iterate so the JIT warms up
	for iteration in 1:3
		println("\niteration ", iteration)
		input = rand(4, 4, 1000)
		cpu_min = @time argmin(input, dims=3)
		cu_in = CuArray(input)

		dt = CUDAdrv.@elapsed begin
			# this should be uploading and then doing argmin
			gpu_min = CUDAnative.argmin(input, dims=3)
		end
		println(dt)
		@assert gpu_min == cpu_min

		dt = CUDAdrv.@elapsed begin
			# `cu_in` is already on the device?
			gpu_min = CUDAnative.argmin(cu_in, dims=3)
		end
		println(dt)

		@assert gpu_min == cpu_min
	end
end

main()

Output:

iteration 1
  0.135498 seconds (19.47 k allocations: 884.571 KiB)
1.824e-6
6.534424

iteration 2
  0.000183 seconds (12 allocations: 1.016 KiB)
2.4e-6
1.3460175

iteration 3
  0.000180 seconds (12 allocations: 1.016 KiB)
2.432e-6
1.3351139

Thanks,
Alex

There is no CUDAnative.argmin; that call just resolves to Base.argmin. CUDAnative is for kernel programming, whereas you are using array abstractions here, i.e. functionality that is implemented in CuArrays. But argmin isn’t optimized there, so you end up with the Base implementation. That also explains the "paradox": calling CUDAnative.argmin on the host array runs entirely on the CPU, so CUDAdrv.@elapsed (which times GPU events) measures almost nothing, while on the CuArray the Base implementation performs a slow scalar getindex for every single element.

As such, you should take the usual first step when optimizing CuArrays code: disallow scalar iteration, so these fallbacks raise an error instead of running silently:

julia> argmin(cu(rand(2,2,2)), dims=3)
ERROR: scalar getindex is disallowed
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] assertscalar at /home/tbesard/Julia/GPUArrays/src/indexing.jl:8 [inlined]
 [3] getindex(::CuArray{Float32,3}, ::Int64) at /home/tbesard/Julia/GPUArrays/src/indexing.jl:44
 [4] _findmin at ./abstractarray.jl:270 [inlined]
 [5] #findmin#575 at ./reducedim.jl:764 [inlined]
 [6] #findmin at ./none:0 [inlined]
 [7] #argmin#578(::Int64, ::Function, ::CuArray{Float32,3}) at ./reducedim.jl:853
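For reference, a minimal way to surface these fallbacks in your own session, assuming the GPUArrays-backed CuArrays API shown in the stack trace above:

```julia
using CuArrays

# Disallow scalar indexing on device arrays; any element-wise
# fallback (like the one in Base's _findmin) now throws instead
# of silently doing one GPU read per element.
CuArrays.allowscalar(false)

A = cu(rand(2, 2, 2))
argmin(A, dims=3)   # errors, revealing the scalar fallback
```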

I haven’t looked at the problem in depth, but some of the lower-level kernels in the stack trace (e.g. findmin) should probably be implemented in CuArrays, on top of the GPU mapreduce functionality there.
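The mapreduce-based approach hinted at above can be sketched as follows (plain CPU Julia, single vector only, no dims handling; the `minpair` and `findmin_mapreduce` names are mine, not a CuArrays API):

```julia
# Reduce (value, index) pairs with a min-by-value operator.
# Because this is an associative reduction, it maps naturally
# onto the parallel mapreduce machinery a GPU backend provides.
minpair(a, b) = a[1] <= b[1] ? a : b

function findmin_mapreduce(A::AbstractVector)
    pairs = ((A[i], i) for i in eachindex(A))
    reduce(minpair, pairs)
end

v = [3.0, 1.0, 2.0]
findmin_mapreduce(v)  # (1.0, 2)
```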

Cheers