CuArray/CUDAnative argmin paradoxical performance

Alex_Ellison · January 28, 2019, 8:25pm

Hey there,
I have a kernel which computes a 3D CuArray of Float32 values, and I’d like to run argmin along one of the axes, resulting in a 2D set of indices. For some reason, if I invoke CUDAnative.argmin on the original host array directly, it is extremely fast, but if I run it on the device copy, it is prohibitively slow (slower than simply using the CPU, or a manual loop inside a GPU kernel). Below is an MWE and the output I get.

So my question is, is this possibly a bug? If not, why isn’t using the array already on the device faster?

using CUDAdrv, CUDAnative, CuArrays

function main()
    # iterate so JIT warms up
    for iteration in 1:3
        println("\niteration ", iteration)
        input = rand(4, 4, 1000)
        cpu_min = @time argmin(input, dims=3)
        cu_in = CuArray(input)

        dt = CUDAdrv.@elapsed begin
            # this should be uploading and then doing argmin
            gpu_min = CUDAnative.argmin(input, dims=3)
        end
        println(dt)
        @assert gpu_min == cpu_min

        dt = CUDAdrv.@elapsed begin
            # `cu_in` is already on device?
            gpu_min = CUDAnative.argmin(cu_in, dims=3)
        end
        println(dt)

        @assert gpu_min == cpu_min
    end
end

main()

Output:

iteration 1
  0.135498 seconds (19.47 k allocations: 884.571 KiB)
1.824e-6
6.534424

iteration 2
  0.000183 seconds (12 allocations: 1.016 KiB)
2.4e-6
1.3460175

iteration 3
  0.000180 seconds (12 allocations: 1.016 KiB)
2.432e-6
1.3351139

Thanks,
Alex

maleadt · January 29, 2019, 7:01am

There is no CUDAnative.argmin; the name just resolves to Base.argmin. CUDAnative is for kernel programming, whereas you are only using array abstractions here, i.e. functionality that is implemented in CuArrays. But argmin isn’t optimized there, so you end up running the generic Base implementation.
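
To illustrate the name-resolution point without needing a GPU, here is a minimal sketch. `NoArgminHere` is a hypothetical stand-in module, not anything from CUDAnative; the assumption is only that it, like the module discussed above, defines no `argmin` of its own.

```julia
# Hypothetical module that defines no `argmin` of its own.
module NoArgminHere
end

# Every module implicitly does `using Base`, so the qualified name
# resolves to Base's function rather than raising an error.
@show NoArgminHere.argmin === Base.argmin

# Calling it on a host Array therefore runs the ordinary CPU method.
x = rand(4, 4, 10)
@show NoArgminHere.argmin(x, dims=3) == argmin(x, dims=3)
```

The module qualifier does not select a GPU implementation; which method runs depends only on the argument type.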

As such, you should follow the usual steps for optimizing CuArrays code. The first is to disable scalar iteration, which turns the slow element-by-element fallback into an error:

julia> argmin(cu(rand(2,2,2)), dims=3)
ERROR: scalar getindex is disallowed
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] assertscalar at /home/tbesard/Julia/GPUArrays/src/indexing.jl:8 [inlined]
 [3] getindex(::CuArray{Float32,3}, ::Int64) at /home/tbesard/Julia/GPUArrays/src/indexing.jl:44
 [4] _findmin at ./abstractarray.jl:270 [inlined]
 [5] #findmin#575 at ./reducedim.jl:764 [inlined]
 [6] #findmin at ./none:0 [inlined]
 [7] #argmin#578(::Int64, ::Function, ::CuArray{Float32,3}) at ./reducedim.jl:853
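
For reference, a session like the one above can probably be reproduced as follows. This is a sketch assuming a CuArrays version that provides `CuArrays.allowscalar`, and it needs a working CUDA GPU to run.

```julia
using CuArrays

# Assumption: this CuArrays version provides `allowscalar`.
# With `false`, scalar indexing of a GPU array throws instead of
# silently falling back to slow element-by-element access.
CuArrays.allowscalar(false)

x = cu(rand(2, 2, 2))
argmin(x, dims=3)   # expected to fail with the scalar getindex error
```
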

I haven’t looked at the problem in depth, but some of the lower-level functions in the stack trace (e.g. findmin) should probably be implemented in CuArrays, on top of the GPU mapreduce functionality there.
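
Until something like that exists, one possible workaround is to express the index search with broadcast and reductions only, so that no scalar indexing is involved. The sketch below is not CuArrays' implementation: `argmin_dim3` is a hypothetical helper, it returns plain integer indices along the third axis rather than `CartesianIndex` values, and it assumes the input contains no NaNs. It is checked here on a host `Array`; whether the same code runs efficiently on a `CuArray` depends on the installed package version.

```julia
# Hypothetical helper: argmin along dims=3 using only broadcast and `minimum`.
function argmin_dim3(x::AbstractArray{T,3}) where {T}
    n = size(x, 3)
    m = minimum(x, dims=3)              # minimum value for each (i, j)
    k = reshape(1:n, 1, 1, :)           # position along the third axis
    # Keep the position where the element equals the minimum,
    # otherwise use an out-of-range sentinel (n + 1).
    candidates = ifelse.(x .== m, k, n + 1)
    return minimum(candidates, dims=3)  # first position attaining the minimum
end

# CPU check against Base.argmin (assumes no NaNs in the data).
x = rand(Float32, 4, 4, 1000)
reference = getindex.(argmin(x, dims=3), 3)
@show argmin_dim3(x) == reference
```

This costs two reductions and one broadcast over the full array, which is more work than a fused findmin kernel would need, but every step is an array-level operation.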

Alex_Ellison · January 31, 2019, 5:35pm

Cheers
