CUDA kernel configuration

Hey Julianners,

I don’t know what I am doing wrong, but I use a configurator that calculates the ideal kernel config, following the approach used here: https://discourse.julialang.org/t/the-most-general-way-to-estimate-the-optimal-arguments-for-cuda-macro/39342/6. I know a strided loop is something we could use, but the configurator doesn’t suggest any stride in this case when I check the config value.

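using CUDA, BenchmarkTools

# Pick a launch configuration for B work items from the occupancy API.
function configurator(B, kernel)
	config = launch_configuration(kernel.fun)
	threads = Base.min(B, config.threads)  # as many threads per block as suggested
	blocks = cld(B, threads)               # enough blocks to cover all B items
	return threads, blocks
end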
mean_by_1000(X,Y) = @inbounds begin 
	I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
	I > size(X,1) && return
	cY = CUDA.Const(Y)
	GI = (I-1)*1000
	tmp = 0f0
	for i in 1:1000
		tmp += cY[GI+i]
	end
	X[I] = tmp / 1000f0
	return
end
N = 16000; A = CUDA.randn(Float32, N); B = CUDA.randn(Float32, N*1000); 

fb_th, fb_blk = configurator(N, @cuda launch=false mean_by_1000(A,B))
print("Threads: $fb_th, blocks: $fb_blk  (cudaauto config) ")
@btime CUDA.@sync @cuda threads=fb_th blocks=fb_blk  mean_by_1000($A,$B)

nth=1024; print("Threads: $nth, blocks: $(cld(N, nth))  (manual   config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth)  mean_by_1000($A,$B))
nth=768 ; print("Threads: $nth, blocks: $(cld(N, nth))  (manual   config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth)  mean_by_1000($A,$B))
nth=512 ; print("Threads: $nth, blocks: $(cld(N, nth))  (manual   config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth)  mean_by_1000($A,$B))
nth=256 ; print("Threads: $nth, blocks: $(cld(N, nth))  (manual   config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth)  mean_by_1000($A,$B))
nth=128 ; print("Threads: $nth, blocks: $(cld(N, nth))  (manual   config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth)  mean_by_1000($A,$B))
nth=64  ; print("Threads: $nth, blocks: $(cld(N, nth))  (manual   config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth)  mean_by_1000($A,$B))
nth=32  ; print("Threads: $nth, blocks: $(cld(N, nth))  (manual   config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth)  mean_by_1000($A,$B))
nth=16  ; print("Threads: $nth, blocks: $(cld(N, nth))  (manual   config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth)  mean_by_1000($A,$B))
nth=8   ; print("Threads: $nth, blocks: $(cld(N, nth))  (manual   config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth)  mean_by_1000($A,$B))

Results on a Nvidia 3090:

Threads: 768, blocks: 21  (cudaauto config)   254.544 μs (296 allocations: 9.42 KiB)
Threads: 1024, blocks: 16  (manual   config)   338.053 μs (384 allocations: 12.17 KiB)
Threads: 768, blocks: 21  (manual   config)   254.754 μs (394 allocations: 12.48 KiB)
Threads: 512, blocks: 32  (manual   config)   231.680 μs (328 allocations: 10.42 KiB)
Threads: 256, blocks: 63  (manual   config)   200.551 μs (126 allocations: 4.11 KiB)
Threads: 128, blocks: 125  (manual   config)   191.253 μs (338 allocations: 10.73 KiB)
Threads: 64, blocks: 250  (manual   config)   188.438 μs (216 allocations: 6.92 KiB)
Threads: 32, blocks: 500  (manual   config)   184.962 μs (312 allocations: 9.92 KiB)
Threads: 16, blocks: 1000  (manual   config)   191.474 μs (245 allocations: 7.81 KiB)
Threads: 8, blocks: 2000  (manual   config)   159.934 μs (259 allocations: 8.25 KiB)

So it is really interesting to see that the config actually does matter. Why didn’t launch_configuration(kernel.fun) find the best configuration? Maybe it is sparing with resources and this is the best trade-off, or am I just doing something wrong? I guess this config becomes more and more important as the kernel size increases.

Bests,
Marcell

You are ignoring the minimum suggested number of blocks (the grid size) returned by the occupancy API. For some kernels, it is more important to ensure you’re launching that many blocks rather than first maximizing the size of each block.

What do you mean by that?

Wow, I really didn’t know that the launch_configuration object included information on the minimum block count. Is this what is returned by config.blocks?

So that I understand it: in general I want to use all blocks, and possibly reduce the threads-per-block count to ensure this happens? And if I would need many more blocks than this API returns, it is preferable to loop within the blocks? Is this correct?

Yes.
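
To make the "reduce threads per block" part concrete, here is a minimal sketch in plain Julia (no GPU needed). `balanced_config` is a hypothetical helper, not part of CUDA.jl; its `max_threads` and `min_blocks` arguments are meant to be `config.threads` and `config.blocks` from `launch_configuration(kernel.fun)`. The value 164 below is made up for illustration, not something measured on the 3090.

```julia
# Hypothetical helper: shrink the block size so that the grid has at least
# `min_blocks` blocks whenever the problem size `n` allows it.
function balanced_config(n::Integer, max_threads::Integer, min_blocks::Integer;
                         warp::Integer = 32)
    # Largest block size that still yields `min_blocks` blocks ...
    per_block = fld(n, min_blocks)
    # ... rounded down to whole warps, kept between one warp and the
    # thread limit suggested by the occupancy API.
    threads = clamp(fld(per_block, warp) * warp, warp, max_threads)
    blocks = cld(n, threads)
    return threads, blocks
end

# N = 16000 and 768 threads come from the first post; 164 is an assumed value.
threads, blocks = balanced_config(16000, 768, 164)
println("threads = $threads, blocks = $blocks")
```

Rounding down to whole warps is a design choice of this sketch. The timings above suggest that even smaller blocks can be faster for this particular kernel, so treat the result as a starting point for benchmarking rather than an optimum.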

That depends on the exact kernel characteristics. If your kernel maxes out certain resources (e.g. bandwidth) with fewer blocks, you don’t need to launch more of them. The occupancy API doesn’t know about that, so you need to interpret its results.
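
For the "loop within the blocks" part, a grid-stride version of the kernel from the first post could look roughly like the sketch below. The name `mean_by_1000_strided` and the choice to cap the grid at `config.blocks` are assumptions made for illustration; as noted above, whether capping helps depends on the kernel, so it should be benchmarked against the uncapped launch.

```julia
using CUDA

# Grid-stride variant: each thread handles rows i, i + stride, i + 2*stride, ...
# so the kernel covers all of X for any grid size.
function mean_by_1000_strided(X, Y)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    i = index
    @inbounds while i <= length(X)
        offset = (i - 1) * 1000
        acc = 0f0
        for j in 1:1000
            acc += Y[offset + j]
        end
        X[i] = acc / 1000f0
        i += stride
    end
    return nothing
end

N = 16000
A = CUDA.zeros(Float32, N)
B = CUDA.randn(Float32, N * 1000)

kernel = @cuda launch=false mean_by_1000_strided(A, B)
config = launch_configuration(kernel.fun)
threads = min(N, config.threads)
# Assumption of this sketch: launch at most the suggested number of blocks
# and let the stride loop pick up the remaining rows.
blocks = min(cld(N, threads), config.blocks)
CUDA.@sync kernel(A, B; threads, blocks)
```

Because the loop bound depends only on `length(X)`, the same kernel can be timed with any `threads`/`blocks` pair, which makes it easy to compare the suggested configuration against manual ones.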