CUDA kernel configuration

Marcell_Havlik · March 24, 2022, 4:41pm

Hey Julianners,

I’m not sure what I’m doing wrong. I use a configurator that calculates the ideal kernel config, taken from here: https://discourse.julialang.org/t/the-most-general-way-to-estimate-the-optimal-arguments-for-cuda-macro/39342/6. I know a stride is something we could use, but the configurator doesn’t suggest any stride in this case; I checked the config value.

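using CUDA, BenchmarkTools

# Take the largest block size the occupancy API suggests,
# then launch enough blocks to cover all B elements.
function configurator(B, kernel)
	config = launch_configuration(kernel.fun)
	threads = min(B, config.threads)
	blocks = cld(B, threads)
	return threads, blocks
end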
mean_by_1000(X,Y) = @inbounds begin 
	I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
	I > size(X,1) && return
	cY = CUDA.Const(Y)
	GI = (I-1)*1000
	tmp = 0f0
	for i in 1:1000
		tmp += cY[GI+i]
	end
	X[I] = tmp / 1000f0
	return
end
N = 16000; A = CUDA.randn(Float32, N); B = CUDA.randn(Float32, N*1000); 

fb_th, fb_blk = configurator(N, @cuda launch=false mean_by_1000(A,B))
print("Threads: $fb_th, blocks: $fb_blk  (cudaauto config) ")
@btime CUDA.@sync @cuda threads=fb_th blocks=fb_blk  mean_by_1000($A,$B)

nth=1024; print("Threads: $nth, blocks: $(cld(N, nth))  (manual   config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth)  mean_by_1000($A,$B))
nth=768 ; print("Threads: $nth, blocks: $(cld(N, nth))  (manual   config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth)  mean_by_1000($A,$B))
nth=512 ; print("Threads: $nth, blocks: $(cld(N, nth))  (manual   config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth)  mean_by_1000($A,$B))
nth=256 ; print("Threads: $nth, blocks: $(cld(N, nth))  (manual   config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth)  mean_by_1000($A,$B))
nth=128 ; print("Threads: $nth, blocks: $(cld(N, nth))  (manual   config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth)  mean_by_1000($A,$B))
nth=64  ; print("Threads: $nth, blocks: $(cld(N, nth))  (manual   config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth)  mean_by_1000($A,$B))
nth=32  ; print("Threads: $nth, blocks: $(cld(N, nth))  (manual   config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth)  mean_by_1000($A,$B))
nth=16  ; print("Threads: $nth, blocks: $(cld(N, nth))  (manual   config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth)  mean_by_1000($A,$B))
nth=8   ; print("Threads: $nth, blocks: $(cld(N, nth))  (manual   config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth)  mean_by_1000($A,$B))

Results on a Nvidia 3090:

Threads: 768, blocks: 21  (cudaauto config)   254.544 μs (296 allocations: 9.42 KiB)
Threads: 1024, blocks: 16  (manual   config)   338.053 μs (384 allocations: 12.17 KiB)
Threads: 768, blocks: 21  (manual   config)   254.754 μs (394 allocations: 12.48 KiB)
Threads: 512, blocks: 32  (manual   config)   231.680 μs (328 allocations: 10.42 KiB)
Threads: 256, blocks: 63  (manual   config)   200.551 μs (126 allocations: 4.11 KiB)
Threads: 128, blocks: 125  (manual   config)   191.253 μs (338 allocations: 10.73 KiB)
Threads: 64, blocks: 250  (manual   config)   188.438 μs (216 allocations: 6.92 KiB)
Threads: 32, blocks: 500  (manual   config)   184.962 μs (312 allocations: 9.92 KiB)
Threads: 16, blocks: 1000  (manual   config)   191.474 μs (245 allocations: 7.81 KiB)
Threads: 8, blocks: 2000  (manual   config)   159.934 μs (259 allocations: 8.25 KiB)

So it is really interesting to see that the config actually matters a lot. Why didn’t launch_configuration(kernel.fun) find the best configuration? Maybe it spares resources and tries to maximise something else, and this is the best it can do, or am I just doing something wrong? I guess this config becomes more and more important as the function size increases.

Bests,
Marcell

maleadt · March 25, 2022, 12:05pm

You are ignoring the minimum suggested number of blocks returned by the occupancy API. For some kernels, it is more important to ensure you’re launching that many blocks than to first maximize the size of each block.

What do you mean by that?
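
To make the point about the block count concrete, here is a minimal sketch of a configurator that also looks at the suggested number of blocks. It is plain Julia so it runs without a GPU: `balanced_config` is a hypothetical helper, `config` stands in for the named tuple that `launch_configuration` returns, and the value `blocks = 164` is an assumed figure for illustration, not something measured in this thread.

```julia
# Sketch: choose a launch configuration that honours the suggested minimum
# block count. `config` mimics the (blocks=..., threads=...) named tuple from
# CUDA.launch_configuration; the numbers used below are assumed, not measured.
function balanced_config(n::Integer, config; warp::Integer = 32)
    threads = min(n, config.threads)
    blocks = cld(n, threads)
    if blocks < config.blocks
        # Too few blocks: shrink the block size, aiming for at least
        # config.blocks blocks, rounded down to a whole number of warps
        # (but never below one warp or above n).
        threads = max(warp, fld(fld(n, config.blocks), warp) * warp)
        threads = min(threads, n)
        blocks = cld(n, threads)
    end
    return threads, blocks
end

assumed = (blocks = 164, threads = 768)   # hypothetical occupancy result
threads, blocks = balanced_config(16_000, assumed)
println("threads = $threads, blocks = $blocks")
```

With N = 16000 the original configurator yields 21 blocks of 768 threads, while this sketch trades block size for block count. Whether that is actually faster still depends on the kernel, so it is worth benchmarking both.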

torrance · March 26, 2022, 4:28am

Wow, I very much didn’t know that the launch_configuration object included information on the minimum block count. Is this what is returned by config.blocks?

Just so I understand: in general I want to use all blocks, and possibly reduce the threads-per-block count to ensure this happens? And if I have many more blocks than this API returns, it is preferable to loop within the blocks? Is this correct?

maleadt · March 28, 2022, 7:54am

Yes.

That depends on the exact kernel characteristics. If your kernel maxes out certain resources (e.g. bandwidth) with fewer blocks, you don’t need to launch more of them. The occupancy API doesn’t know about that, so you need to interpret its results.
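
The "loop within the blocks" idea is usually written as a grid-stride loop. The sketch below emulates it on the CPU so the indexing can be checked without a GPU; `grid_stride_means` is a hypothetical helper, and the comments note which CUDA.jl intrinsics would supply the same values inside a real kernel.

```julia
# CPU emulation of a grid-stride loop: each simulated thread starts at its
# global index and advances by the total number of launched threads.
# In a CUDA.jl kernel the two quantities would be
#   start  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
#   stride = gridDim().x * blockDim().x
function grid_stride_means(Y::Vector{Float32}, n::Integer, chunk::Integer,
                           threads::Integer, blocks::Integer)
    X = zeros(Float32, n)
    stride = threads * blocks
    for block in 1:blocks, thread in 1:threads      # runs in parallel on a GPU
        start = (block - 1) * threads + thread
        for I in start:stride:n                     # the grid-stride loop
            offset = (I - 1) * chunk
            X[I] = sum(@view Y[offset+1:offset+chunk]) / chunk
        end
    end
    return X
end

Y = Float32.(1:12)                       # 4 outputs, each the mean of 3 inputs
X = grid_stride_means(Y, 4, 3, 2, 1)     # only 2 "threads" for 4 outputs
println(X)
```

Because every thread keeps stepping until it runs past `n`, the launch size can be chosen from the occupancy result alone rather than from the problem size.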
