Hey Julianners,
I don’t know what I’m doing wrong. I use a configurator that calculates the ideal kernel launch configuration, the same one used here: https://discourse.julialang.org/t/the-most-general-way-to-estimate-the-optimal-arguments-for-cuda-macro/39342/6. I know a stride is something we could use, but when I checked the config value, the configurator doesn’t suggest any stride in this case.
using CUDA, BenchmarkTools

# Pick threads/blocks for B work items from the occupancy-based suggestion.
function configurator(B, kernel)
    config = launch_configuration(kernel.fun)
    threads = min(B, config.threads)
    blocks = cld(B, threads)
    return threads, blocks
end
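# Each thread averages one contiguous chunk of 1000 elements of Y into X[I].
mean_by_1000(X,Y) = @inbounds begin
    I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    I > size(X,1) && return
    cY = CUDA.Const(Y)  # mark Y as read-only
    GI = (I-1)*1000     # offset of this thread's chunk
    tmp = 0f0
    for i in 1:1000
        tmp += cY[GI+i]
    end
    X[I] = tmp / 1000f0
    return  # a kernel must return nothing
end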
N = 16000; A = CUDA.randn(Float32, N); B = CUDA.randn(Float32, N*1000);
fb_th, fb_blk = configurator(N, @cuda launch=false mean_by_1000(A,B))
print("Threads: $fb_th, blocks: $fb_blk (cudaauto config) ")
@btime CUDA.@sync @cuda threads=fb_th blocks=fb_blk mean_by_1000($A,$B)
nth=1024; print("Threads: $nth, blocks: $(cld(N, nth)) (manual config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth) mean_by_1000($A,$B))
nth=768 ; print("Threads: $nth, blocks: $(cld(N, nth)) (manual config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth) mean_by_1000($A,$B))
nth=512 ; print("Threads: $nth, blocks: $(cld(N, nth)) (manual config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth) mean_by_1000($A,$B))
nth=256 ; print("Threads: $nth, blocks: $(cld(N, nth)) (manual config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth) mean_by_1000($A,$B))
nth=128 ; print("Threads: $nth, blocks: $(cld(N, nth)) (manual config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth) mean_by_1000($A,$B))
nth=64 ; print("Threads: $nth, blocks: $(cld(N, nth)) (manual config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth) mean_by_1000($A,$B))
nth=32 ; print("Threads: $nth, blocks: $(cld(N, nth)) (manual config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth) mean_by_1000($A,$B))
nth=16 ; print("Threads: $nth, blocks: $(cld(N, nth)) (manual config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth) mean_by_1000($A,$B))
nth=8 ; print("Threads: $nth, blocks: $(cld(N, nth)) (manual config) "); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N, nth) mean_by_1000($A,$B))
Results on an NVIDIA RTX 3090:
Threads: 768, blocks: 21 (**cudaauto** config) 254.544 μs (296 allocations: 9.42 KiB)
Threads: 1024, blocks: 16 (manual config) 338.053 μs (384 allocations: 12.17 KiB)
Threads: 768, blocks: 21 (manual config) 254.754 μs (394 allocations: 12.48 KiB)
Threads: 512, blocks: 32 (manual config) 231.680 μs (328 allocations: 10.42 KiB)
Threads: 256, blocks: 63 (manual config) 200.551 μs (126 allocations: 4.11 KiB)
Threads: 128, blocks: 125 (manual config) 191.253 μs (338 allocations: 10.73 KiB)
Threads: 64, blocks: 250 (manual config) 188.438 μs (216 allocations: 6.92 KiB)
Threads: 32, blocks: 500 (manual config) 184.962 μs (312 allocations: 9.92 KiB)
Threads: 16, blocks: 1000 (manual config) 191.474 μs (245 allocations: 7.81 KiB)
Threads: 8, blocks: 2000 (manual config) 159.934 μs (259 allocations: 8.25 KiB)
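
To double-check that the block counts in this list follow from the thread counts alone, here is a small CPU-only sketch of the same arithmetic the configurator does. `launch_shape` is a hypothetical helper name, and the thread limit is passed in as a plain integer instead of being queried from `launch_configuration`, so no GPU is needed to run it.

```julia
# CPU-only sketch of the configurator arithmetic.
# `launch_shape` is a hypothetical name; `max_threads` stands in for
# the value that launch_configuration would report (768 in the run above).
function launch_shape(n, max_threads)
    threads = min(n, max_threads)   # never more threads than work items
    blocks = cld(n, threads)        # ceiling division so every item is covered
    return threads, blocks
end

N = 16000
for nth in (1024, 768, 512, 256, 128, 64, 32, 16, 8)
    threads, blocks = launch_shape(N, nth)
    idle = threads * blocks - N     # threads in the last block that exit early
    println("threads = ", threads, ", blocks = ", blocks, ", idle threads = ", idle)
end
```

The only thing that changes between the runs is how the same 16000 work items are cut into blocks, so any timing difference comes from that split and not from the amount of work.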
So it is really interesting to see that the launch configuration does matter: the fastest manual setting (8 threads, 159.934 μs) is about 1.6 times faster than the suggested one (768 threads, 254.544 μs). Why didn’t launch_configuration(kernel.fun) find the best configuration? Maybe it is conserving resources or maximising something other than runtime, and this is the best configuration by that measure, or am I just doing something wrong? I guess this choice becomes more and more important as the kernel grows.
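
One thing I can at least check on paper is how many blocks each configuration launches compared with the number of streaming multiprocessors (SMs) on the card. This is only a guess at an explanation, not a confirmed one. The sketch below assumes 82 SMs, which is the published figure for the RTX 3090; `ASSUMED_SM_COUNT` and `sm_coverage` are hypothetical names, and the result is only an upper bound on how many SMs can receive a block.

```julia
# Assumption: 82 SMs, the published figure for the RTX 3090.
const ASSUMED_SM_COUNT = 82

# Upper bound on the fraction of SMs that can receive at least one block.
# It says nothing about how the hardware scheduler actually places blocks.
sm_coverage(blocks, sms = ASSUMED_SM_COUNT) = min(blocks, sms) / sms

N = 16000
for nth in (1024, 768, 512, 256, 128, 64, 32, 16, 8)
    blocks = cld(N, nth)
    pct = round(100 * sm_coverage(blocks); digits = 1)
    println("threads = $nth, blocks = $blocks, SMs that can get a block <= $pct %")
end
```

Under that assumption, the 768-thread configuration launches only 21 blocks, so at most 21 of the 82 SMs can have work, while every configuration with 128 threads or fewer launches more blocks than there are SMs. That would fit the timings flattening out below 256 threads, but it does not explain why 8 threads is fastest. On a real device, I believe the SM count can be queried with `CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT)` instead of hard-coding it.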