# Writing fast stencil computation kernels that work on both CPUs and GPUs

Hi everyone, I am developing a 3D finite volume ocean model in Julia (Oceananigans.jl).

For now I have the model running reasonably fast on a CPU and I’m trying to get it to run efficiently on a GPU using CUDAnative and CuArrays. I’d really like to have the same code/kernel for both the CPU and GPU (with zero memory allocations) but I’m not sure how to do this as most operators involve stencil-like computations.

Almost-minimal working example: a global operator that calculates the x-difference (with periodic boundary conditions) of a 3D array `f` of size `(Nx, Ny, Nz)` and stores it in `δxf` might look like

```julia
function δx!(f, δxf, Nx, Ny, Nz)
    for k in 1:Nz, j in 1:Ny
        for i in 1:(Nx-1)
            @inbounds δxf[i, j, k] = f[i+1, j, k] - f[i, j, k]
        end
        @inbounds δxf[end, j, k] = f[1, j, k] - f[end, j, k]
    end
    nothing
end
```

Benchmarking shows it runs decently fast on the CPU, but it's terrible on the GPU:

```
~ julia
julia> using BenchmarkTools
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xc = rand(Nx, Ny, Nz); yc = zeros(Nx, Ny, Nz);
julia> @benchmark δx!(xc, yc, Nx, Ny, Nz)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     1.381 ms (0.00% GC)
median time:      1.391 ms (0.00% GC)
mean time:        1.393 ms (0.00% GC)
maximum time:     3.173 ms (0.00% GC)
--------------
samples:          3580
evals/sample:     1

~ nvprof --profile-from-start off /home/alir/julia/bin/julia
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xg = cu(rand(Nx, Ny, Nz)); yg = cu(rand(Nx, Ny, Nz));

==54821== Profiling application: /home/alir/julia/bin/julia
==54821== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  590.15ms         2  295.08ms  294.44ms  295.72ms  ptxcall__x__1

==54821== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
98.85%  9.3003ms         1  9.3003ms  9.3003ms  9.3003ms  cuModuleLoadDataEx
0.86%  81.203us         2  40.601us  40.423us  40.780us  cuLaunchKernel
0.08%  7.6330us         1  7.6330us  7.6330us  7.6330us  cuProfilerStart
0.08%  7.6270us         4  1.9060us     303ns  4.2190us  cuCtxGetCurrent
0.06%  5.6430us         1  5.6430us  5.6430us  5.6430us  cuDeviceGetCount
0.04%  3.7120us         4     928ns     196ns  2.3920us  cuDeviceGetAttribute
0.01%     863ns         1     863ns     863ns     863ns  cuModuleGetFunction
0.01%     758ns         2     379ns     314ns     444ns  cuDeviceGet
0.01%     625ns         1     625ns     625ns     625ns  cuCtxGetDevice
``````

I expected this to be slow: as far as I can tell, a single thread is doing all the work, so there's no room for parallelization.

Then I tried using an element-wise or local operator that calculates the x-difference for a single element of `f` at `[i, j, k]` and returns it.

```julia
# Not sure how to code an element-wise δx without branching.
@inline incmod1(a, n) = a == n ? 1 : a + 1
δx(f, Nx, i, j, k) = f[incmod1(i, Nx), j, k] - f[i, j, k]

# Basically just applies δx to all elements.
function time_stepping(f, δxf)
    Nx, Ny, Nz = size(f)
    for k in 1:Nz, j in 1:Ny, i in 1:Nx
        δxf[i, j, k] = δx(f, Nx, i, j, k)
    end
end
```

but this is even slower

```
~ julia
julia> using BenchmarkTools
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xc = rand(Nx, Ny, Nz); yc = zeros(Nx, Ny, Nz);
julia> @benchmark time_stepping(xc, yc)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     3.377 ms (0.00% GC)
median time:      3.387 ms (0.00% GC)
mean time:        3.449 ms (0.00% GC)
maximum time:     3.865 ms (0.00% GC)
--------------
samples:          1449
evals/sample:     1

~ nvprof --profile-from-start off /home/alir/julia/bin/julia
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xg = cu(rand(Nx, Ny, Nz)); yg = cu(rand(Nx, Ny, Nz));

==55354== Profiling application: /home/alir/julia/bin/julia
==55354== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  2.86728s         2  1.43364s  1.42919s  1.43808s  ptxcall_time_stepping_2

==55354== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
99.82%  98.188ms         1  98.188ms  98.188ms  98.188ms  cuModuleLoadDataEx
0.14%  138.82us         2  69.411us  51.497us  87.326us  cuLaunchKernel
0.02%  14.819us         1  14.819us  14.819us  14.819us  cuProfilerStart
0.01%  11.707us         6  1.9510us     173ns  9.1930us  cuDeviceGetAttribute
0.01%  7.3640us         6  1.2270us     200ns  2.9550us  cuCtxGetCurrent
0.00%  4.5960us         1  4.5960us  4.5960us  4.5960us  cuDeviceGetCount
0.00%     977ns         2     488ns     486ns     491ns  cuCtxGetDevice
0.00%     930ns         1     930ns     930ns     930ns  cuModuleGetFunction
0.00%     807ns         2     403ns     288ns     519ns  cuDeviceGet
``````
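As a sanity check, I did verify on the CPU that the element-wise formulation reproduces the original loop kernel. A quick self-contained sketch (both definitions repeated so it runs on its own):

```julia
@inline incmod1(a, n) = a == n ? 1 : a + 1
δx(f, Nx, i, j, k) = f[incmod1(i, Nx), j, k] - f[i, j, k]

# The original loop kernel, repeated here for the comparison.
function δx_loop!(f, δxf, Nx, Ny, Nz)
    for k in 1:Nz, j in 1:Ny
        for i in 1:(Nx-1)
            @inbounds δxf[i, j, k] = f[i+1, j, k] - f[i, j, k]
        end
        @inbounds δxf[end, j, k] = f[1, j, k] - f[end, j, k]
    end
    nothing
end

Nx, Ny, Nz = 8, 8, 8
f = rand(Nx, Ny, Nz)
a, b = zeros(Nx, Ny, Nz), zeros(Nx, Ny, Nz)
δx_loop!(f, a, Nx, Ny, Nz)
for k in 1:Nz, j in 1:Ny, i in 1:Nx
    b[i, j, k] = δx(f, Nx, i, j, k)
end
a == b  # both formulations perform the same float operations, so they agree exactly
```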

I know I can get the GPU kernel running much faster if I explicitly use block and threads but then I don’t think it would run on a CPU.

Other approaches I considered:

1. I’ve seen other threads around here that suggest using `CuArrays.@sync map!(...)` but I don’t think stencil operations can be mapped.
2. From reading CUDA tutorials on stencil computations, I think the usual CUDA approach is to use 16×16 thread blocks with careful shared-memory management, but I don't think a kernel written this way would work on a CPU.
3. I’ve also been looking at Cassette.jl, maybe it’s useful here in case different kernels must be written for the CPU and GPU? Maybe the “API” doesn’t need to know?
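For completeness, the periodic x-difference can also be written without explicit indexing using `circshift`. It allocates a shifted copy of `f`, so it's not a zero-allocation option, and I haven't checked how it performs on `CuArray`s; just a sketch:

```julia
# Periodic forward difference via circshift: shifting by -1 along dim 1
# makes element i of the shifted array equal f[i+1, j, k] (wrapping around),
# so the broadcast below computes f[i+1, j, k] - f[i, j, k].
f = rand(4, 3, 2)
δxf = circshift(f, (-1, 0, 0)) .- f
```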

If it matters, I am using an Nvidia Tesla P100 and Julia 1.0.1 (CUDAapi v0.5.3, CUDAdrv v0.8.6, CUDAnative v0.9.1, CuArrays v0.8.1).

Apologies for the wall of text! All this GPU stuff is new to me so please excuse me if I’ve made some rookie mistakes.


I think I may have found an answer using `@views` after skimming through Tim’s awesome presentation!

```julia
function δx!(f, δxf)
    @views @. δxf[1:end-1, :, :] = f[2:end, :, :] - f[1:end-1, :, :]
    @views @. δxf[end,     :, :] = f[1,     :, :] - f[end,     :, :]
    nothing
end
```

which is almost as fast on the CPU and gives a decent ~10x speedup on a GPU

```
~ julia
julia> using BenchmarkTools
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xc = rand(Nx, Ny, Nz); yc = zeros(Nx, Ny, Nz);
julia> @benchmark δx!($xc, $yc)
BenchmarkTools.Trial:
memory estimate:  384 bytes
allocs estimate:  6
--------------
minimum time:     1.645 ms (0.00% GC)
median time:      1.662 ms (0.00% GC)
mean time:        1.667 ms (0.00% GC)
maximum time:     3.740 ms (0.00% GC)
--------------
samples:          2996
evals/sample:     1

~ julia
julia> using CUDAnative, CUDAdrv, CuArrays, BenchmarkTools
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xg = cu(rand(Nx, Ny, Nz)); yg = cu(zeros(Nx, Ny, Nz));
julia> @benchmark δx!($xg, $yg)
BenchmarkTools.Trial:
memory estimate:  4.84 KiB
allocs estimate:  70
--------------
minimum time:     9.721 μs (0.00% GC)
median time:      178.504 μs (0.00% GC)
mean time:        169.339 μs (0.27% GC)
maximum time:     2.521 ms (99.15% GC)
--------------
samples:          10000
evals/sample:     1
``````

although I’m interested to look into why there’s such a big spread in run times. And apparently `@inbounds` slows things down on the CPU.
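On the 384 bytes above: my understanding (not verified in detail) is that each `@views` slice allocates a small `SubArray` wrapper, whereas plain slicing would copy the data outright:

```julia
f = rand(128, 128, 128)

# Plain indexing materializes a full copy of the slice...
@assert @allocated(f[2:end, :, :]) > 0

# ...while a view wraps the existing memory without copying the data.
v = @views f[2:end, :, :]
@assert v[1, 1, 1] == f[2, 1, 1]
@assert parent(v) === f
```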

I know double posting is somewhat taboo but I figured this is closer to a solution than an edit.


In your kernels, I don’t see any `threadIdx` anywhere, so you’re just doing the same serial computation on 128 threads? You need to parallelize manually, e.g. map your 128×128×128 domain onto the block/grid configuration and have every thread perform what’s basically a single iteration of the for loop. There’s no auto-parallelizing magic going on here; just adding `@cuda threads=...` won’t make something parallel.

Alternatively, you can use abstractions that have this parallelization built in, such as broadcast. Your last solution uses two fused broadcast kernels like that. Since it’s a very simple stencil kernel though, there’s both some overhead from using broadcast and the cost of launching two kernels. Fusing these into a single explicit CUDAnative stencil kernel should easily get you another 2x.

EDIT: gave it a quick go

```julia
julia> function kernel(f, δxf)
           Nx, Ny, Nz = size(f)
           k = threadIdx().x
           j = blockIdx().x
           i = blockIdx().y
           δxf[i, j, k] = δx(f, Nx, i, j, k)
           return
       end
julia> @cuda threads=Nx blocks=(Ny, Nz) kernel(xg,yg)

julia> @test yg ≈ yc
Test Passed

julia> @benchmark CuArrays.@sync @cuda threads=Nx blocks=(Ny, Nz) kernel(xg,yg)
BenchmarkTools.Trial:
memory estimate:  1.31 KiB
allocs estimate:  31
--------------
minimum time:     8.043 μs (0.00% GC)
median time:      6.456 ms (0.00% GC)
mean time:        6.266 ms (0.00% GC)
maximum time:     6.957 ms (0.00% GC)
--------------
samples:          266
evals/sample:     3
``````

Different hardware, different characteristics. But this only uses 128 threads, which heavily underuses my GPU; for trivial kernels like this you’d generally want to max out the thread count your device supports (typically 512 or 1024) and do a more elaborate computation across threads/blocks to figure out the indices.
Furthermore, implementing stencils efficiently has been researched heavily, so there are loads of possible improvements.
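For reference, a common way to do that index computation is to flatten the domain and launch a 1D grid; a sketch (the names here are mine), where inside the kernel one would set `idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x` and return early if `idx > Nx*Ny*Nz`:

```julia
# Recover column-major 3D indices (i, j, k) from a 1D linear index, so a
# 1D launch of at least Nx*Ny*Nz threads can cover the whole domain.
function linear_to_cartesian(idx, Nx, Ny)
    i = (idx - 1) % Nx + 1
    j = (idx - 1) ÷ Nx % Ny + 1
    k = (idx - 1) ÷ (Nx * Ny) + 1
    return i, j, k
end
```

This mapping agrees with Julia's own `CartesianIndices`, so it can be checked on the CPU before moving it into a kernel.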


Thanks for the example @maleadt and for devoting so much of your time to building up the GPU/CUDA infrastructure in Julia!

I must have been trying so hard to get a CPU/GPU shared kernel working that I forgot `@cuda threads=...` won’t automatically parallelize code without assigning threads/blocks.

Indeed I just tried out your kernel and `@benchmark` gives a median time of 1.614 ms with 128 threads (compared with 178.504 μs using `@views`). I’m sure it could be made faster as you say but I’m interested in seeing how long I can stick to `@views` and avoid planning out messy/elaborate thread/block kernels. I don’t mind sacrificing a bit of performance in favor of more readable and generic code. I will try fusing some smaller operations and see how well `@views` performs on a much bigger kernel which should hopefully help bring down the broadcasting overhead.

The big issue is that unfortunately I cannot use `threadIdx()`, `blockIdx()`, etc. as the kernel will not run on a CPU anymore, so I felt like the vast majority of research and tutorials out there on implementing efficient stencil calculations won’t help too much.

But if I can efficiently implement all the operators using `@views` that would be excellent! Then the same code should run on both CPUs (using regular Arrays) and GPUs (using CuArrays). Shouldn’t be too much work so I’ll give it a try.
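Sketching what I have in mind, a `δy!` by direct analogy (untested on the GPU):

```julia
# Same @views broadcasting pattern, differencing along y with periodic wrap.
function δy!(f, δyf)
    @views @. δyf[:, 1:end-1, :] = f[:, 2:end, :] - f[:, 1:end-1, :]
    @views @. δyf[:, end,     :] = f[:, 1,     :] - f[:, end,     :]
    nothing
end

f = rand(3, 4, 2)
δyf = zeros(3, 4, 2)
δy!(f, δyf)
```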