# Writing fast stencil computation kernels that work on both CPUs and GPUs

Hi everyone, I am developing a 3D finite volume ocean model in Julia (Oceananigans.jl).

For now I have the model running reasonably fast on a CPU and I’m trying to get it to run efficiently on a GPU using CUDAnative and CuArrays. I’d really like to have the same code/kernel for both the CPU and GPU (with zero memory allocations) but I’m not sure how to do this as most operators involve stencil-like computations.

Almost-minimal working example: a global operator that calculates the x-difference (with periodic boundary conditions) of a 3D array `f` of size `(Nx, Ny, Nz)` and stores it in `δxf` might look like

```julia
function δx!(f, δxf, Nx, Ny, Nz)
    for k in 1:Nz, j in 1:Ny
        for i in 1:(Nx-1)
            @inbounds δxf[i, j, k] = f[i+1, j, k] - f[i, j, k]
        end
        @inbounds δxf[end, j, k] = f[1, j, k] - f[end, j, k]
    end
    nothing
end
```

Benchmarking shows it runs decently fast on the CPU, but it's terrible on the GPU:

```
~ julia
julia> using BenchmarkTools
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xc = rand(Nx, Ny, Nz); yc = zeros(Nx, Ny, Nz);
julia> @benchmark δx!(xc, yc, Nx, Ny, Nz)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     1.381 ms (0.00% GC)
median time:      1.391 ms (0.00% GC)
mean time:        1.393 ms (0.00% GC)
maximum time:     3.173 ms (0.00% GC)
--------------
samples:          3580
evals/sample:     1

~ nvprof --profile-from-start off /home/alir/julia/bin/julia
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xg = cu(rand(Nx, Ny, Nz)); yg = cu(rand(Nx, Ny, Nz));

==54821== Profiling application: /home/alir/julia/bin/julia
==54821== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  590.15ms         2  295.08ms  294.44ms  295.72ms  ptxcall__x__1

==54821== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
98.85%  9.3003ms         1  9.3003ms  9.3003ms  9.3003ms  cuModuleLoadDataEx
0.86%  81.203us         2  40.601us  40.423us  40.780us  cuLaunchKernel
0.08%  7.6330us         1  7.6330us  7.6330us  7.6330us  cuProfilerStart
0.08%  7.6270us         4  1.9060us     303ns  4.2190us  cuCtxGetCurrent
0.06%  5.6430us         1  5.6430us  5.6430us  5.6430us  cuDeviceGetCount
0.04%  3.7120us         4     928ns     196ns  2.3920us  cuDeviceGetAttribute
0.01%     863ns         1     863ns     863ns     863ns  cuModuleGetFunction
0.01%     758ns         2     379ns     314ns     444ns  cuDeviceGet
0.01%     625ns         1     625ns     625ns     625ns  cuCtxGetDevice
``````

I expected this to be slow: as far as I can tell, a single thread is doing all the work, so there's no room for parallelization.

Then I tried using an element-wise or local operator that calculates the x-difference for a single element of `f` at `[i, j, k]` and returns it.

```julia
# Not sure how to code an element-wise δx without branching.
@inline incmod1(a, n) = a == n ? 1 : a + 1
δx(f, Nx, i, j, k) = f[incmod1(i, Nx), j, k] - f[i, j, k]

# Basically just applies δx to all elements.
function time_stepping(f, δxf)
    Nx, Ny, Nz = size(f)
    for k in 1:Nz, j in 1:Ny, i in 1:Nx
        δxf[i, j, k] = δx(f, Nx, i, j, k)
    end
end
```

but this is even slower

```
~ julia
julia> using BenchmarkTools
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xc = rand(Nx, Ny, Nz); yc = zeros(Nx, Ny, Nz);
julia> @benchmark time_stepping(xc, yc)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     3.377 ms (0.00% GC)
median time:      3.387 ms (0.00% GC)
mean time:        3.449 ms (0.00% GC)
maximum time:     3.865 ms (0.00% GC)
--------------
samples:          1449
evals/sample:     1

~ nvprof --profile-from-start off /home/alir/julia/bin/julia
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xg = cu(rand(Nx, Ny, Nz)); yg = cu(rand(Nx, Ny, Nz));

==55354== Profiling application: /home/alir/julia/bin/julia
==55354== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  2.86728s         2  1.43364s  1.42919s  1.43808s  ptxcall_time_stepping_2

==55354== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
99.82%  98.188ms         1  98.188ms  98.188ms  98.188ms  cuModuleLoadDataEx
0.14%  138.82us         2  69.411us  51.497us  87.326us  cuLaunchKernel
0.02%  14.819us         1  14.819us  14.819us  14.819us  cuProfilerStart
0.01%  11.707us         6  1.9510us     173ns  9.1930us  cuDeviceGetAttribute
0.01%  7.3640us         6  1.2270us     200ns  2.9550us  cuCtxGetCurrent
0.00%  4.5960us         1  4.5960us  4.5960us  4.5960us  cuDeviceGetCount
0.00%     977ns         2     488ns     486ns     491ns  cuCtxGetDevice
0.00%     930ns         1     930ns     930ns     930ns  cuModuleGetFunction
0.00%     807ns         2     403ns     288ns     519ns  cuDeviceGet
``````
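As a sanity check, I did verify on the CPU that the element-wise formulation reproduces the original loop kernel. A quick self-contained sketch (both definitions repeated so it runs on its own):

```julia
@inline incmod1(a, n) = a == n ? 1 : a + 1
δx(f, Nx, i, j, k) = f[incmod1(i, Nx), j, k] - f[i, j, k]

# The original loop kernel, repeated here for the comparison.
function δx_loop!(f, δxf, Nx, Ny, Nz)
    for k in 1:Nz, j in 1:Ny
        for i in 1:(Nx-1)
            @inbounds δxf[i, j, k] = f[i+1, j, k] - f[i, j, k]
        end
        @inbounds δxf[end, j, k] = f[1, j, k] - f[end, j, k]
    end
    nothing
end

Nx, Ny, Nz = 8, 8, 8
f = rand(Nx, Ny, Nz)
a, b = zeros(Nx, Ny, Nz), zeros(Nx, Ny, Nz)
δx_loop!(f, a, Nx, Ny, Nz)
for k in 1:Nz, j in 1:Ny, i in 1:Nx
    b[i, j, k] = δx(f, Nx, i, j, k)
end
a == b  # both formulations perform the same float operations, so they agree exactly
```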

I know I can get the GPU kernel running much faster if I explicitly use block and threads but then I don’t think it would run on a CPU.

Other approaches I considered:

1. I’ve seen other threads around here that suggest using `CuArrays.@sync map!(...)` but I don’t think stencil operations can be mapped.
2. From reading CUDA tutorials on stencil computations, I think the usual CUDA approach is to use 16×16 thread blocks with careful shared-memory management, but I don't think a kernel written this way would work on a CPU.
3. I’ve also been looking at Cassette.jl, maybe it’s useful here in case different kernels must be written for the CPU and GPU? Maybe the “API” doesn’t need to know?
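For completeness, the periodic x-difference can also be written without explicit indexing using `circshift`. It allocates a shifted copy of `f`, so it's not a zero-allocation option, and I haven't checked how it performs on `CuArray`s; just a sketch:

```julia
# Periodic forward difference via circshift: shifting by -1 along dim 1
# makes element i of the shifted array equal f[i+1, j, k] (wrapping around),
# so the broadcast below computes f[i+1, j, k] - f[i, j, k].
f = rand(4, 3, 2)
δxf = circshift(f, (-1, 0, 0)) .- f
```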

If it matters, I am using an Nvidia Tesla P100 and Julia 1.0.1 (CUDAapi v0.5.3, CUDAdrv v0.8.6, CUDAnative v0.9.1, CuArrays v0.8.1).

Apologies for the wall of text! All this GPU stuff is new to me so please excuse me if I’ve made some rookie mistakes.


I think I may have found an answer using `@views` after skimming through Tim’s awesome presentation!

```julia
function δx!(f, δxf)
    @views @. δxf[1:end-1, :, :] = f[2:end, :, :] - f[1:end-1, :, :]
    @views @. δxf[end,     :, :] = f[1,     :, :] - f[end,     :, :]
    nothing
end
```

which is almost as fast on the CPU and gives a decent ~10x speedup on a GPU

```
~ julia
julia> using BenchmarkTools
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xc = rand(Nx, Ny, Nz); yc = zeros(Nx, Ny, Nz);
julia> @benchmark δx!($xc, $yc)
BenchmarkTools.Trial:
memory estimate:  384 bytes
allocs estimate:  6
--------------
minimum time:     1.645 ms (0.00% GC)
median time:      1.662 ms (0.00% GC)
mean time:        1.667 ms (0.00% GC)
maximum time:     3.740 ms (0.00% GC)
--------------
samples:          2996
evals/sample:     1

~ julia
julia> using CUDAnative, CUDAdrv, CuArrays, BenchmarkTools
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xg = cu(rand(Nx, Ny, Nz)); yg = cu(zeros(Nx, Ny, Nz));
julia> @benchmark δx!($xg, $yg)
BenchmarkTools.Trial:
memory estimate:  4.84 KiB
allocs estimate:  70
--------------
minimum time:     9.721 μs (0.00% GC)
median time:      178.504 μs (0.00% GC)
mean time:        169.339 μs (0.27% GC)
maximum time:     2.521 ms (99.15% GC)
--------------
samples:          10000
evals/sample:     1
``````

although I’m interested to look into why there’s such a big spread in run times. And apparently `@inbounds` slows things down on the CPU.
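On the 384 bytes above: my understanding (not verified in detail) is that each `@views` slice allocates a small `SubArray` wrapper, whereas plain slicing would copy the data outright:

```julia
f = rand(128, 128, 128)

# Plain indexing materializes a full copy of the slice...
@assert @allocated(f[2:end, :, :]) > 0

# ...while a view wraps the existing memory without copying the data.
v = @views f[2:end, :, :]
@assert v[1, 1, 1] == f[2, 1, 1]
@assert parent(v) === f
```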

I know double posting is somewhat taboo but I figured this is closer to a solution than an edit.


In your kernels, I don’t see any `threadIdx` anywhere, so you’re just doing the same serial computation on 128 threads? You need to parallelize manually, e.g. map your 128×128×128 domain onto the block/grid configuration and have every thread perform what’s basically a single iteration of the for loop. There’s no auto-parallelizing magic going on here; just adding `@cuda threads=...` won’t make something parallel.

Alternatively, you can use abstractions that have this parallelization built in, such as broadcast. Your last solution uses two fused broadcast kernels like that. Since it’s a very simple stencil kernel though, there’s both some overhead from using broadcast and the cost of launching two kernels. Fusing these into a single explicit CUDAnative stencil kernel should easily get you another 2x.

EDIT: gave it a quick go

```julia
julia> function kernel(f, δxf)
           Nx, Ny, Nz = size(f)
           k = threadIdx().x
           j = blockIdx().x
           i = blockIdx().y
           δxf[i, j, k] = δx(f, Nx, i, j, k)
           return
       end
julia> @cuda threads=Nx blocks=(Ny, Nz) kernel(xg,yg)

julia> @test yg ≈ yc
Test Passed

julia> @benchmark CuArrays.@sync @cuda threads=Nx blocks=(Ny, Nz) kernel(xg,yg)
BenchmarkTools.Trial:
memory estimate:  1.31 KiB
allocs estimate:  31
--------------
minimum time:     8.043 μs (0.00% GC)
median time:      6.456 ms (0.00% GC)
mean time:        6.266 ms (0.00% GC)
maximum time:     6.957 ms (0.00% GC)
--------------
samples:          266
evals/sample:     3
``````

Different hardware, different characteristics. But this only uses 128 threads, which heavily underuses my GPU; for trivial kernels like this you’d generally want to max out the thread count your device supports (typically 512 or 1024) and do a more elaborate computation across threads/blocks to figure out the indices.
Furthermore, implementing stencils efficiently has been researched heavily, so there are loads of possible improvements.
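For reference, a common way to do that index computation is to flatten the domain and launch a 1D grid; a sketch (the names here are mine), where inside the kernel one would set `idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x` and return early if `idx > Nx*Ny*Nz`:

```julia
# Recover column-major 3D indices (i, j, k) from a 1D linear index, so a
# 1D launch of at least Nx*Ny*Nz threads can cover the whole domain.
function linear_to_cartesian(idx, Nx, Ny)
    i = (idx - 1) % Nx + 1
    j = (idx - 1) ÷ Nx % Ny + 1
    k = (idx - 1) ÷ (Nx * Ny) + 1
    return i, j, k
end
```

This mapping agrees with Julia's own `CartesianIndices`, so it can be checked on the CPU before moving it into a kernel.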


Thanks for the example @maleadt and for devoting so much of your time to building up the GPU/CUDA infrastructure in Julia!

I must have been trying so hard to get a CPU/GPU shared kernel working that I forgot `@cuda threads=...` won’t automatically parallelize code without assigning threads/blocks.

Indeed I just tried out your kernel and `@benchmark` gives a median time of 1.614 ms with 128 threads (compared with 178.504 μs using `@views`). I’m sure it could be made faster as you say but I’m interested in seeing how long I can stick to `@views` and avoid planning out messy/elaborate thread/block kernels. I don’t mind sacrificing a bit of performance in favor of more readable and generic code. I will try fusing some smaller operations and see how well `@views` performs on a much bigger kernel which should hopefully help bring down the broadcasting overhead.

The big issue is that unfortunately I cannot use `threadIdx()`, `blockIdx()`, etc. as the kernel will not run on a CPU anymore, so I felt like the vast majority of research and tutorials out there on implementing efficient stencil calculations won’t help too much.

But if I can efficiently implement all the operators using `@views` that would be excellent! Then the same code should run on both CPUs (using regular Arrays) and GPUs (using CuArrays). Shouldn’t be too much work so I’ll give it a try.
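Sketching what I have in mind, a `δy!` by direct analogy (untested on the GPU):

```julia
# Same @views broadcasting pattern, differencing along y with periodic wrap.
function δy!(f, δyf)
    @views @. δyf[:, 1:end-1, :] = f[:, 2:end, :] - f[:, 1:end-1, :]
    @views @. δyf[:, end,     :] = f[:, 1,     :] - f[:, end,     :]
    nothing
end

f = rand(3, 4, 2)
δyf = zeros(3, 4, 2)
δy!(f, δyf)
```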