Writing fast stencil computation kernels that work on both CPUs and GPUs


#1

Hi everyone, I am developing a 3D finite volume ocean model in Julia (Oceananigans.jl).

For now I have the model running reasonably fast on a CPU and I’m trying to get it to run efficiently on a GPU using CUDAnative and CuArrays. I’d really like to have the same code/kernel for both the CPU and GPU (with zero memory allocations) but I’m not sure how to do this as most operators involve stencil-like computations.

Almost minimal working example: a global operator that calculates an x-difference (with periodic boundary conditions) of a 3D array f of size (Nx, Ny, Nz) and storing it in δxf might look like

function δx!(f, δxf, Nx, Ny, Nz)
    for k in 1:Nz, j in 1:Ny
        for i in 1:(Nx-1)
            @inbounds δxf[i, j, k] =  f[i+1, j, k] - f[i, j, k]
        end
        @inbounds δxf[end, j, k] =  f[1, j, k] - f[end, j, k]
    end
    nothing
end

Benchmarking it shows it running decently fast on a CPU but it’s terrible on a GPU

~ julia
julia> using BenchmarkTools
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xc = rand(Nx, Ny, Nz); yc = zeros(Nx, Ny, Nz);
julia> @benchmark δx!(xc, yc, Nx, Ny, Nz)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.381 ms (0.00% GC)
  median time:      1.391 ms (0.00% GC)
  mean time:        1.393 ms (0.00% GC)
  maximum time:     3.173 ms (0.00% GC)
  --------------
  samples:          3580
  evals/sample:     1

~ nvprof --profile-from-start off /home/alir/julia/bin/julia
julia> using CUDAnative, CUDAdrv, CuArrays
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xg = cu(rand(Nx, Ny, Nz)); yg = cu(rand(Nx, Ny, Nz));
julia> CUDAdrv.@profile @cuda threads=128 time_stepping(xg, yg)
julia> CUDAdrv.@profile @cuda threads=128 time_stepping(xg, yg)

==54821== Profiling application: /home/alir/julia/bin/julia
==54821== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  590.15ms         2  295.08ms  294.44ms  295.72ms  ptxcall__x__1

==54821== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 98.85%  9.3003ms         1  9.3003ms  9.3003ms  9.3003ms  cuModuleLoadDataEx
  0.86%  81.203us         2  40.601us  40.423us  40.780us  cuLaunchKernel
  0.08%  7.6330us         1  7.6330us  7.6330us  7.6330us  cuProfilerStart
  0.08%  7.6270us         4  1.9060us     303ns  4.2190us  cuCtxGetCurrent
  0.06%  5.6430us         1  5.6430us  5.6430us  5.6430us  cuDeviceGetCount
  0.04%  3.7120us         4     928ns     196ns  2.3920us  cuDeviceGetAttribute
  0.01%     863ns         1     863ns     863ns     863ns  cuModuleGetFunction
  0.01%     758ns         2     379ns     314ns     444ns  cuDeviceGet
  0.01%     625ns         1     625ns     625ns     625ns  cuCtxGetDevice

I expected it to be slow because one thread is doing all the work (?) and there’s no room for parallelization.

Then I tried using an element-wise or local operator that calculates the x-difference for a single element of f at [i, j, k] and returns it.

# Not sure how to code an element-wise δx without branching.
@inline incmod1(a, n) = a == n ? 1 : a+1
δx(f, Nx, i, j, k) = f[incmod1(i, Nx), j, k] - f[i, j, k]

# Basically just applies δx to all elements.
function time_stepping(f, δxf)
    Nx, Ny, Nz = size(f)
    for k in 1:Nz, j in 1:Ny, i in 1:Nx
        δxf[i, j, k] = δx(f, Nx, i, j, k)
    end
end

but this is even slower

~ julia
julia> using BenchmarkTools
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xc = rand(Nx, Ny, Nz); yc = zeros(Nx, Ny, Nz);
julia> @benchmark time_stepping(xc, yc)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.377 ms (0.00% GC)
  median time:      3.387 ms (0.00% GC)
  mean time:        3.449 ms (0.00% GC)
  maximum time:     3.865 ms (0.00% GC)
  --------------
  samples:          1449
  evals/sample:     1

~ nvprof --profile-from-start off /home/alir/julia/bin/julia
julia> using CUDAnative, CUDAdrv, CuArrays
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xg = cu(rand(Nx, Ny, Nz)); yg = cu(rand(Nx, Ny, Nz));
julia> CUDAdrv.@profile @cuda threads=128 time_stepping(xg, yg)
julia> CUDAdrv.@profile @cuda threads=128 time_stepping(xg, yg)

==55354== Profiling application: /home/alir/julia/bin/julia
==55354== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  2.86728s         2  1.43364s  1.42919s  1.43808s  ptxcall_time_stepping_2

==55354== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.82%  98.188ms         1  98.188ms  98.188ms  98.188ms  cuModuleLoadDataEx
  0.14%  138.82us         2  69.411us  51.497us  87.326us  cuLaunchKernel
  0.02%  14.819us         1  14.819us  14.819us  14.819us  cuProfilerStart
  0.01%  11.707us         6  1.9510us     173ns  9.1930us  cuDeviceGetAttribute
  0.01%  7.3640us         6  1.2270us     200ns  2.9550us  cuCtxGetCurrent
  0.00%  4.5960us         1  4.5960us  4.5960us  4.5960us  cuDeviceGetCount
  0.00%     977ns         2     488ns     486ns     491ns  cuCtxGetDevice
  0.00%     930ns         1     930ns     930ns     930ns  cuModuleGetFunction
  0.00%     807ns         2     403ns     288ns     519ns  cuDeviceGet

I know I can get the GPU kernel running much faster if I explicitly use block and threads but then I don’t think it would run on a CPU.

Other approaches I considered:

  1. I’ve seen other threads around here that suggest using CuArrays.@sync map!(...) but I don’t think stencil operations can be mapped.
  2. From reading CUDA tutorials on stencil computations, I think the usual CUDA way of doing this is by using 16*16 thread blocks and careful shared memory management but I don’t think a kernel written this way would work on a CPU.
  3. I’ve also been looking at Casette.jl, maybe it’s useful here in case different kernels must be written for the CPU and GPU? Maybe the “API” doesn’t need to know?

If it matters, I am using an Nvidia Tesla P100 and Julia 1.0.1 (CUDAapi v0.5.3, CUDAdrv v0.8.6, CUDAnative v0.9.1, CuArrays v0.8.1).

Apologies for the wall of text! All this GPU stuff is new to me so please excuse me if I’ve made some rookie mistakes.


#2

I think I may have found an answer using @views after skimming through Tim’s awesome presentation!

function δx!(f, δxf)
    @views @. δxf[2:end, :, :] = f[2:end, :, :] - f[1:end-1, :, :]
    @views @. δxf[end,   :, :] = f[1,     :, :] - f[end,     :, :]
    nothing
end

which is almost as fast on the CPU and gives a decent ~10x speedup on a GPU

~ julia
julia> using BenchmarkTools
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xc = rand(Nx, Ny, Nz); yc = zeros(Nx, Ny, Nz);
julia> @benchmark δx!($xc, $yc)
BenchmarkTools.Trial:
  memory estimate:  384 bytes
  allocs estimate:  6
  --------------
  minimum time:     1.645 ms (0.00% GC)
  median time:      1.662 ms (0.00% GC)
  mean time:        1.667 ms (0.00% GC)
  maximum time:     3.740 ms (0.00% GC)
  --------------
  samples:          2996
  evals/sample:     1

~ julia
julia> using CUDAnative, CUDAdrv, CuArrays, BenchmarkTools
julia> Nx, Ny, Nz = 128, 128, 128;
julia> xg = cu(rand(Nx, Ny, Nz)); yg = cu(zeros(Nx, Ny, Nz));
julia> @benchmark δx!($xg, $yg)
BenchmarkTools.Trial:
  memory estimate:  4.84 KiB
  allocs estimate:  70
  --------------
  minimum time:     9.721 μs (0.00% GC)
  median time:      178.504 μs (0.00% GC)
  mean time:        169.339 μs (0.27% GC)
  maximum time:     2.521 ms (99.15% GC)
  --------------
  samples:          10000
  evals/sample:     1

although I’m interested to look into why there’s such a big spread in run times. And apparently @inbounds slows things down on the CPU.

I know double posting is somewhat taboo but I figured this is closer to a solution than an edit.


#3

In your kernels, I don’t see any threadIdx anywhere, so you’re just doing the same serial computation on 128 threads? You need to parallelize manually, eg. map your 128x128x128 over the block/grid configuration and have every thread perform what’s basically a single iteration of the for loop. There’s no autoparallelizing magic going on here, just adding @cuda threads=... won’t make something parallel.

Alternatively, you can use abstractions that have this parallelization built in, such as broadcast. Your last solution uses two fused broadcast kernels like that. Since it’s a very simple stencil kernel though, there’s both some overhead from using broadcast, as well as the fact that you’re using two kernels. Fusing these together in an explicit CUDAnative stencil kernel should easily get you another 2x.

EDIT: gave it a quick go

julia> function kernel(f, δxf)
         Nx, Ny, Nz = size(f)
         k = threadIdx().x
         j = blockIdx().x
         i = blockIdx().y
         δxf[i, j, k] = δx(f, Nx, i, j, k)
         return
       end
julia> @cuda threads=Nx blocks=(Ny, Nz) kernel(xg,yg)

julia> @test yg ≈ yc
Test Passed

julia> @benchmark CuArrays.@sync @cuda threads=Nx blocks=(Ny, Nz) kernel(xg,yg)
BenchmarkTools.Trial: 
  memory estimate:  1.31 KiB
  allocs estimate:  31
  --------------
  minimum time:     8.043 μs (0.00% GC)
  median time:      6.456 ms (0.00% GC)
  mean time:        6.266 ms (0.00% GC)
  maximum time:     6.957 ms (0.00% GC)
  --------------
  samples:          266
  evals/sample:     3

Different hardware, different characteristics. But this only uses 128 threads which is heavily underusing my GPU, you’d generally (for trivial kernels like this) want to max out the thread count that your device supports (typically 512 or 2014) and do a more elaborate computation between threads/blocks to figure out the indices.
Furthermore, implementing stencils efficiently has been researched heavily so there’s loads of possible improvements.


Generic CPU and GPU loops with tiled iteration
#4

Thanks for the example @maleadt and for devoting so much of your time to building up the GPU/CUDA infrastructure in Julia!

I must have been trying so hard to get a CPU/GPU shared kernel working that I forgot @cuda threads=... won’t automatically parallelize code without assigning threads/blocks.

Indeed I just tried out your kernel and @benchmark gives a median time of 1.614 ms with 128 threads (compared with 178.504 μs using @views). I’m sure it could be made faster as you say but I’m interested in seeing how long I can stick to @views and avoid planning out messy/elaborate thread/block kernels. I don’t mind sacrificing a bit of performance in favor of more readable and generic code. I will try fusing some smaller operations and see how well @views performs on a much bigger kernel which should hopefully help bring down the broadcasting overhead.

The big issue is that unfortunately I cannot use threadIdx(), blockIdx(), etc. as the kernel will not run on a CPU anymore, so I felt like the vast majority of research and tutorials out there on implementing efficient stencil calculations won’t help too much.

But if I can efficiently implement all the operators using @views that would be excellent! Then the same code should run on both CPUs (using regular Arrays) and GPUs (using CuArrays). Shouldn’t be too much work so I’ll give it a try.