Writing stencils for CuArray

I am adapting a PDE solver to work on the GPU. Being lazy, I was hoping that something like

using CuArrays

A = rand(5, 5, 5) |> CuArray{Float64}
B = similar(A)
Is = CartesianIndices((2:4, 2:4))
@views @. B[Is.indices..., :] = A[Is.indices..., :]

would work (it does on the CPU), but instead it gives me a MethodError: no method matching Array(::Tuple{UnitRange{Int64},UnitRange{Int64},Base.Slice{Base.OneTo{Int64}}}). Looking at subarray.jl in CuArrays, it looks like this is caused by the fallback to SubArray when the memory is non-contiguous. Indeed, view-based broadcasting works without issue when the view is contiguous.
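
To make the contiguity point concrete, here is a minimal comparison; the second broadcast is the one that triggers the MethodError:

using CuArrays

A = CuArray(rand(5, 5, 5))
B = similar(A)

# A slice along the trailing dimension maps to one contiguous block of
# device memory (Julia arrays are column-major), so this stays on the
# fast CuArray broadcast path:
@views B[:, :, 2:4] .= A[:, :, 2:4]

# A slice that skips elements in the leading dimensions is non-contiguous,
# so it falls back to SubArray and hits the error above:
# @views B[2:4, 2:4, :] .= A[2:4, 2:4, :]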

A few questions:

  1. Are there plans to make the above behavior work?
  2. I should probably just implement an optimized stencil for my code anyway. I have seen Tim Besard’s slides and a previous discussion, but they are light on details about how to write stencils using CUDAnative. Does anyone have suggestions for literature or examples in the wild?
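
For concreteness, here is the kind of thing I imagine writing, a minimal 1-D three-point Laplacian as a raw CUDAnative kernel. This is a rough, untested sketch; the function name and launch parameters are just illustrative:

using CUDAnative, CuArrays

function laplacian_kernel!(B, A, n)
    # One thread per element, with the global index computed from the
    # CUDA launch geometry.
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if 2 <= i <= n - 1
        @inbounds B[i] = A[i - 1] - 2 * A[i] + A[i + 1]
    end
    return nothing
end

n = 1024
A = CuArray(rand(Float64, n))
B = similar(A)
threads = 256
@cuda threads=threads blocks=cld(n, threads) laplacian_kernel!(B, A, n)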

EDIT: I just discovered GPUifyLoops.jl. I have a loop version of my code, so I’ll give it a try and report back. Still interested in hearing about anyone’s experience coding stencils by hand in CUDAnative or with GPUifyLoops.

The main customer for GPUifyLoops is Oceananigans, so it’s worth checking out how their kernels are implemented.

Update:

Thanks for the link! I got it to work with GPUifyLoops pretty easily using that as a scaffold (mostly for debugging purposes; the tutorials across the GPU libraries are fairly… sparse). It seems to give the speedups I was expecting, though I’m sure there’s more performance I could squeeze out by dropping down to hand-written kernels if I needed to.

The biggest issue I ran into was understanding where and how to call the GPUified functions; done incorrectly, it gave me world-age issues (due to the way I was writing a function closure and calling the code). The error was a vague tt not defined from the internals of CUDAnative, in case anyone runs into it in the future.

Errors like that are almost always bugs, so please file issues for them.

Hi @platawiec, I’m the main developer of Oceananigans.jl and have been using GPUifyLoops.jl for a while with great results. Apologies for the messy code if you looked at it; we’re still in development/bootstrap mode!

You probably already know this, but depending on the PDE you’re trying to solve you should be able to fully utilize the GPU, as long as your numerical methods parallelize well. I had to write the operators to act element-wise (e.g. calculate the x-derivative at i,j,k, calculate the viscosity at i,j,k, etc.) so that each thread could apply them, then tried to cram as many operations as possible into each kernel.
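
Simplified, illustrative sketches of what I mean (made-up names and unit grid spacing, not our actual operators):

# Element-wise operators: each takes the indices of a single grid point and
# returns a scalar, so every GPU thread can evaluate them independently.
@inline δx(f, i, j, k) = @inbounds f[i + 1, j, k] - f[i, j, k]
@inline ∂x(f, Δx, i, j, k) = δx(f, i, j, k) / Δx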

This should work great for hyperbolic PDEs (we explicitly time-step a form of the Navier-Stokes equations). We also solve an elliptic problem, a Poisson equation for the pressure field, with FFTs, which are thankfully super fast on the GPU.
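
Conceptually the spectral solve is just a forward transform, a division by -k², and an inverse transform. A rough sketch of the idea, not our actual solver; I’m assuming the generic fft/ifft from AbstractFFTs, which CuArrays’ CUFFT wrapper extends, so treat the API details as approximate:

using CuArrays, AbstractFFTs

# Solve ∇²ϕ = f on a periodic [0, 2π)² grid with integer wavenumbers.
function poisson_fft(f::CuArray{ComplexF64,2})
    nx, ny = size(f)
    wavenumbers(n) = [i - 1 <= n ÷ 2 ? i - 1 : i - 1 - n for i in 1:n]
    kx, ky = wavenumbers(nx), wavenumbers(ny)
    k2 = Float64[kx[i]^2 + ky[j]^2 for i in 1:nx, j in 1:ny]
    k2[1, 1] = Inf                    # sends the mean mode to zero in the division
    fh = fft(f)
    return ifft(fh ./ CuArray(-k2))   # ϕ̂ = -f̂ / k²
end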

I have a pretty old PR open with an example showing how to calculate the divergence of a vector field with GPUifyLoops.jl in case anyone finds it helpful: https://github.com/vchuravy/GPUifyLoops.jl/pull/18
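
From memory, the kernel in that PR boils down to something like this (treat the details as approximate; the differencing here is a plain forward difference with unit spacing):

using GPUifyLoops

# Divergence of (u, v, w). The (CPU range; GPU index) pairs let the same
# code run on both backends.
function divergence!(div, u, v, w, Nx, Ny, Nz)
    @loop for k in (1:Nz; (blockIdx().z - 1) * blockDim().z + threadIdx().z)
        @loop for j in (1:Ny; (blockIdx().y - 1) * blockDim().y + threadIdx().y)
            @loop for i in (1:Nx; (blockIdx().x - 1) * blockDim().x + threadIdx().x)
                if i < Nx && j < Ny && k < Nz
                    @inbounds div[i, j, k] = (u[i + 1, j, k] - u[i, j, k]) +
                                             (v[i, j + 1, k] - v[i, j, k]) +
                                             (w[i, j, k + 1] - w[i, j, k])
                end
            end
        end
    end
    @synchronize
end

# On the GPU this would be launched with something like (again, from memory):
# @launch CUDA() threads=(8, 8, 8) blocks=(Nx÷8, Ny÷8, Nz÷8) divergence!(div, u, v, w, Nx, Ny, Nz)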

Kernels with large stencils or heavy memory access patterns seem to require lots of registers, which means you can’t saturate the GPU with threads. With the help of @maleadt and @vchuravy we’ve been experimenting with shared memory for these kernels, but haven’t implemented anything complete yet. There’s an open PR in GPUifyLoops with an @stencil abstraction if you’re interested: https://github.com/vchuravy/GPUifyLoops.jl/pull/81
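
To give a flavor of the shared-memory idea (a hand-rolled CUDAnative sketch, not what the @stencil PR generates): each block stages its tile of the input plus a halo point on each side into shared memory, synchronizes, and then reads neighbors from the tile instead of from global memory.

using CUDAnative, CuArrays

# 1-D three-point stencil with a shared-memory tile (block size 256 + 2 halo).
function smem_laplacian!(B, A, n)
    tile = @cuStaticSharedMem(Float64, 258)
    tx = threadIdx().x
    i = (blockIdx().x - 1) * blockDim().x + tx
    if i <= n
        tile[tx + 1] = A[i]                                       # interior point
        tx == 1 && i > 1 && (tile[1] = A[i - 1])                  # left halo
        tx == blockDim().x && i < n && (tile[tx + 2] = A[i + 1])  # right halo
    end
    sync_threads()
    if 2 <= i <= n - 1
        @inbounds B[i] = tile[tx] - 2 * tile[tx + 1] + tile[tx + 2]
    end
    return nothing
end

n = 2^20
A = CuArray(rand(Float64, n))
B = similar(A)
@cuda threads=256 blocks=cld(n, 256) smem_laplacian!(B, A, n)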

Is your code open source? Would love to see another code doing GPU stencils in Julia!

I tried to reproduce the error but didn’t have any luck; I think I was making some syntax mistake, and now I can’t remember exactly what it was. I agree it would have been nice to file an issue, even if only to get a better error message, so I’ll try to be more diligent next time.

Thanks for the response, and for your work on Oceananigans! I got it working by peeking at your code. For now I’m happy with the speedups, but looking ahead I think there are performance gains to be had from a more sophisticated implementation.

Even though I have a nominal idea of how GPU architecture works, what I need help understanding is how that translates into optimal array/memory layout, optimal operation order, and when shared memory becomes necessary.