I’m solving a semi-discretized PDE on my GPU and one of the steps involves performing the Laplacian via a 5-point stencil. My current method does this with a CUDA kernel, but I also want to perform automatic differentiation so that I can use a stiff ODE solver. It seems pretty clear to me that a ForwardDiff dual will not work with a CUDA kernel because it’s not a bitstype.

Is there a recommended way either to get automatic differentiation working with a generic CUDA kernel, or to implement a stencil on a GPU in a way that is compatible with ForwardDiff?

Thanks in advance.

Handwritten CUDA.jl kernels will recompile for different number types, so that's fine. That said, stencils are linear operators, so defining their derivatives directly isn't hard: I'd just define the derivative overload for NNlib.jl's `conv` and use it.
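To make both points concrete, here is a minimal CPU-only sketch (no GPU required) of a 5-point Laplacian written generically over the element type. The function name `laplacian!`, the grid size, and the spacing `h` are made up for illustration. It checks that a `ForwardDiff.Dual` wrapping `Float64` is an isbits type, which is what GPU arrays require of their elements, and that the tangent coming out of the stencil equals the stencil applied to the tangent going in, as expected for a linear operator. In a real CUDA.jl kernel the loop would be replaced by thread indexing; that part is not shown here.

```julia
using ForwardDiff: Dual, value, partials

# 5-point Laplacian on the interior points of a 2D grid.
# Generic in the element type, so it accepts Float64 or Dual entries alike.
# (`laplacian!` is a hypothetical name, not from any package.)
function laplacian!(out::AbstractMatrix, u::AbstractMatrix, h)
    nx, ny = size(u)
    for j in 2:ny-1, i in 2:nx-1
        out[i, j] = (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4u[i, j]) / h^2
    end
    return out
end

u = rand(6, 6)   # state
v = rand(6, 6)   # tangent direction for the Jacobian-vector product
h = 0.1          # assumed grid spacing

# Seed each entry of u with the matching entry of v as its single partial.
ud = Dual.(u, v)

# Duals of Float64 are plain bits types.
@show isbitstype(eltype(ud))

outd = laplacian!(zero(ud), ud, h)
jvp  = partials.(outd, 1)            # J * v obtained via dual numbers

# Because the stencil is linear, J * v is just the stencil applied to v.
@show jvp ≈ laplacian!(zeros(6, 6), v, h)
@show value.(outd) ≈ laplacian!(zeros(6, 6), u, h)
```

Since the Jacobian-vector product is just the same stencil applied to `v`, a hand-written derivative rule for a linear stencil can reuse the forward kernel.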

You could try something like `@tullio y[i] := -x[i] + 2x[i+1] - x[i+2]`; I believe such cases ought to be fairly efficient (including on the GPU and with derivatives).

(You may also be interested in ParallelStencil.jl, though I'm not sure it will help with derivatives.)

Although it doesn't use ForwardDiff, you could have a look at Enzyme.jl for differentiating GPU kernels. An example can be found in the GitHub repository PTsolvers/PT-AD (pseudo-transient auto-diff playground). There we use Enzyme to obtain the Jacobian-vector product and the gradient of the cost function via AD, as part of an adjoint-based gradient-descent inversion procedure with a nonlinear diffusion equation as the forward problem.
