How to use ForwardDiff with CuArrays

I want to use CuArrays to accelerate a Jacobian calculation for a function f: ℝᵐ → ℝⁿ with n ≫ m, which seems like a pretty good use case for ForwardDiff. I saw some usage of ForwardDiff in the tests for CuArrays, but I couldn’t figure out how to map what I was seeing in the test file to my (admittedly novice) usage of CuArrays, and there was no documentation on the subject that I could locate.

Here is an extremely simple MWE. If I allow scalar getindex, it works, but performance seems atrocious (most likely due to user error). Note that the gradient crashes with `CuArrays.allowscalar(false)`, though plain evaluation of the function does not.

using Revise
using Flux, BenchmarkTools, CuArrays, CUDAnative, ForwardDiff, LinearAlgebra, Random

function tcudiff(N, ::Type{T} = Float32) where T<:Real
  A = rand(T, N, N)

  cuA = A |> gpu

  f(A) = sum(A .+ A*A .+ T(1))

  @info "test f cpu: $(f(A))"
  (N<5) && @info "test ∇cpu: $(ForwardDiff.gradient(f, A))"
  @btime ForwardDiff.gradient($f, $A)

  @info "test f gpu: $(f(cuA))"
  (N<5) && @info "test ∇gpu: $(ForwardDiff.gradient(f, cuA))"
  @btime ForwardDiff.gradient($f, $cuA)
end

Calling `tcudiff` produces:
[ Info: test f cpu: 60024.586
  351.611 ms (3004 allocations: 107.45 MiB)
[ Info: test f gpu: 60024.586
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays C:\Users\Clinton\.julia\packages\GPUArrays\JucdK\src\host\indexing.jl:43
  1.058 s (233939 allocations: 10.05 MiB)

What am I doing wrong?

ForwardDiff’s seeding isn’t GPU-safe, but you can use `SparseDiffTools.jacobian` for this (and then transpose).

I didn’t know that. Does using the SparseDiffTools.jacobian function make using the GPU kosher?

Yeah. It’s using ForwardDiff under the hood, but working with the duals directly. The issue is with how the duals are seeded in `ForwardDiff.jacobian`; `SparseDiffTools.forward_color_jacobian` vectorizes this step because it naturally works with color vectors for the seeding, which has the side effect of working on GPUs. The sparse part doesn’t work well on GPUs, since CuArrays.jl isn’t able to lower one of the broadcast expressions we use (broadcasted `getindex`), but dense GPU Jacobians work well there!
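For anyone landing here later, a minimal sketch of what this looks like. The function name below (`forwarddiff_color_jacobian`) is what recent SparseDiffTools versions export; older versions spelled it differently, so check your installed version:

```julia
using SparseDiffTools, CuArrays

# Dense Jacobian of an ℝᵐ → ℝᵐ map; any ℝᵐ → ℝⁿ map works the same way.
f(x) = x .+ x .* x .+ 1f0

x = rand(Float32, 100)
J = forwarddiff_color_jacobian(f, x)      # dense Jacobian on the CPU

cux = CuArrays.cu(x)
cuJ = forwarddiff_color_jacobian(f, cux)  # same call; duals broadcast on the GPU
```

With no `colorvec` argument this seeds one color per input component, i.e. a plain dense forward-mode Jacobian, which is the GPU-friendly path described above.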


This is great: the cached CPU version is already quite speedy. Using CuArrays, even with a cache, seems to generate more overhead than I expected, so I will need to see how my problem scales once I stop working with the MWE.
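For reference, a sketch of the cached, in-place path being discussed here. The cache type and mutating call (`ForwardColorJacCache`, `forwarddiff_color_jacobian!`) are the names in recent SparseDiffTools releases; treat the exact keywords as assumptions and check the package README for your version:

```julia
using SparseDiffTools

# In-place ℝᵐ → ℝᵐ map: writes the result into y.
f!(y, x) = (y .= x .+ x .* x .+ 1f0)

m = 100
x = rand(Float32, m)
y = similar(x)
J = zeros(Float32, m, m)

# Build the dual/seed buffers once, then reuse them on every call.
cache = ForwardColorJacCache(f!, x; dx = y)
forwarddiff_color_jacobian!(J, f!, x, cache)  # fills J without reallocating
```

The same pattern should carry over to the GPU by making `x`, `y`, and `J` CuArrays, modulo the out-of-place segfault mentioned below.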

One small issue: using a cache with CuArrays generates a segfault on the out-of-place version. Not a big deal, since the in-place version seems to work fine, though I am happy to file an issue if that would be useful.

Yup, open an issue please.

Done, see issue 108.