How to use ForwardDiff with CuArrays

I want to use CuArrays to accelerate a Jacobian calculation for a function ℝ^m → ℝ^n where n ≫ m, which seems like a pretty good use case for ForwardDiff. I saw some usage of ForwardDiff in the tests for CuArrays, but I couldn’t figure out how to map what I was seeing in the test file to my (admittedly novice) usage of CuArrays, and I could not locate any documentation on the subject.
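For concreteness, here is a CPU-only sketch of the shape of the problem (the tall map `f` and its dimensions are made up for illustration; my real function differs):

```julia
using ForwardDiff, LinearAlgebra

# Toy stand-in for the real function: a tall map ℝ^m → ℝ^n with n >> m
m, n = 3, 1000
A = randn(n, m)
f(x) = tanh.(A * x)          # n outputs from m inputs

x = randn(m)
J = ForwardDiff.jacobian(f, x)
size(J)                      # (1000, 3): one row per output, one column per input
```

With only m = 3 inputs, ForwardDiff needs just a handful of dual-number sweeps to fill all 1000 rows, which is why the tall/skinny case suits forward mode.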

Here is an extremely simple MWE. If I allow scalar getindex, it works, but performance seems atrocious (most likely due to user error). Note that the gradient crashes under CuArrays.allowscalar(false), though plain evaluation of the function does not.

using Revise
using Flux, BenchmarkTools, CuArrays, CUDAnative, ForwardDiff, LinearAlgebra, Random

function tcudiff(N, ::Type{T} = Float32) where T<:Real
  A = rand(T, N, N)

  cuA = A |> gpu

  f(A) = sum(A .+ A*A .+ T(1))

  @info "test f cpu: $(f(A))"
  (N<5) && @info "test ∇cpu: $(ForwardDiff.gradient(f, A))"
  @btime ForwardDiff.gradient($f, $A)

  @info "test f gpu: $(f(cuA))"
  (N<5) && @info "test ∇gpu: $(ForwardDiff.gradient(f, cuA))"
  @btime ForwardDiff.gradient($f, $cuA)
end



[ Info: test f cpu: 60024.586
  351.611 ms (3004 allocations: 107.45 MiB)
[ Info: test f gpu: 60024.586
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays C:\Users\Clinton\.julia\packages\GPUArrays\JucdK\src\host\indexing.jl:43
  1.058 s (233939 allocations: 10.05 MiB)

What am I doing wrong?

ForwardDiff’s seeding isn’t GPU-safe, but you can use SparseDiffTools.jacobian for this (and then transpose).

I didn’t know that. Does using the SparseDiffTools.jacobian function make using the GPU kosher?

Yeah. It’s using ForwardDiff under the hood, but working with the duals directly. The issue is with how the duals are seeded in ForwardDiff.jacobian; SparseDiffTools.forward_color_jacobian vectorizes this because it naturally works with color vectors for the seeding, which has the side effect of working on GPUs. The sparse part of it doesn’t work well on GPUs, since CuArrays.jl isn’t able to lower one of the broadcast expressions we use (broadcasted getindex), but hey, at least dense Jacobians work well on the GPU there!
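To illustrate what “seeding the duals” means, here is a hand-rolled CPU sketch (not ForwardDiff’s actual internal code): each input gets one slot of an identity partials block, so a single broadcast-friendly evaluation carries all the directional derivatives at once, with no per-element getindex.

```julia
using ForwardDiff: Dual, value, partials

x = [1.0, 2.0, 3.0]
n = length(x)

# Seed input i with the i-th row of the identity as its partials,
# so one evaluation propagates all n directional derivatives together.
xd = [Dual(x[i], ntuple(j -> i == j ? 1.0 : 0.0, n)...) for i in 1:n]

y = sum(xd .* xd)                  # scalar Dual carrying the full gradient
g = [partials(y, i) for i in 1:n]  # gradient of sum(x.^2), i.e. 2 .* x
```

The color-vector approach generalizes this: inputs that never interact share a seed slot, so fewer than n slots are needed for sparse Jacobians.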


This is great: the cached CPU version is already quite speedy. Using CuArrays, even with a cache, seems to generate more overhead than I expected, so I will need to see how my problem scales once I stop working with the MWE.
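For reference, the in-place pattern I’m referring to looks roughly like this (names follow the SparseDiffTools API as I understand it; `f!` is a toy map):

```julia
using SparseDiffTools

f!(y, x) = (y .= x .^ 2; nothing)   # toy in-place map; Jacobian is Diagonal(2x)
x = [1.0, 2.0, 3.0]
J = zeros(3, 3)

# Dense case: the default color vector 1:length(x) gives one color per input
forwarddiff_color_jacobian!(J, f!, x)
```

As I understand it, a `ForwardColorJacCache` can be built once and passed as an extra argument to reuse the dual buffers across repeated Jacobian calls.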

One small issue: using a cache with CuArrays generates a segfault in the out-of-place version. Not a big deal, since the in-place version seems to work fine, though I am happy to file an issue if that would be useful.

Yup, open an issue please.

Done, see 108.

I am running into a similar problem using ForwardDiff and Zygote for Hessian vector products, so I am unable to use SparseDiffTools functionality. Is there a reference you could point to for how I need to change my Dual seeding in order to make this work?

I looked at forward_color_jacobian, but it was unclear what I need to change. Perhaps this is because the changes are quite complicated, or introducing Zygote.pullback creates some issues.

Hessian-vector products do not need forward_color_jacobian; you can just seed directly. Can you show what you ran?
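A minimal sketch of direct seeding for an Hvp (forward-over-forward here to keep it self-contained; swapping the outer ForwardDiff.gradient for Zygote’s pullback gives the forward-over-reverse variant; `hvp` and the toy `f` are made-up names):

```julia
using ForwardDiff
using ForwardDiff: Dual, partials

# Hv = d/dε ∇f(x + ε v) at ε = 0: seed v as the dual perturbation of x
hvp(f, x, v) = partials.(ForwardDiff.gradient(f, Dual.(x, v)), 1)

f(x) = sum(abs2, x) / 2 + x[1] * x[2]   # toy objective; Hessian is [1 1; 1 1]
x = [1.0, 2.0]
v = [1.0, 0.0]
hvp(f, x, v)                            # first column of the Hessian
```

No color vectors are involved: a single directional seed is exactly one dual slot.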

I was using the following code for the hvp operator. I thought I had fixed it by changing one.(val) to one(val) in the second-to-last line, but now I am having trouble reproducing the issue. Ignore the Nothing tag; I know I shouldn’t be doing that.

using LinearAlgebra
using ForwardDiff: Dual, partials
using Zygote: pullback

# In-place hvp operator compatible with Krylov.jl
mutable struct HvpOperator{F, T, I}
	f::F
	x::AbstractArray{T, 1}
	dualCache1::AbstractArray{Dual{Nothing, T, 1}}
	size::I
	nProd::I
end

function HvpOperator(f, x::AbstractVector)
	dualCache1 = Dual.(x, similar(x))
	return HvpOperator(f, x, dualCache1, size(x, 1), 0)
end

Base.eltype(op::HvpOperator{F, T, I}) where {F, T, I} = T
Base.size(op::HvpOperator) = (op.size, op.size)

function LinearAlgebra.mul!(result::AbstractVector, op::HvpOperator, v::AbstractVector)
	op.nProd += 1

	op.dualCache1 .= Dual.(op.x, v)
	val, back = pullback(op.f, op.dualCache1)

	result .= partials.(back(one(val))[1], 1)
end

In the GPU context, x should be a CUDA array. Like I said, I am having some trouble reproducing the issue, but maybe it is possible to tell whether my seeding process should be GPU compatible?
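For what it’s worth, the seeding step can be smoke-tested on the device in isolation (a hypothetical minimal check, written with the newer CUDA.jl names; under CuArrays it would be CuArrays.allowscalar(false)):

```julia
using CUDA, ForwardDiff
using ForwardDiff: Dual, partials

CUDA.allowscalar(false)   # fail fast on any scalar-indexing fallback

x = CUDA.rand(Float32, 8)
v = CUDA.rand(Float32, 8)

xd = Dual.(x, v)          # broadcast seeding: stays on the device
y  = sum(xd .* xd)        # scalar Dual result via a GPU reduction
partials(y, 1)            # directional derivative, 2 * dot(x, v)
```

If this runs without a scalar-indexing warning, the seeding itself is GPU-safe and the problem lies elsewhere (e.g. in the pullback).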

Oh this has nothing to do with the original post. Make a different thread.

I am currently having some GPU issues that prevent me from recreating the problem, but if it comes up again I will open a new thread! Sorry for extending this one.