How to use ForwardDiff with CuArrays

clinton · May 15, 2020, 6:03pm

I want to use CuArrays to accelerate a Jacobian calculation for a function \Re^m \mapsto \Re^n where n>>m, which seems like a pretty good use case for ForwardDiff. I saw some usage of ForwardDiff in the tests for CuArrays, but I couldn’t figure out how to map what I was seeing in the test file to my (admittedly novice) usage of CuArrays, and there was no documentation on the subject that I could locate.

Here is an extremely simple mwe. IF I allow scalar getindex, it works yet performance seems atrocious (most likely due to user error). Note that the gradient crashes if CuArrays.allowscalar(false), though simple evaluation of the function does not.

using Revise
using Flux, BenchmarkTools, CuArrays, CUDAnative, ForwardDiff, LinearAlgebra, Random
CuArrays.allowscalar(true)
Random.seed!(1111)

function tcudiff(N, ::Type{T} = Float32) where T<:Real
  A = rand(T, N, N)

  cuA = A |> gpu

  f(A) = sum(A .+ A*A .+ T(1))

  @info "test f cpu: $(f(A))"
  (N<5) && @info "test ∇cpu: $(ForwardDiff.gradient(f, A))"
  @btime ForwardDiff.gradient($f, $A)

  @info "test f gpu: $(f(cuA))"
  (N<5) && @info "test ∇gpu: $(ForwardDiff.gradient(f, cuA))"
  @btime ForwardDiff.gradient($f, $cuA)
end

tcudiff(60)

Output:

[ Info: test f cpu: 60024.586
  351.611 ms (3004 allocations: 107.45 MiB)
[ Info: test f gpu: 60024.586
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays C:\Users\Clinton\.julia\packages\GPUArrays\JucdK\src\host\indexing.jl:43
  1.058 s (233939 allocations: 10.05 MiB)

What am I doing wrong?

ChrisRackauckas · May 15, 2020, 6:40pm

ForwardDiff’s seeding isn’t GPU-safe. But you can use SparseDiffTools.jacobian for this (and then transpose)

clinton · May 15, 2020, 8:21pm

I didn’t know that. Does using the SparseDiffTools.jacobian function make using the GPU kosher?

ChrisRackauckas · May 15, 2020, 8:29pm

Yeah. It’s using ForwardDiff under the hood, but directly using the duals. The issue is with how the duals are seeded in ForwardDiff.jacobian, while SparseDiffTools.forward_color_jacobian vectorizes this because it naturally works with color vectors for the seeding, which then has a side effect of working on GPUs. The sparse part of it doesn’t work well on GPUs since CuArrays.jl isn’t able to lower one of the broadcast expressions we use (broadcasted getindex), but hey, at least dense GPUs works well there!

clinton · May 16, 2020, 1:00am

This is great- the cached CPU version is quite speedy already. Using CuArrays, even cached, seems to generate more overhead than I expected, so I will need to see how my problem scales once I stop working with the MWE.

One small issue- using a cache with CuArrays generates a segfault on the out-of-place version. Not a big deal since the in-place version seems to work fine, though I am happy to file an issue if that would be useful.

ChrisRackauckas · May 16, 2020, 2:44am

Yup, open an issue please.

clinton · May 16, 2020, 3:54am

Done- see 108

RS-Coop · January 24, 2022, 5:09pm

I am running into a similar problem using ForwardDiff and Zygote for Hessian vector products, so I am unable to use SparseDiffTools functionality. Is there a reference you could point to for how I need to change my Dual seeding in order to make this work?

I looked at forward_color_jacobian, but it was unclear what I need to change. Perhaps this is because the changes are quite complicated, or introducing Zygote.pullback creates some issues.

ChrisRackauckas · January 24, 2022, 5:36pm

Hessian-vector products does not need forward_color_jacobian. You can just direct seed. Can you show what you ran?

RS-Coop · January 25, 2022, 12:34am

I was using the following code for the hvp operator. I thought I had fixed it by changing one.(val) to one(val) in the second to last line, but now I am having trouble reproducing the issue. Ignore the Nothing tag, I know I shouldn’t be doing that.

#=
In-place hvp operator compatible with Krylov.jl
=#
mutable struct HvpOperator{F, T, I}
	f::F
	x::AbstractArray{T, 1}
	dualCache1::AbstractArray{Dual{Nothing, T, 1}}
	size::I
	nProd::I
end

function HvpOperator(f, x::AbstractVector)
	dualCache1 = Dual.(x, similar(x))
	return HvpOperator(f, x, dualCache1, size(x, 1), 0)
end

Base.eltype(op::HvpOperator{F, T, I}) where{F, T, I} = T
Base.size(op::HvpOperator) = (op.size, op.size)

function LinearAlgebra.mul!(result::AbstractVector, op::HvpOperator, v::AbstractVector)
	op.nProd += 1

	op.dualCache1 .= Dual.(op.x, v)
	val, back = pullback(op.f, op.dualCache1)

	result .= partials.(back(one.(val))[1], 1)
end

In the GPU context x should be a CUDA array. Like I said, I am having some trouble reproducing the issue, but maybe it is possible to tell if my seeding process should be GPU compatible?

ChrisRackauckas · January 25, 2022, 1:28am

Oh this has nothing to do with the original post. Make a different thread.

RS-Coop · January 26, 2022, 3:55am

Currently having some gpu issues preventing me from recreating the issue, but if it comes up again I will do so! Sorry for extending this thread.

Topic		Replies	Views
ForwardDiff of function with internal derivatives using CuArrays GPU forwarddiff	8	874	March 31, 2021
How compute `gradient.(f, w)` on GPU? Machine Learning first-steps	2	912	November 16, 2019
Hessian Vector Products on GPU using ForwardDiff, Zygote, and Flux Machine Learning gpu , cuda , flux , zygote , forwarddiff	13	1336	January 31, 2022
Could Optim.TwiceDifferentiable support CuArray? Optimization (Mathematical) question , cuda , optim , optimization	5	863	March 12, 2022
Avoiding redundant computation when using Zygote and ForwardDiff for forward-over-reverse hessians General Usage forwarddiff , reversediff , autodiff	4	70	March 12, 2025

How to use ForwardDiff with CuArrays

Related topics