I want to use CuArrays to accelerate a Jacobian calculation for a function ℝ^m → ℝ^n where n >> m, which seems like a pretty good use case for ForwardDiff. I saw some usage of ForwardDiff in the tests for CuArrays, but I couldn’t figure out how to map what I was seeing in the test file to my (admittedly novice) usage of CuArrays, and there was no documentation on the subject that I could locate.
Here is an extremely simple MWE. If I allow scalar getindex, it works, but performance seems atrocious (most likely due to user error). Note that the gradient crashes under CuArrays.allowscalar(false), though simple evaluation of the function does not.
using Revise
using Flux, BenchmarkTools, CuArrays, CUDAnative, ForwardDiff, LinearAlgebra, Random
CuArrays.allowscalar(true)
Random.seed!(1111)
function tcudiff(N, ::Type{T} = Float32) where T<:Real
    A = rand(T, N, N)
    cuA = A |> gpu
    f(A) = sum(A .+ A*A .+ T(1))
    @info "test f cpu: $(f(A))"
    (N<5) && @info "test ∇cpu: $(ForwardDiff.gradient(f, A))"
    @btime ForwardDiff.gradient($f, $A)
    @info "test f gpu: $(f(cuA))"
    (N<5) && @info "test ∇gpu: $(ForwardDiff.gradient(f, cuA))"
    @btime ForwardDiff.gradient($f, $cuA)
end
tcudiff(60)
Output:
[ Info: test f cpu: 60024.586
351.611 ms (3004 allocations: 107.45 MiB)
[ Info: test f gpu: 60024.586
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays C:\Users\Clinton\.julia\packages\GPUArrays\JucdK\src\host\indexing.jl:43
1.058 s (233939 allocations: 10.05 MiB)
Yeah. It’s using ForwardDiff under the hood, but working directly with the duals. The issue is with how the duals are seeded in ForwardDiff.jacobian, while SparseDiffTools.forward_color_jacobian vectorizes this because it naturally works with color vectors for the seeding, which then has the side effect of working on GPUs. The sparse part of it doesn’t work well on GPUs since CuArrays.jl isn’t able to lower one of the broadcast expressions we use (broadcasted getindex), but hey, at least the dense GPU case works well there!
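To make the dense path concrete, a minimal sketch of a dense GPU Jacobian call with SparseDiffTools might look like the following. I am assuming the exported in-place entry point forwarddiff_color_jacobian! and the default colorvec (one color per input, i.e. the dense case); the function g!, the size N, and the Float32 element type are placeholders rather than anything from this thread:

using SparseDiffTools, CuArrays

N = 60
x = CuArrays.cu(rand(Float32, N))      # input lives on the GPU
J = CuArrays.zeros(Float32, N, N)      # dense Jacobian storage on the GPU

# In-place map ℝ^N → ℝ^N built from GPU-friendly broadcasts only
g!(y, x) = (y .= x .+ x .^ 2 .+ 1f0; nothing)

# Default colorvec = 1:length(x) gives every column its own color,
# which is exactly the dense case described above
forwarddiff_color_jacobian!(J, g!, x)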
This is great: the cached CPU version is quite speedy already. Using CuArrays, even cached, seems to generate more overhead than I expected, so I will need to see how my problem scales once I stop working with the MWE.
One small issue: using a cache with CuArrays generates a segfault on the out-of-place version. Not a big deal, since the in-place version seems to work fine, though I am happy to file an issue if that would be useful.
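For reference, the cached in-place pattern that seems to work on the GPU looks roughly like this; the ForwardColorJacCache keyword names are my best recollection of the API rather than something verified here, and g!, x, and J are the same placeholders as in the sketch above:

using SparseDiffTools, CuArrays

N = 60
x = CuArrays.cu(rand(Float32, N))
J = CuArrays.zeros(Float32, N, N)
g!(y, x) = (y .= x .+ x .^ 2 .+ 1f0; nothing)

# Build the dual/seed buffers once, then reuse them across Jacobian calls
cache = ForwardColorJacCache(g!, x; dx = similar(x))
forwarddiff_color_jacobian!(J, g!, x, cache)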
I am running into a similar problem using ForwardDiff and Zygote for Hessian-vector products, so I am unable to use the SparseDiffTools functionality directly. Is there a reference you could point to for how I need to change my Dual seeding in order to make this work?
I looked at forward_color_jacobian, but it was unclear what I would need to change. Perhaps that is because the changes are quite complicated, or because introducing Zygote.pullback creates some issues.
I was using the following code for the HVP operator. I thought I had fixed it by changing one.(val) to one(val) in the second-to-last line, but now I am having trouble reproducing the issue. Ignore the Nothing tag; I know I shouldn’t be doing that.
#=
In-place hvp operator compatible with Krylov.jl
=#
using LinearAlgebra
using ForwardDiff: Dual, partials
using Zygote: pullback

mutable struct HvpOperator{F, T, I}
    f::F
    x::AbstractArray{T, 1}
    dualCache1::AbstractArray{Dual{Nothing, T, 1}}
    size::I
    nProd::I
end

function HvpOperator(f, x::AbstractVector)
    dualCache1 = Dual.(x, similar(x))  # one single-partial dual per element of x
    return HvpOperator(f, x, dualCache1, size(x, 1), 0)
end

Base.eltype(op::HvpOperator{F, T, I}) where {F, T, I} = T
Base.size(op::HvpOperator) = (op.size, op.size)

function LinearAlgebra.mul!(result::AbstractVector, op::HvpOperator, v::AbstractVector)
    op.nProd += 1
    op.dualCache1 .= Dual.(op.x, v)           # forward seed: partials carry v
    val, back = pullback(op.f, op.dualCache1) # reverse pass over the dual input
    result .= partials.(back(one.(val))[1], 1)
end
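For context, a minimal CPU usage sketch might look like the following; the toy loss, the sizes, and the Krylov.cg call are placeholders of my own rather than part of the original code (the struct above already defines the mul!, size, and eltype methods that Krylov-style solvers typically rely on):

using Krylov, LinearAlgebra

floss(x) = sum(x .^ 2 .+ x)    # toy loss whose Hessian is 2I
x0 = rand(Float32, 8)
op = HvpOperator(floss, x0)

b = rand(Float32, 8)
hv = similar(b)
mul!(hv, op, b)                # single Hessian-vector product; should be ≈ 2b here

sol, stats = Krylov.cg(op, b)  # solve H * sol = b with CG, using only mul! on op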
In the GPU context, x would be a CuArray. Like I said, I am having some trouble reproducing the issue, but is it possible to tell from the code above whether my seeding process should be GPU compatible?
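The sort of sanity check I had in mind for the GPU side is below: run the operator on CuArrays with scalar indexing disabled, so that any scalar getindex hidden in the seeding or the pullback throws immediately. The test function and sizes are made up, and whether the Zygote pullback itself stays scalar-free is exactly the open question here:

using CuArrays, LinearAlgebra
using ForwardDiff: Dual, partials
using Zygote: pullback

CuArrays.allowscalar(false)    # any scalar getindex will now raise an error

floss(x) = sum(x .* x .+ x)    # Hessian is 2I, so the product below should be ≈ 2v
x = CuArrays.cu(rand(Float32, 8))
v = CuArrays.cu(rand(Float32, 8))

op = HvpOperator(floss, x)     # dualCache1 should come out as a CuArray of duals
hv = similar(x)
mul!(hv, op, v)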