How to compute `gradient.(f, w)` on GPU?

I can’t seem to figure out how to use a CuArray to compute the gradient of the same function f at many points x.

E.g.

using Zygote: gradient
f(x) = x^2

x = rand(100)
gradient.(f, x)

The above works fine because it’s CPU code.
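(Note that `gradient` returns a 1-tuple per call, so this gives a vector of tuples; `first.(gradient.(f, x))` extracts plain numbers.)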

But the code below doesn’t work:

using CuArrays
allowscalar(false)
x = CuArray(rand(100))

gradient.(f, x)
ERROR: MethodError: no method matching operands(::LLVM.Argument)
Closest candidates are:
  operands(::LLVM.MetadataAsValue) at C:\Users\RTX2080\.julia\packages\LLVM\ICZSf\src\core\metadata.jl:35
  operands(::LLVM.User) at C:\Users\RTX2080\.julia\packages\LLVM\ICZSf\src\core\value\user.jl:13
Stacktrace:
 [1] check_ir!(::CUDAnative.CompilerJob, ::Array{Tuple{String,Array{Base.StackTraces.StackFrame,1},Any},1}, ::LLVM.CallInst) at C:\Users\RTX2080\.julia\packages\CUDAnative\3Jwj2\src\compiler\validation.jl:212
 [2] check_ir!(::CUDAnative.CompilerJob, ::Array{Tuple{String,Array{Base.StackTraces.StackFrame,1},Any},1}, ::LLVM.Function) at C:\Users\RTX2080\.julia\packages\CUDAnative\3Jwj2\src\compiler\validation.jl:131
 [3] check_ir!(::CUDAnative.CompilerJob, ::Array{Tuple{String,Array{Base.StackTraces.StackFrame,1},Any},1}, ::LLVM.Module) at C:\Users\RTX2080\.julia\packages\CUDAnative\3Jwj2\src\compiler\validation.jl:122
 [4] check_ir(::CUDAnative.CompilerJob, ::LLVM.Module) at C:\Users\RTX2080\.julia\packages\CUDAnative\3Jwj2\src\compiler\validation.jl:111
 [5] macro expansion at C:\Users\RTX2080\.julia\packages\CUDAnative\3Jwj2\src\compiler\driver.jl:188 [inlined]
 [6] macro expansion at C:\Users\RTX2080\.julia\packages\TimerOutputs\ohPOH\src\TimerOutput.jl:197 [inlined]
 [7] #codegen#152(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.codegen), ::Symbol, ::CUDAnative.CompilerJob) at C:\Users\RTX2080\.julia\packages\CUDAnative\3Jwj2\src\compiler\driver.jl:186
 [8] #codegen at .\none:0 [inlined]
 [9] #compile#151(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.compile), ::Symbol, ::CUDAnative.CompilerJob) at C:\Users\RTX2080\.julia\packages\CUDAnative\3Jwj2\src\compiler\driver.jl:47
 [10] #compile at .\none:0 [inlined]
 [11] #compile#150 at C:\Users\RTX2080\.julia\packages\CUDAnative\3Jwj2\src\compiler\driver.jl:28 [inlined]
 [12] #compile at .\none:0 [inlined] (repeats 2 times)
 [13] macro expansion at C:\Users\RTX2080\.julia\packages\CUDAnative\3Jwj2\src\execution.jl:403 [inlined]
 [14] #cufunction#194(::Nothing, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CUDAnative.cufunction), ::GPUArrays.var"#25#26", ::Type{Tuple{CuArrays.CuKernelState,CUDAnative.CuDeviceArray{Tuple{Tracker.TrackedReal{Float64}},1,CUDAnative.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(gradient),Tuple{CUDAnative.CuRefValue{typeof(f)},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}}}) at C:\Users\RTX2080\.julia\packages\CUDAnative\3Jwj2\src\execution.jl:368
 [15] cufunction(::Function, ::Type) at C:\Users\RTX2080\.julia\packages\CUDAnative\3Jwj2\src\execution.jl:368
 [16] macro expansion at C:\Users\RTX2080\.julia\packages\CUDAnative\3Jwj2\src\execution.jl:176 [inlined]
 [17] macro expansion at .\gcutils.jl:91 [inlined]
 [18] macro expansion at C:\Users\RTX2080\.julia\packages\CUDAnative\3Jwj2\src\execution.jl:173 [inlined]
 [19] _gpu_call(::CuArrays.CuArrayBackend, ::Function, ::CuArray{Tuple{Tracker.TrackedReal{Float64}},1,Nothing}, ::Tuple{CuArray{Tuple{Tracker.TrackedReal{Float64}},1,Nothing},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(gradient),Tuple{Base.RefValue{typeof(f)},Base.Broadcast.Extruded{CuArray{Float64,1,Nothing},Tuple{Bool},Tuple{Int64}}}}}, ::Tuple{Tuple{Int64},Tuple{Int64}}) at C:\Users\RTX2080\.julia\packages\CuArrays\4ZX56\src\gpuarray_interface.jl:62
 [20] gpu_call(::Function, ::CuArray{Tuple{Tracker.TrackedReal{Float64}},1,Nothing}, ::Tuple{CuArray{Tuple{Tracker.TrackedReal{Float64}},1,Nothing},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(gradient),Tuple{Base.RefValue{typeof(f)},Base.Broadcast.Extruded{CuArray{Float64,1,Nothing},Tuple{Bool},Tuple{Int64}}}}}, ::Int64) at C:\Users\RTX2080\.julia\packages\GPUArrays\0lvhc\src\abstract_gpu_interface.jl:151
 [21] gpu_call at C:\Users\RTX2080\.julia\packages\GPUArrays\0lvhc\src\abstract_gpu_interface.jl:128 [inlined]
 [22] copyto! at C:\Users\RTX2080\.julia\packages\GPUArrays\0lvhc\src\broadcast.jl:48 [inlined]
 [23] copyto! at .\broadcast.jl:863 [inlined]
 [24] copy at .\broadcast.jl:839 [inlined]
 [25] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(gradient),Tuple{Base.RefValue{typeof(f)},CuArray{Float64,1,Nothing}}}) at .\broadcast.jl:819
 [26] top-level scope at REPL[59]:1

This should work using ForwardDiff. Whether that’s a good idea depends on your application.
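For a scalar function like this, a minimal sketch of the forward-mode approach (`ForwardDiff.derivative` broadcasts element-wise, and its `Dual` numbers are plain isbits structs, so the kernel can compile):

using CuArrays, ForwardDiff

f(x) = x^2
x = CuArray(rand(100))

# Forward-mode scalar derivative, broadcast over the GPU array.
# Functions act as scalars under broadcasting, so this fuses into one kernel.
dx = ForwardDiff.derivative.(f, x)

Reverse mode fails here because, as the stack trace shows, the broadcast tries to push Tracker.TrackedReal values through the kernel, which the GPU compiler can’t handle; forward-mode Dual numbers don’t have that problem.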

See the thread “This custom Zygote.jl adjoint is not giving me the speed up I expected and how to migrate to GPU?”

Can’t quite get this to work on the GPU unless I differentiate the function myself.
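For reference, a sketch of that manual route (with the derivative of f written out by hand):

using CuArrays

f(x) = x^2
df(x) = 2x               # derivative of f, written by hand

x = CuArray(rand(100))
dx = df.(x)              # plain broadcast; compiles to a single GPU kernel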