CUDA atomic add for ForwardDiff duals?

Hi everyone,

I’ve been trying to automatically differentiate some code that runs on the GPU using ForwardDiff. Some of the kernels perform an atomic addition via CUDA.@atomic, but that doesn’t seem to be supported for ForwardDiff.Dual types. Is there a way to define that behavior myself? Would it even be possible? The functionality I’m looking for is similar to what was offered by the old CUDAatomics code found here: alandion/CUDAatomics.jl (github.com), except that the partials field contains a ForwardDiff.Partials rather than just a Float32. If it’s possible but requires some work, I’m willing to look into how to do it and contribute if others are also interested, although it may take a while for me to learn the ropes since I’m still quite new to CUDA.jl.

Here’s a sample toy problem (which I am aware can be done without the atomic operation):

using ForwardDiff
using CUDA

function test_kernel!(A, x, pows, nmax)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= nmax
        #CUDA.@atomic A[1] += x[i]^pows[i]
        CUDA.atomic_add!(pointer(A, 1), x[i]^pows[i])
    end
    return
end


function objective_func(x::Vector{T}) where {T<:Real}
    nmax = 5
    x_gpu = cu(x)
    res = CUDA.zeros(T, 1)
    pows = CuArray{T}([1.0;2.0;3.0;4.0;5.0])
    nthreads = 5
    nbx = ceil(Int, nmax/nthreads)
    CUDA.@sync begin
        @cuda threads=nthreads blocks=nbx test_kernel!(res, x_gpu, pows, nmax)
    end

    return Array(res)[1]
end

x_test = ones(Float32, 5)
println("Function Eval with Floats:")
display(objective_func(x_test))
println("Gradient Eval with Duals:")
display(ForwardDiff.gradient(objective_func, x_test))

Executing the code produces the following error:

ERROR: InvalidIRError: compiling kernel #test_kernel!(CuDeviceVector{ForwardDiff.Dual{ForwardDiff.Tag{typeof(objective_func), Float32}, Float32, 5}, 1}, CuDeviceVector{ForwardDiff.Dual{ForwardDiff.Tag{typeof(objective_func), Float32}, Float32, 5}, 1}, CuDeviceVector{ForwardDiff.Dual{ForwardDiff.Tag{typeof(objective_func), Float32}, Float32, 5}, 1}, Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to atomic_add!)

Thanks!

CUDA.jl only supports atomic operations on primitive types, because we don’t really have support for interpreting aggregate types (like Dual) as a single bitstype, and it isn’t legal to split the atomic operation to act separately on the fields of the Dual. You can add the necessary definitions, of course:

using CUDA

# minimal Dual type
struct Dual{T}
    real::T
    dual::T
end
Base.one(::Type{Dual{T}}) where {T} = Dual{T}(1, 0)
Base.:(+)(x::Dual{T}, y::Dual{T}) where T = Dual{T}(x.real + y.real, x.dual + y.dual)

# approach 1: convince CUDA you can do an atomic addition of duals separately.
#             this is not technically correct as the addition isn't truly atomic,
#             but might suffice depending on the use case (e.g. global accumulation).
#             this will only work for element types that are supported by atomic_add!
@inline function CUDA.atomic_arrayset(A::AbstractArray{Dual{T}}, I::Integer, op::typeof(+),
                                      val::Dual{T}) where {T}
    real_ptr = pointer(reinterpret(T, A), (I-1)*2+1)
    CUDA.atomic_add!(real_ptr, val.real)
    dual_ptr = pointer(reinterpret(T, A), (I-1)*2+2)
    CUDA.atomic_add!(dual_ptr, val.dual)
end

# approach 2: convince CUDA you can do an atomic exchange of duals by casting to a bitstype.
#             this is better, because it will ensure the atomic operation happens atomically,
#             but relies on the atomic fallback mechanism (i.e. not an atomic addition, but
#             a compare-and-swap loop) and is more limited because it requires the widened
#             type to be supported by the hardware for CAS (64-bit CAS is only available
#             on sm_60 and higher, and 128-bit CAS isn't supported)
using Core: LLVMPtr
widen(::Type{Float32}) = Float64
@inline function CUDA.atomic_cas!(ptr::LLVMPtr{Dual{T},A}, cmp::Dual{T}, val::Dual{T}) where {T,A}
    U = widen(T)
    wide_ptr = reinterpret(LLVMPtr{U,A}, ptr)

    # XXX: this is JuliaLang/julia#43065 (type punning of aggregates)
    cmp_ref = Ref(cmp)
    wide_cmp = GC.@preserve cmp_ref begin
        cmp_ptr = Base.unsafe_convert(Ptr{Dual{T}}, cmp_ref)
        unsafe_load(reinterpret(Ptr{U}, cmp_ptr))
    end

    val_ref = Ref(val)
    wide_val = GC.@preserve val_ref begin
        val_ptr = Base.unsafe_convert(Ptr{Dual{T}}, val_ref)
        unsafe_load(reinterpret(Ptr{U}, val_ptr))
    end

    wide_ret = CUDA.atomic_cas!(wide_ptr, wide_cmp, wide_val)

    wide_ret_ref = Ref(wide_ret)
    GC.@preserve wide_ret_ref begin
        wide_ret_ptr = Base.unsafe_convert(Ptr{U}, wide_ret_ref)
        unsafe_load(reinterpret(Ptr{Dual{T}}, wide_ret_ptr))
    end
end

function kernel(A)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(A)
        CUDA.@atomic A[i] += one(eltype(A))
    end
    return
end

function main()
    @show A = cu([1,2])
    @cuda threads=length(A) kernel(A)
    @show A

    @show B = cu([Dual{Float32}(1,2),Dual{Float32}(3,4)])
    @cuda threads=length(B) kernel(B)
    @show B
end

Great! It makes sense that, in general, adding a Dual can't actually be done atomically unless the second approach is used, and that the second approach is limited to a small number of partials, if I understand correctly. This works and was super helpful. Thanks!
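
In case it helps anyone else, here is roughly how I adapted approach 1 to actual ForwardDiff.Dual values so that the commented-out CUDA.@atomic line in my toy kernel compiles. Treat this as a sketch rather than a vetted implementation: it assumes the Dual{Tag,V,N} is laid out as N+1 contiguous values of the value type V (the value followed by the partials), that V itself is supported by CUDA.atomic_add! (e.g. Float32), and that reinterpret on the device array works as in approach 1 above. As with approach 1, the update of the Dual as a whole is not atomic.

using CUDA
using ForwardDiff

# NOTE: sketch only. Assumes the Dual occupies N+1 contiguous slots of type V
# and that V is supported by atomic_add!; the Dual update as a whole is not atomic.
@inline function CUDA.atomic_arrayset(A::AbstractArray{ForwardDiff.Dual{Tag,V,N}}, I::Integer,
                                      op::typeof(+), val::ForwardDiff.Dual{Tag,V,N}) where {Tag,V,N}
    B = reinterpret(V, A)          # view the Dual array as a flat array of V
    base = (I - 1) * (N + 1)       # first slot of the I-th Dual
    CUDA.atomic_add!(pointer(B, base + 1), ForwardDiff.value(val))
    for n in 1:N                   # one atomic add per partial
        CUDA.atomic_add!(pointer(B, base + 1 + n), ForwardDiff.partials(val, n))
    end
    return
end

With a definition along these lines, CUDA.@atomic A[1] += x[i]^pows[i] in test_kernel! should compile for the Dual element types that ForwardDiff.gradient passes in, while the Float32 case keeps using the built-in hardware atomic.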