CUDA atomic add for ForwardDiff duals?

Hi everyone,

I’ve been trying to automatically differentiate some code that runs on the GPU using ForwardDiff. Some of the kernels perform an atomic addition via CUDA.@atomic, but that doesn’t seem to be supported for ForwardDiff.Dual types. Is there a way to define that behavior myself? Would it even be possible? The functionality I’m looking for is similar to what was offered by the old CUDAatomics code found here: alandion/CUDAatomics.jl (github.com), except that the partials field contains a ForwardDiff.Partials rather than just a Float32. If it’s possible but requires some work, I’m willing to look into how to do it and contribute if others are also interested, although it may take a while for me to learn the ropes since I’m still quite new to CUDA.jl.

Here’s a sample toy problem (which I am aware can be done without the atomic operation):

using ForwardDiff
using CUDA

function test_kernel!(A, x, pows, nmax)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= nmax
        #CUDA.@atomic A[1] += x[i]^pows[i]
        CUDA.atomic_add!(pointer(A, 1), x[i]^pows[i])
    end
    return
end


function objective_func(x::Vector{T}) where {T<:Real}
    nmax = 5
    x_gpu = cu(x)
    res = CUDA.zeros(T, 1)
    pows = CuArray{T}([1.0;2.0;3.0;4.0;5.0])
    nthreads = 5
    nbx = ceil(Int, nmax/nthreads)
    CUDA.@sync begin
        @cuda threads=nthreads blocks=nbx test_kernel!(res, x_gpu, pows, nmax)
    end

    return Array(res)[1]
end

x_test = ones(Float32, 5)
println("Function Eval with Floats:")
display(objective_func(x_test))
println("Gradient Eval with Duals:")
display(ForwardDiff.gradient(objective_func, x_test))

Executing the code produces the following error:

ERROR: InvalidIRError: compiling kernel #test_kernel!(CuDeviceVector{ForwardDiff.Dual{ForwardDiff.Tag{typeof(objective_func), Float32}, Float32, 5}, 1}, CuDeviceVector{ForwardDiff.Dual{ForwardDiff.Tag{typeof(objective_func), Float32}, Float32, 5}, 1}, CuDeviceVector{ForwardDiff.Dual{ForwardDiff.Tag{typeof(objective_func), Float32}, Float32, 5}, 1}, Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to atomic_add!)

Thanks!

CUDA.jl only supports atomic operations on primitive types, because we don’t really have support for interpreting aggregate types (like Dual) as a single bitstype, and it isn’t legal to split the atomic operation to act separately on the fields of the Dual. You can add the necessary definitions, of course:

using CUDA

# minimal Dual type
struct Dual{T}
    real::T
    dual::T
end
Base.one(::Type{Dual{T}}) where {T} = Dual{T}(1, 0)
Base.:(+)(x::Dual{T}, y::Dual{T}) where T = Dual{T}(x.real + y.real, x.dual + y.dual)

# approach 1: convince CUDA you can do an atomic addition of duals separately.
#             this is not technically correct as the addition isn't truly atomic,
#             but might suffice depending on the use case (e.g. global accumulation).
#             this will only work for element types that are supported by atomic_add!
@inline function CUDA.atomic_arrayset(A::AbstractArray{Dual{T}}, I::Integer, op::typeof(+),
                                      val::Dual{T}) where {T}
    real_ptr = pointer(reinterpret(T, A), (I-1)*2+1)
    CUDA.atomic_add!(real_ptr, val.real)
    dual_ptr = pointer(reinterpret(T, A), (I-1)*2+2)
    CUDA.atomic_add!(dual_ptr, val.dual)
end

# approach 2: convince CUDA you can do an atomic exchange of duals by casting to a bitstype.
#             this is better, because it will ensure the atomic operation happens atomically,
#             but relies on the atomic fallback mechanism (i.e. not an atomic addition, but
#             a compare-and-swap loop) and is more limited because it requires the widened
#             type to be supported by the hardware for CAS (64-bit CAS is only available
#             on sm_60 and higher, and 128-bit CAS isn't supported)
using Core: LLVMPtr
widen(::Type{Float32}) = Float64
@inline function CUDA.atomic_cas!(ptr::LLVMPtr{Dual{T},A}, cmp::Dual{T}, val::Dual{T}) where {T,A}
    U = widen(T)
    wide_ptr = reinterpret(LLVMPtr{U,A}, ptr)

    # XXX: this is JuliaLang/julia#43065 (type punning of aggregates)
    cmp_ref = Ref(cmp)
    wide_cmp = GC.@preserve cmp_ref begin
        cmp_ptr = Base.unsafe_convert(Ptr{Dual{T}}, cmp_ref)
        unsafe_load(reinterpret(Ptr{U}, cmp_ptr))
    end

    val_ref = Ref(val)
    wide_val = GC.@preserve val_ref begin
        val_ptr = Base.unsafe_convert(Ptr{Dual{T}}, val_ref)
        unsafe_load(reinterpret(Ptr{U}, val_ptr))
    end

    wide_ret = CUDA.atomic_cas!(wide_ptr, wide_cmp, wide_val)

    wide_ret_ref = Ref(wide_ret)
    GC.@preserve wide_ret_ref begin
        wide_ret_ptr = Base.unsafe_convert(Ptr{U}, wide_ret_ref)
        unsafe_load(reinterpret(Ptr{Dual{T}}, wide_ret_ptr))
    end
end

function kernel(A)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(A)
        CUDA.@atomic A[i] += one(eltype(A))
    end
    return
end

function main()
    @show A = cu([1,2])
    @cuda threads=length(A) kernel(A)
    @show A

    @show B = cu([Dual{Float32}(1,2),Dual{Float32}(3,4)])
    @cuda threads=length(B) kernel(B)
    @show B
end

Great! It makes sense that, in general, adding a Dual can't actually be done atomically unless the second approach is used, and that the second approach is limited to a small number of partials, if I understand correctly. This works and was super helpful. Thanks!
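
In case it helps anyone else, here is roughly how I adapted approach 1 to actual ForwardDiff.Dual values so that the commented-out CUDA.@atomic line in my toy kernel compiles. Treat this as a sketch rather than a vetted implementation: it assumes the Dual{Tag,V,N} is laid out as N+1 contiguous values of the value type V (the value followed by the partials), that V itself is supported by CUDA.atomic_add! (e.g. Float32), and that reinterpret on the device array works as in approach 1 above. As with approach 1, the update of the Dual as a whole is not atomic.

using CUDA
using ForwardDiff

# NOTE: sketch only. Assumes the Dual occupies N+1 contiguous slots of type V
# and that V is supported by atomic_add!; the Dual update as a whole is not atomic.
@inline function CUDA.atomic_arrayset(A::AbstractArray{ForwardDiff.Dual{Tag,V,N}}, I::Integer,
                                      op::typeof(+), val::ForwardDiff.Dual{Tag,V,N}) where {Tag,V,N}
    B = reinterpret(V, A)          # view the Dual array as a flat array of V
    base = (I - 1) * (N + 1)       # first slot of the I-th Dual
    CUDA.atomic_add!(pointer(B, base + 1), ForwardDiff.value(val))
    for n in 1:N                   # one atomic add per partial
        CUDA.atomic_add!(pointer(B, base + 1 + n), ForwardDiff.partials(val, n))
    end
    return
end

With a definition along these lines, CUDA.@atomic A[1] += x[i]^pows[i] in test_kernel! should compile for the Dual element types that ForwardDiff.gradient passes in, while the Float32 case keeps using the built-in hardware atomic.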