How to parallelize dual coordinate descent methods on GPU using CUDA.jl?

Now I have tried passing the colptr, rowval, and nzval fields of the SparseMatrixCSC to the kernel instead of passing sparse vectors.

using CUDA
using LinearAlgebra, SparseArrays

numsample = 10
numfeature = 5
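# store X transposed (numfeature × numsample) so the features of sample i form CSC column i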
X⊤ = sprandn(numfeature, numsample, 0.5)

X⊤_colptr = CuVector{Int32}(X⊤.colptr)
X⊤_rowval = CuVector{Int32}(X⊤.rowval)
X⊤_nzval = CuVector{Float32}(X⊤.nzval)
α = CuVector{Float32}(randn(numsample))
d = CuVector{Float32}(X⊤ * α)

function update!(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval)
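    # one thread per sample: thread i computes xᵢ⊤d, updates αᵢ, and scatters the change back into d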
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x

    # calc xᵢ⊤d
    xᵢ⊤d = 0.0
    from = X⊤_colptr[i]
    to = X⊤_colptr[i+1] - 1
    for elindex in from:to
        j = X⊤_rowval[elindex]
        Xᵢⱼ = X⊤_nzval[elindex]
        xᵢ⊤d += Xᵢⱼ * d[j]
    end

    # update αᵢ
    Δαᵢ = 0.1 * xᵢ⊤d    # dummy
    α[i] += Δαᵢ

    # update d
    for elindex in from:to
        j = X⊤_rowval[elindex]
        Xᵢⱼ = X⊤_nzval[elindex]
        @atomic d[j] += Δαᵢ * Xᵢⱼ
    end
end

@cuda threads=numsample update!(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval)

Without the @atomic macro, this code runs fine.
With the macro, however, the following error occurs:

InvalidIRError: compiling kernel update!(CuDeviceArray{Float32,1,CUDA.AS.Global}, CuDeviceArray{Float32,1,CUDA.AS.Global}, CuDeviceArray{Int32,1,CUDA.AS.Global}, CuDeviceArray{Int32,1,CUDA.AS.Global}, CuDeviceArray{Float32,1,CUDA.AS.Global}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to gpu_gc_pool_alloc)
Stacktrace:
 [1] atomic_arrayset at C:\Users\msekino\.julia\packages\CUDA\5t6R9\src\device\cuda\atomics.jl:464
 [2] macro expansion at C:\Users\msekino\.julia\packages\CUDA\5t6R9\src\device\cuda\atomics.jl:459
 [3] update! at In[34]:22
Reason: unsupported dynamic function invocation (call to _to_linear_index)
Stacktrace:
 [1] atomic_arrayset at C:\Users\msekino\.julia\packages\CUDA\5t6R9\src\device\cuda\atomics.jl:464
 [2] macro expansion at C:\Users\msekino\.julia\packages\CUDA\5t6R9\src\device\cuda\atomics.jl:459
 [3] update! at In[34]:22
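
One thing I am unsure about: xᵢ⊤d = 0.0 makes the accumulator (and hence Δαᵢ) a Float64, which is then atomically added into the Float32 array d, and I suspect this type mismatch might be what sends @atomic down the failing code path. Below is a sketch of the same kernel keeping everything in Float32 and using the qualified CUDA.@atomic, with a bounds check added; I have not verified that this avoids the error, and update_typed! is just a name I picked for the variant.

function update_typed!(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    i > length(α) && return            # guard against extra threads

    # calc xᵢ⊤d, accumulating in the element type of d (Float32)
    xᵢ⊤d = zero(eltype(d))
    from = X⊤_colptr[i]
    to = X⊤_colptr[i+1] - 1
    for elindex in from:to
        xᵢ⊤d += X⊤_nzval[elindex] * d[X⊤_rowval[elindex]]
    end

    # update αᵢ (Float32 literal keeps Δαᵢ a Float32)
    Δαᵢ = 0.1f0 * xᵢ⊤d    # dummy
    α[i] += Δαᵢ

    # update d with a Float32 value, matching eltype(d)
    for elindex in from:to
        CUDA.@atomic d[X⊤_rowval[elindex]] += Δαᵢ * X⊤_nzval[elindex]
    end
    return nothing
end
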
  • How can I use the @atomic macro appropriately here?
  • Is there a more efficient way to implement this? (For larger numsample I assume a multi-block launch would be needed; see the launch sketch below.)
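
For reference, the launch above uses a single block, so it only works while numsample fits into one block (at most 1024 threads per block). My understanding, not verified for this kernel, is that the occupancy API in CUDA.jl can choose the launch bounds once the kernel has a bounds check like i > length(α) && return:

kernel = @cuda launch=false update!(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval)
config = launch_configuration(kernel.fun)    # suggested threads per block for this kernel
threads = min(numsample, config.threads)     # no more threads per block than samples
blocks = cld(numsample, threads)             # enough blocks to cover all samples
kernel(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval; threads=threads, blocks=blocks)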