How to parallelize dual coordinate descent methods on GPU using CUDA.jl?

First, I ran the following on the CPU:

using CUDA, CUDA.CUSPARSE
using LinearAlgebra, SparseArrays

numvec = 10
lenvec = 5
xs = [sprandn(lenvec, 0.5) for i in 1:numvec]
w = randn(numvec)
d = Vector{Float64}(sum(w .* xs))

which gives

5-element Array{Float64,1}:
 -0.12287585509797144
  0.33370779975590414
  2.9808236024752204
 -1.055451750249576
 -2.7127608286874803
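For reference, the CPU sum above is equivalent to a single sparse matrix-vector product, which is the form that maps most naturally to the GPU (`hcat` stacks the sparse vectors into a `SparseMatrixCSC` whose columns are the `xs`):

```julia
using LinearAlgebra, SparseArrays

numvec, lenvec = 10, 5
xs = [sprandn(lenvec, 0.5) for i in 1:numvec]
w = randn(numvec)

# hcat builds a lenvec × numvec SparseMatrixCSC, so
# d = Σᵢ w[i] * xs[i] becomes one matrix-vector product.
X = hcat(xs...)
d = X * w

@assert d ≈ Vector{Float64}(sum(w .* xs))
```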

Next I tried:

x̃s = [CuSparseVector(x) for x in xs]
w̃ = CuVector(w)
d̃ = CuVector(zeros(lenvec))
for i in 1:numvec
    axpyi!(w̃[i], x̃s[i], d̃, 'O')
end
d̃

This gives the same result, but with a warning about scalar indexing: `w̃[i]` reads a single element of a GPU array back on the CPU in every iteration.
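The scalar-indexing warning can be silenced by keeping the coefficients on the host while the sparse vectors stay on the device (a minimal sketch continuing from the snippet above; `axpyi!` still issues one CUSPARSE call per vector, so this is sequential, just warning-free):

```julia
# continues from the previous snippet: x̃s, w̃, numvec, lenvec in scope
d̃ = CuVector(zeros(lenvec))
ws = Array(w̃)                      # copy coefficients to the host once
for i in 1:numvec
    axpyi!(ws[i], x̃s[i], d̃, 'O')  # ws[i] is a plain Float64, no GPU indexing
end
```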
Then I tried writing a kernel:

function addto!(d, w, xs::Vector{CuSparseVector{Float64}})
    # intended: one thread per sparse vector
    i = threadIdx().x
    axpyi!(w[i], xs[i], d, 'O')
    nothing
end

d̃ = CuVector(zeros(lenvec))
@cuda threads=10 addto!(d̃, w̃, x̃s)

but this fails with KernelError: kernel returns a value of type Union{}. How can this be parallelized properly?
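The Union{} return type presumably arises because the kernel body can never return normally: axpyi! is a host-side CUSPARSE wrapper, not device code, and a Vector of CuSparseVectors is not an isbits value that @cuda can pass to a kernel. One way around both problems is to reformulate the reduction as a single sparse matrix-vector product, which CUSPARSE parallelizes internally. A sketch, assuming your CUDA.jl version supports constructing a CuSparseMatrixCSR from a SparseMatrixCSC and multiplying it with a CuVector:

```julia
using CUDA, CUDA.CUSPARSE
using LinearAlgebra, SparseArrays

numvec, lenvec = 10, 5
xs = [sprandn(lenvec, 0.5) for i in 1:numvec]
w = randn(numvec)

# One sparse matrix holds all the vectors as columns;
# d = X * w replaces the whole axpyi! loop.
X̃ = CuSparseMatrixCSR(sparse(hcat(xs...)))
w̃ = CuVector(w)
d̃ = X̃ * w̃              # single CUSPARSE matrix-vector product on the GPU

@assert Array(d̃) ≈ Vector(hcat(xs...) * w)
```

This keeps all the parallelism inside one library call instead of trying to call host wrappers from inside a hand-written kernel.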