How to parallelize dual coordinate descent methods on GPU using CUDA.jl?

I want to parallelize something like the following dual coordinate descent method.
I’m trying to use CUDA.jl, but I’ve been stuck for days…

using LinearAlgebra, SparseArrays

# X = (x[1], ..., x[numvec]): a collection of sparse feature vectors
numvec = 500000
lenvec = 50000
xs = [sprandn(lenvec, 0.01) for i in 1:numvec]

# w is the parameter vector I want to optimize.
w = randn(numvec)

# d = X * w = ∑ᵢ w[i] xs[i] is an intermediate variable cached to reduce computational cost.
d = Vector{Float64}(sum(w .* xs))

# Optimization loop (this is the part I want to parallelize on the GPU)
for iter = 1:numiter
    for i in 1:numvec
        update!(i, w, d, xs[i])
    end
end

"""
Update w and d
"""
function update!(i::Integer, w::Vector, d::Vector, xᵢ::SparseVector)
    xᵢ⊤d = dot(xᵢ, d)
    Δwᵢ = grad(i, w[i], xᵢ⊤d)
    w[i] += Δwᵢ
    @. d += Δwᵢ * xᵢ
end
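
Note that grad and numiter are not defined above: grad is the model-specific coordinate step and numiter the number of passes over the data. A minimal, purely illustrative stand-in (define it before running the loop) could be:

# Dummy placeholders only; the real grad depends on the model being fitted.
grad(i, wᵢ, xᵢ⊤d) = -0.01 * (wᵢ + xᵢ⊤d)   # illustrative coordinate step, not a real gradient
numiter = 10                               # number of passes over the data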

Please note that reading and updating d should be atomic.
Is it possible to parallelize this on the GPU and get a speedup?
If so, example code would be very helpful. Thank you.

Best regards.


First, I ran the following on the CPU:

using CUDA, CUDA.CUSPARSE
using LinearAlgebra, SparseArrays

numvec = 10
lenvec = 5
xs = [sprandn(lenvec, 0.5) for i in 1:numvec]
w = randn(numvec)
d = Vector{Float64}(sum(w .* xs))

which gives

5-element Array{Float64,1}:
 -0.12287585509797144
  0.33370779975590414
  2.9808236024752204
 -1.055451750249576
 -2.7127608286874803

Next I tried

x̃s = [CuSparseVector(x) for x in xs]
w̃ = CuVector(w)
d̃ = CuVector(zeros(lenvec))
for i in 1:numvec
    axpyi!(w̃[i], x̃s[i], d̃, 'O')
end
d̃

which gives the same result, but with a warning about scalar operations…
Then I tried

function addto!(d, w, xs::Vector{CuSparseVector{Float64}})
    i = threadIdx().x
    axpyi!(w[i], xs[i], d, 'O')
    nothing
end

d̃ = CuVector(zeros(lenvec))
@cuda threads=10 addto!(d̃, w̃, x̃s)

which results in KernelError: kernel returns a value of type Union{}.

You’re indexing a CuArray yourself with w̃[i]; I’m not sure what you expect there.

The actual error tells you what the problem is and how to debug it: if the returned value is of type Union{}, your Julia code probably throws an exception. You are also passing a CPU vector (the Vector of CuSparseVectors) to a GPU kernel, which will never work. Neither will passing sparse vectors – the CuSparse* types are an abstraction for working with the CUSPARSE library, and cannot be passed to kernels.
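
As a sketch of the library route (assuming the xᵢ are stacked as columns of one sparse matrix; X_gpu, w_gpu, d_gpu are illustrative names), the accumulation d = ∑ᵢ w[i] xᵢ can be expressed as a sparse matrix–vector product and left to CUSPARSE, with no hand-written kernel:

using CUDA, CUDA.CUSPARSE
using LinearAlgebra, SparseArrays

numvec, lenvec = 10, 5
X = hcat([sprandn(lenvec, 0.5) for _ in 1:numvec]...)   # columns are the xᵢ
w = randn(numvec)

X_gpu = CuSparseMatrixCSR(X)   # GPU sparse matrix handled by the CUSPARSE library
w_gpu = CuVector(w)
d_gpu = X_gpu * w_gpu          # library matvec: d = X * w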


Ah! Of course the warning occurs, because I was indexing a CuArray.
I only wanted to confirm the usage of axpyi! with that code.

Oh, I see… I have to reconsider the algorithm for parallelizing on the GPU.

Thank you!

Now I have tried passing the colptr, rowval and nzval fields of a SparseMatrixCSC instead of passing sparse vectors.

using CUDA
using LinearAlgebra, SparseArrays

numsample = 10
numfeature = 5
X⊤ = sprandn(numfeature, numsample, 0.5)

X⊤_colptr = CuVector{Int32}(X⊤.colptr)
X⊤_rowval = CuVector{Int32}(X⊤.rowval)
X⊤_nzval = CuVector{Float32}(X⊤.nzval)
α = CuVector{Float32}(randn(numsample))
d = CuVector{Float32}(X⊤ * α)

function update!(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x

    # calc xᵢ⊤d
    xᵢ⊤d = 0.0
    from = X⊤_colptr[i]
    to = X⊤_colptr[i+1] - 1
    for elindex in from:to
        j = X⊤_rowval[elindex]
        Xᵢⱼ = X⊤_nzval[elindex]
        xᵢ⊤d += Xᵢⱼ * d[j]
    end

    # update αᵢ
    Δαᵢ = 0.1 * xᵢ⊤d    # dummy
    α[i] += Δαᵢ

    # update d
    for elindex in from:to
        j = X⊤_rowval[elindex]
        Xᵢⱼ = X⊤_nzval[elindex]
        @atomic d[j] += Δαᵢ * Xᵢⱼ
    end
end

@cuda threads=numsample update!(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval)

Without the @atomic macro, I can run this code.
With the macro, however, the following error occurs:

InvalidIRError: compiling kernel update!(CuDeviceArray{Float32,1,CUDA.AS.Global}, CuDeviceArray{Float32,1,CUDA.AS.Global}, CuDeviceArray{Int32,1,CUDA.AS.Global}, CuDeviceArray{Int32,1,CUDA.AS.Global}, CuDeviceArray{Float32,1,CUDA.AS.Global}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to gpu_gc_pool_alloc)
Stacktrace:
 [1] atomic_arrayset at C:\Users\msekino\.julia\packages\CUDA\5t6R9\src\device\cuda\atomics.jl:464
 [2] macro expansion at C:\Users\msekino\.julia\packages\CUDA\5t6R9\src\device\cuda\atomics.jl:459
 [3] update! at In[34]:22
Reason: unsupported dynamic function invocation (call to _to_linear_index)
Stacktrace:
 [1] atomic_arrayset at C:\Users\msekino\.julia\packages\CUDA\5t6R9\src\device\cuda\atomics.jl:464
 [2] macro expansion at C:\Users\msekino\.julia\packages\CUDA\5t6R9\src\device\cuda\atomics.jl:459
 [3] update! at In[34]:22
  • How can I use the @atomic macro appropriately?
  • Is there a more efficient way to implement this?

The @atomic macro bypasses some of the getindex functionality, and is a little fragile as a result. In this case, it first fails because you’re indexing with an Int32, and then because you’re adding a Float64 value to a Float32 array. Using @atomic d[Int(j)] += Float32(Δαᵢ * Xᵢⱼ) works around both.
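
Applied to your kernel, a sketch could look like the following; keeping the accumulator in Float32 (0.0f0) is an extra suggestion on my part to avoid the Float64 promotion in the first place, and the Float32(...) wrap then matches the workaround above:

function update!(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x

    # calc xᵢ⊤d with a Float32 accumulator
    xᵢ⊤d = 0.0f0
    from = Int(X⊤_colptr[i])
    to   = Int(X⊤_colptr[i+1]) - 1
    for elindex in from:to
        j = Int(X⊤_rowval[elindex])      # convert the Int32 index for @atomic
        xᵢ⊤d += X⊤_nzval[elindex] * d[j]
    end

    # update αᵢ (still the dummy step from above)
    Δαᵢ = 0.1f0 * xᵢ⊤d
    α[i] += Δαᵢ

    # update d atomically, converting the value as suggested
    for elindex in from:to
        j = Int(X⊤_rowval[elindex])
        @atomic d[j] += Float32(Δαᵢ * X⊤_nzval[elindex])
    end
    return nothing
end

It can be launched as before with @cuda threads=numsample update!(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval).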


Thank you so much, it works!
I will try it on the real data X and compare the running time with the CPU version!

@maleadt I’m sorry, I was too quick to mark that answer as the solution, but I’m stuck again…

I implemented the code for my own model (i.e. the dummy line Δαᵢ = 0.1 * xᵢ⊤d is replaced with the actual update equation). When the code is executed on the CPU, the result is fine. However, when it is executed on the GPU with blocks=Int(floor(numsample/32)) threads=32 instead of threads=numsample, the numerical result blows up to extremely large values…

Now I would like to ask the following:

  • Can d::CuVector be shared by all blocks and threads?
    • I expected reading d (at xᵢ⊤d += Xᵢⱼ * d[j]) to be an atomic read, so that while one thread is reading d[j], no other block or thread can alter or corrupt the value being read.
    • I expected the atomic update of d to be visible to all blocks and threads immediately after the update.
  • Or is d::CuVector just a host representation, so that d is no longer shared once it has been copied to GPU memory? If so, is there any way to realize my expectations?

Your assignment to d might happen after some threads have already read their value. You need calls to sync_threads to prevent that. Do note that this is only possible when you launch a single block – with multiple blocks it is impossible to synchronize threads – which isn’t ideal for performance.
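
A minimal sketch of that, assuming numsample threads fit in a single block; note that separating the read and write phases with sync_threads means every thread reads the same snapshot of d per pass, which changes the update order compared with sequential coordinate descent:

function update_sync!(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval)
    i = threadIdx().x                     # single block: thread index = sample index

    # phase 1: every thread reads d
    xᵢ⊤d = 0.0f0
    for elindex in Int(X⊤_colptr[i]):Int(X⊤_colptr[i+1])-1
        xᵢ⊤d += X⊤_nzval[elindex] * d[Int(X⊤_rowval[elindex])]
    end
    sync_threads()                        # make sure all reads of d have finished

    # phase 2: every thread adds its contribution to d
    Δαᵢ = 0.1f0 * xᵢ⊤d                    # dummy step, as before
    α[i] += Δαᵢ
    for elindex in Int(X⊤_colptr[i]):Int(X⊤_colptr[i+1])-1
        @atomic d[Int(X⊤_rowval[elindex])] += Float32(Δαᵢ * X⊤_nzval[elindex])
    end
    return nothing
end

# one pass per launch; successive launches on the same stream are ordered
for iter in 1:numiter   # numiter passes, as in the CPU version
    @cuda threads=numsample blocks=1 update_sync!(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval)
end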


Thank you so much again!

The algorithm works, and the objective function monotonically decreases. However, the code ended up being slower than the CPU version… I’ve come to think that this algorithm is not well suited to GPUs.

Even so, this was my first time using CUDA and it was a great learning experience. I will try again with other algorithms.

Thanks again!

You will need to optimize a little for the GPU’s architecture. For example, your kernel seems to load a lot of identical values in each thread; it is better to load those once and cache them in shared memory. Also be sure to use @inbounds where possible, as bounds-checking branches are much more expensive on the GPU.
Generally, only a few embarrassingly parallel algorithms get easy speed-ups when parallelized like that, and even then it often depends on the memory pressure and arithmetic intensity. Beyond that, you will need to optimize for the architecture.
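
As an illustration, a sketch that caches d in shared memory for the read phase. This assumes numfeature (the length of d) is small enough to fit in shared memory and that dynamic shared memory is requested at launch; write-back of updates to the cached copy is omitted, and newer CUDA.jl versions spell the macro CuDynamicSharedArray instead of @cuDynamicSharedMem:

function update_shared!(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval, numfeature)
    i = threadIdx().x
    d_sh = @cuDynamicSharedMem(Float32, numfeature)

    # cooperatively copy d into shared memory, one stride per thread
    for j in i:blockDim().x:numfeature
        @inbounds d_sh[j] = d[j]
    end
    sync_threads()

    # read the cached copy instead of global memory
    xᵢ⊤d = 0.0f0
    @inbounds for elindex in Int(X⊤_colptr[i]):Int(X⊤_colptr[i+1])-1
        xᵢ⊤d += X⊤_nzval[elindex] * d_sh[Int(X⊤_rowval[elindex])]
    end

    Δαᵢ = 0.1f0 * xᵢ⊤d                  # dummy step; real update and write-back omitted
    @inbounds α[i] += Δαᵢ
    return nothing
end

# the dynamic shared memory size must be passed at launch:
@cuda threads=numsample blocks=1 shmem=numfeature*sizeof(Float32) update_shared!(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval, numfeature)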

Thanks! I’ve now tried @inbounds according to your advice; next I will try shared memory. I understand now that GPU-specific optimization is necessary to get the true performance out of GPUs…