How to parallerize dual coordinet descent mehods on GPU using CUDA.jl?

msekino · May 28, 2020, 3:15pm

Now I want to parallerize something like the following dual coordinate descent method.
I’m trying to use CUDA.jl, however I’m stuck for days…

# X = (x[1], ..., x[numvec])
numvec = 500000
lenvec = 50000
xs = [sprandn(lenvec, 0.01) for i in 1:numvec]

# w is the parameter which I want to optimize.
w = randn(numvec)

# d = X * w = ∑ᵢ w[i] xs[i] is the intermediate variable for reducing computational cost.
d = Vector{Float64}(sum(w .* xs))

# Optimization loop (What I want to parallelize on GPU)
for iter = 1:numiter
    for i in 1:numvec
        update!(i, w, d, xs[i])
    end
end

"""
Update w and d
"""
function update!(i::Integer, w::Vector, d::Vector, xᵢ::SparseVector)
    xᵢ⊤d = dot(xᵢ, d)
    Δwᵢ = grad(i, w[i], xᵢ⊤d)
    w[i] += Δwᵢ
    @. d += Δwᵢ * xᵢ
end

Please note that reading and updating ‘d’ should be atomic.
Is it possible to parallerize on GPU and realize speedup?
If so, an example code would be so helpful, thank you.

Best regards.

msekino · May 29, 2020, 1:00am

At first, I run the following on CPU

using CUDA, CUDA.CUSPARSE
using LinearAlgebra, SparseArrays

numvec = 10
lenvec = 5
xs = [sprandn(lenvec, 0.5) for i in 1:numvec]
w = randn(numvec)
d = Vector{Float64}(sum(w .* xs))

obtains

5-element Array{Float64,1}:
 -0.12287585509797144
  0.33370779975590414
  2.9808236024752204
 -1.055451750249576
 -2.7127608286874803

Next I tried

x̃s = [CuSparseVector(x) for x in xs]
w̃ = CuVector(w)
d̃ = CuVector(zeros(lenvec))
for i in 1:numvec
    axpyi!(w̃[i], x̃s[i], d̃, 'O')
end
d̃

obtains the same result with warning for scalar operations…
I tried

function addto!(d, w, xs::Vector{CuSparseVector{Float64}})
    i = threadIdx().x
    axpyi!(w[i], xs[i], d, 'O')
    nothing
end

d̃ = CuVector(zeros(lenvec))
@cuda threads=10 addto!(d̃, w̃, x̃s)

results in KernelError: kernel returns a value of type Union{}.

maleadt · May 29, 2020, 9:03am

You’re indexing a CuArray yourself, w̃[i], not sure what you expect.

The actual error tells you what the problem is and how to debug: If the returned value is of type Union{}, your Julia code probably throws an exception. You are also passing a CPU vector to a GPU kernel, that will never work. Neither will passing sparse vectors – CuSparseArray is an abstraction to work with the CUSPARSE library, and cannot be passed to kernels.

msekino · May 29, 2020, 10:19am

Ah! Warning should occur, of course because I was indexing a CuArray.
I wanted to confirm the usage of the axpyi! using that code.

Oh, I see… I have to reconsider the algorithm for parallelizing on the GPU.

Thank you!

msekino · May 29, 2020, 4:17pm

Now I tried to pass colptr, rowval and nzval of SparseMatrixCSC instead of passing sparse vectors.

using CUDA
using LinearAlgebra, SparseArrays

numsample = 10
numfeature = 5
X⊤ = sprandn(numfeature, numsample, 0.5)

X⊤_colptr = CuVector{Int32}(X⊤.colptr)
X⊤_rowval = CuVector{Int32}(X⊤.rowval)
X⊤_nzval = CuVector{Float32}(X⊤.nzval)
α = CuVector{Float32}(randn(numsample))
d = CuVector{Float32}(X⊤ * α)

function update!(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x

    # calc xᵢ⊤d
    xᵢ⊤d = 0.0
    from = X⊤_colptr[i]
    to = X⊤_colptr[i+1] - 1
    for elindex in from:to
        j = X⊤_rowval[elindex]
        Xᵢⱼ = X⊤_nzval[elindex]
        xᵢ⊤d += Xᵢⱼ * d[j]
    end

    # update αᵢ
    Δαᵢ = 0.1 * xᵢ⊤d    # dummy
    α[i] += Δαᵢ

    # update d
    for elindex in from:to
        j = X⊤_rowval[elindex]
        Xᵢⱼ = X⊤_nzval[elindex]
        @atomic d[j] += Δαᵢ * Xᵢⱼ
    end
end

@cuda threads=numsample update!(α, d, X⊤_colptr, X⊤_rowval, X⊤_nzval)

Without @atomic macro, I could run this code.
However with the macro, the following error occurs.

InvalidIRError: compiling kernel update!(CuDeviceArray{Float32,1,CUDA.AS.Global}, CuDeviceArray{Float32,1,CUDA.AS.Global}, CuDeviceArray{Int32,1,CUDA.AS.Global}, CuDeviceArray{Int32,1,CUDA.AS.Global}, CuDeviceArray{Float32,1,CUDA.AS.Global}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to gpu_gc_pool_alloc)
Stacktrace:
 [1] atomic_arrayset at C:\Users\msekino\.julia\packages\CUDA\5t6R9\src\device\cuda\atomics.jl:464
 [2] macro expansion at C:\Users\msekino\.julia\packages\CUDA\5t6R9\src\device\cuda\atomics.jl:459
 [3] update! at In[34]:22
Reason: unsupported dynamic function invocation (call to _to_linear_index)
Stacktrace:
 [1] atomic_arrayset at C:\Users\msekino\.julia\packages\CUDA\5t6R9\src\device\cuda\atomics.jl:464
 [2] macro expansion at C:\Users\msekino\.julia\packages\CUDA\5t6R9\src\device\cuda\atomics.jl:459
 [3] update! at In[34]:22

How can I use @atomic macro appropriately?
Is there any other way of more efficient implementation?

maleadt · May 29, 2020, 6:42pm

The @atomic macro bypasses some of the getindex functionality, and is a little fragile as a result. In this case, it fails because you’re indexing with an In32. Then it fails because you’re adding a Float32 value to a Float64 array. Using @atomic d[Int(j)] += Float32(Δαᵢ * Xᵢⱼ) works around both.

msekino · May 29, 2020, 10:56pm

Thank you so much, it works!
I will try it on the real data X and compare the running time with CPU!

msekino · May 31, 2020, 3:53pm

@maleadt I’m sorry, I’ve too early to select the answer as the solution, but I’m stuck again…

I implemented the code for my own model (i.e. Δαᵢ = 0.1 * xᵢ⊤d # dummy is replaced with an update equation). When the code executed on CPU, the result was fine. On the other hand, when the code executed on GPU with blocks=Int(floor(numsample/32)) threads=32 instead of threads=numsample, the numerical result went to extremely large values…

Now I want to ask the followings.

Can d::CuVector be shared by all blocks and threads?
- I expected that the reading d (at xᵢ⊤d += Xᵢⱼ * d[j]) is an atomic read. Therefore while one thread reading d[j], all other blocks and threads cannot alter or corrupt the value d[j] which the other threads are reading.
- I expected that the atomic updating d is reflected on all blocks and threads immediately after the update.
Or, d::CuVector is just a host representation, therefore d is no longer shared after copied to GPU memory. If so, are there any way to realize my expectations?

maleadt · June 1, 2020, 6:44am

Your assignment to d might happen after some threads have read their value. You need calls to synchronize_thread to prevent that. Do note that’s only possible because you’re launching a single block – when using multiple blocks it is impossible to synchronize threads – which isn’t ideal for performance.

msekino · June 1, 2020, 12:57pm

Thank you so much again!

The algorithm works, and the objective function monotonically descreases. However, the code ended up being slower than the CPU version… I’ve come to think that this algorithm is not suitable for GPUs.

Even so, because this was my first experience with CUDA, it was a great learning experience for me. I will try again with other algorithms.

Thanks, again!

maleadt · June 1, 2020, 1:34pm

You will need to optimize a little given the GPUs architecture, e.g., here your kernel seems to be loading a lot of identical values in each thread, which you better load only from in a single thread and cache in shared memory, Also be sure to use @inbounds where possible, as bounds checking branches are much more expensive on the GPU.
Generally, only few embarrassingly parallel algorithms get easy speed-ups when parallelizing them like that, and even then it often depends on the memory pressure and arithmetic intensity. Beyond that, you will need to optimize for the architecture.

msekino · June 1, 2020, 2:11pm

Thanks! Now I’ve tried @inbounds according to your advice., next I will try shared memory. Now I understand that GPU specific optimization is necessary to get the true performance out of GPUs…

Topic		Replies	Views
Problem with GPU programming GPU cudanative , cuda	4	1068	September 13, 2019
Performance of kernel function GPU	3	463	November 28, 2019
CUDAnative use multiple GPUs GPU gpu , cudanative , parallel	5	1787	March 24, 2018
Accessing array elements too slow? GPU	10	615	April 23, 2021
CUDA.jl - Atomic Addition Error To A Sparse Array Inside CUDA Kernel GPU cuda , error , memory-allocation , sparse , atomic	4	721	March 31, 2022

How to parallerize dual coordinet descent mehods on GPU using CUDA.jl?

Related topics