How to parallelize dual coordinate descent methods on GPU using CUDA.jl?

@maleadt I’m sorry, I was too quick to mark your answer as the solution, but I’m stuck again…

I implemented the code for my own model (i.e. the dummy rule Δαᵢ = 0.1 * xᵢ⊤d is replaced with my actual update equation). When the code was executed on the CPU, the result was fine. However, when it was executed on the GPU with blocks=Int(floor(numsample/32)) threads=32 instead of threads=numsample, the numerical results blew up to extremely large values…
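For reference, here is a minimal sketch of the kind of kernel I mean. It is not my exact code: the write-back of Δα into d, the Float32 literals, and the argument names are illustrative assumptions, but the read xᵢ⊤d += Xᵢⱼ * d[j] and the two launch configurations are the ones I described above.

```julia
# Illustrative sketch only -- not my exact kernel. The write-back of Δα into d
# and the Float32 types are assumptions made for this minimal example.
using CUDA

function dcd_kernel!(α, d, X, numsample, numfeature)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= numsample
        # xᵢ⊤d : plain reads of the (supposedly) shared vector d
        xd = 0.0f0
        for j in 1:numfeature
            xd += X[i, j] * d[j]
        end
        Δα = 0.1f0 * xd              # dummy update rule
        α[i] += Δα
        # write the change back into d while other blocks may also be
        # reading and writing d concurrently
        for j in 1:numfeature
            CUDA.@atomic d[j] += Δα * X[i, j]
        end
    end
    return nothing
end

# works fine (single block):
#   @cuda threads=numsample dcd_kernel!(α, d, X, numsample, numfeature)
# blows up numerically (multiple blocks):
#   @cuda blocks=Int(floor(numsample/32)) threads=32 dcd_kernel!(α, d, X, numsample, numfeature)
```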

Now I would like to ask the following.

  • Can d::CuVector be shared by all blocks and threads? (see also the small check sketched after this list)
    • I expected that reading d (at xᵢ⊤d += Xᵢⱼ * d[j]) is an atomic read, so that while one thread is reading d[j], no other block or thread can alter or corrupt the value d[j] being read.
    • I expected that an atomic update to d is reflected in all blocks and threads immediately after the update.
  • Or is d::CuVector just a host-side representation, so that d is no longer shared once it has been copied to GPU memory? If so, is there any way to realize my expectations?
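To make precise what I mean by “shared”, here is a tiny self-contained check I imagine running (the kernel name incr_kernel! and the launch sizes are just for illustration): every thread atomically increments d[1], and if d is one device array visible to all blocks, the final value should equal blocks * threads.

```julia
# Tiny check of what I mean by "shared": every thread atomically increments
# d[1]; if d is a single device array visible to all blocks, the result
# should be blocks * threads.
using CUDA

function incr_kernel!(d)
    CUDA.@atomic d[1] += 1.0f0
    return nothing
end

d = CUDA.zeros(Float32, 1)
@cuda blocks=4 threads=32 incr_kernel!(d)
Array(d)[1] == 4 * 32   # I expect this to be true if d is truly shared
```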