@maleadt I’m sorry, I was too quick to mark the answer as the solution, but I’m stuck again…
I implemented the code for my own model (i.e. `Δαᵢ = 0.1 * xᵢ⊤d # dummy` is replaced with my actual update equation). When the code was executed on the CPU, the result was fine. On the other hand, when it was executed on the GPU with `blocks=Int(floor(numsample/32))` and `threads=32` instead of `threads=numsample`, the numerical result blew up to extremely large values…
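For reference, the launch looks roughly like this (a simplified sketch, not my real model: `kernel!`, `X`, `d`, and `Δα` are stand-ins, and the data here is random):

```julia
using CUDA

# stand-in data (my real X, d, Δα come from my model)
numsample, numfeature = 1024, 16
X  = CUDA.rand(Float32, numsample, numfeature)
d  = CUDA.rand(Float32, numfeature)
Δα = CUDA.zeros(Float32, numsample)

function kernel!(Δα, X, d, numsample, numfeature)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x  # one sample per thread
    if i <= numsample
        xd = 0.0f0
        for j in 1:numfeature
            xd += X[i, j] * d[j]    # reads the shared d
        end
        Δα[i] = 0.1f0 * xd          # dummy update, as in my earlier post
    end
    return nothing
end

threads = 32
blocks  = Int(floor(numsample / 32))   # instead of threads = numsample
@cuda blocks=blocks threads=threads kernel!(Δα, X, d, numsample, numfeature)
```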
Now I would like to ask the following:
- Can `d::CuVector` be shared by all blocks and threads?
  - I expected that reading `d` (at `xᵢ⊤d += Xᵢⱼ * d[j]`) is an atomic read, so that while one thread is reading `d[j]`, no other block or thread can alter or corrupt the value of `d[j]` being read.
  - I expected that an atomic update of `d` is reflected in all blocks and threads immediately after the update (see the kernel sketch after this list for what I mean concretely).
- Or is `d::CuVector` just a host-side representation, so that `d` is no longer shared once it has been copied to GPU memory? If so, is there any way to realize my expectations?
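For concreteness, here is a stripped-down, hypothetical sketch of the kind of read and atomic update I have in mind (not my real model; `CUDA.@atomic` is just my guess at how the atomic update of `d` would be written):

```julia
using CUDA

# Hypothetical fragment, only to pin down what I mean by "reading d"
# and "atomically updating d"; names are made up.
function update_kernel!(d, Δα, X, numsample, numfeature)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= numsample
        xd = 0.0f0
        for j in 1:numfeature
            xd += X[i, j] * d[j]                   # the read I hoped could not be corrupted
        end
        for j in 1:numfeature
            CUDA.@atomic d[j] += Δα[i] * X[i, j]   # the update I hoped all blocks see immediately
        end
    end
    return nothing
end
```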