Can anyone point me to examples that include atomic operations using CUDA.jl?
I am having some issues trying to figure out the syntax for pointers, and examples would be helpful.
It’s always useful to look at the source code or tests when there’s no documentation, e.g. https://github.com/JuliaGPU/CUDA.jl/blob/9988e30fee4aab07576e24fe630594d4c30a2f32/src/indexing.jl#L105. The @atomic macro supports a very limited number of common operations, and you can always use the underlying intrinsics (atomic_add!, etc.).
Base Julia is expected to improve its support for (extensible) atomics, so the current CUDA.jl atomics interface won’t be extensively documented or developed until that happens.
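To make that concrete, here’s a minimal sketch of @atomic inside a hand-written kernel; the kernel name and setup below are purely illustrative, not something from CUDA.jl’s tests:

```julia
using CUDA

# Each thread atomically bumps a single global counter; @atomic turns the
# update into an atomic read-modify-write, so concurrent threads don't race.
function count_positive!(counter, xs)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(xs) && xs[i] > 0f0
        CUDA.@atomic counter[1] += Int32(1)
    end
    return nothing
end

xs = CUDA.rand(1024)
counter = CUDA.zeros(Int32, 1)
@cuda threads=256 blocks=cld(length(xs), 256) count_positive!(counter, xs)
@assert Array(counter)[1] == count(>(0f0), Array(xs))
```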
I think I have basic atomic addition working properly now, but my real goal is to implement a sum reduction, so I would like to do a shuffle to sum the values across a warp and then do a single atomic add per warp. I’ve been trying to follow the sample code linked from the NVIDIA Developer Blog post “High-Performance GPU Computing in the Julia Programming Language” (which looked to be exactly on point), but the details depend on functions defined in CUDAdrv.jl and CUDAnative.jl that are no longer available in CUDA.jl. Any pointers to up-to-date shuffle examples? (And yes, an updated version of the reduction code from the blog post would be ideal…)
What about our mapreduce implementation: https://github.com/JuliaGPU/CUDA.jl/blob/afaec8e0b2a89e09e65f8977c1312b8846c561ed/src/mapreduce.jl
CUDAdrv/CUDAnative/… have been merged into CUDA.jl, and no functionality should have been removed (only possibly renamed, e.g. shfl_down to shfl_down_sync).
The link to mapreduce is helpful, but I quickly hit a snag. While I have significant background with CUDA, I am not a computer scientist. To really know how to make use of such code, I find it very helpful to have access to samples that actually call the relevant functions.
In this case, I can get reduce_warp to do something useful, but reduce_block has additional arguments (neutral and shuffle), and it would be very helpful to see actual calls to reduce_block so I can see what appropriate values for those arguments look like.
If reduce_warp and reduce_block are not meant to be exposed to the user and I should be using partial_mapreduce_grid or some other function, I need to find examples that use those functions instead. I have been having a very hard time finding such things, and any further pointers would be greatly appreciated.
Those are internal functions used to implement the partial_mapreduce_grid kernel, which itself isn’t meant to be used by users either. The only interface that’s exposed is mapreduce and its variants (mapreducedim, reduce, sum, etc.) from Base, so the Julia documentation is most relevant here.
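For example, for a plain reduction you would just call the Base functions on a CuArray and let CUDA.jl dispatch to its kernels, something like:

```julia
using CUDA

xs = CUDA.rand(10_000)
sum(xs)                      # plain reduction
sum(abs2, xs)                # reduction with a mapping function
mapreduce(x -> 2x, +, xs)    # explicit mapreduce
```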
If you want to implement your own kernel, you only need to know about the shfl functions, which are demonstrated in that reduce_block function (the reason I linked to it). These follow CUDA semantics, so to some extent you can just read the CUDA documentation and apply it to CUDA.jl. If you run into issues, don’t hesitate to ask here or in the #gpu channel on Slack.
I am working on a similar problem: I’m trying to compute the mean of an expression. I’ve been following the NVIDIA “Optimizing Parallel Reduction in CUDA” example, which has been quite helpful. I did impose one requirement that simplifies the algorithm: the input arrays must be zero-padded to a power-of-two length. This requirement is not stated explicitly in the NVIDIA example, but it is implied. The other issue is that very large arrays require recursive block reduction for best performance.
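For reference, a rough sketch of the padding step I mean; the helper name is mine, and nextpow does the rounding up (a real kernel would fold this into the launch rather than allocate a copy):

```julia
using CUDA

# Zero-pad so the tree reduction can halve its stride cleanly at every step.
function pad_to_pow2(xs::CuVector{Float32})
    n = length(xs)
    m = nextpow(2, n)
    m == n && return xs
    padded = CUDA.zeros(Float32, m)
    copyto!(padded, 1, xs, 1, n)
    return padded
end

xs = CUDA.rand(1000)
padded = pad_to_pow2(xs)          # length 1024, zero-filled tail
μ = sum(padded) / length(xs)      # the padding zeros don't affect the sum
```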