Shared memory limitations

My apologies if this has already been asked, but which operations are currently supported for shared memory in CUDAnative.jl? Is it likely that more will be added in the future? Or, a better question: how hard would it be to add more? Are major structural changes required, or is it relatively straightforward?

Specifically, here are some of the things I'm interested in:

  • Subarrays/views - basically anything beyond basic scalar indexing
  • Broadcast notation
  • map/reduce/other array functions
  • Iterators
  • etc.

By the way, I'm not using CUDAnative for GPU programming myself; rather, I'm trying to build a higher-level GPU framework in either Julia or C++, so the above info would be very useful and much appreciated.

It might make sense to look at how CUDAnative handles shared memory. Shared memory is created and returned to the user as a CuDeviceArray. A CuDeviceArray generally represents addressable memory on the device and is used for both global and shared memory. The implementation is here:

It is worth noting that CuDeviceArray implements the AbstractArray interface, so some of the things you ask about already work because of that: views, reshaping, iteration, …
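For instance, a kernel along these lines should already work thanks to those AbstractArray fallbacks (an untested sketch using the CUDAnative-era @cuDynamicSharedMem API; the per-thread view is just one way to slice the array):

function kernel(n)
    shmem = @cuDynamicSharedMem(Float32, n)
    # A per-thread view into the shared array; views, eachindex and
    # iteration all come from the AbstractArray fallbacks.
    mine = view(shmem, threadIdx().x:threadIdx().x)
    for i in eachindex(mine)
        @inbounds mine[i] = Float32(threadIdx().x)
    end
    return nothing
end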

Now, to your question about how hard it would be to implement:

  • Broadcast notation
  • map/reduce/other array functions

I think as long as you only need in-place operations it shouldn’t be too hard. It would probably be a bit finicky, but it should be doable.

The bigger question is: What should they mean? There are two interpretations:

  1. Stick with the SPMD model: each thread executes its own broadcast. No effort is made to reuse results within a thread block or to avoid bank conflicts.
  2. Make broadcast automatically parallelize across a thread block for shared memory, or across the entire grid for global memory. Much trickier, but more interesting; it's also unclear what to do with divergent broadcasts. (A sketch of this idea follows below.)
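To make interpretation 2 concrete, a cooperative in-place map across a thread block might look something like the following (a hand-written sketch of what such a broadcast would have to lower to, not an existing API; block_map! is a hypothetical name):

# Device-side helper, called from inside a kernel: each thread of the
# block handles a strided subset of the array, then all threads
# synchronize so the results are visible block-wide.
function block_map!(f, A)
    i = threadIdx().x
    while i <= length(A)
        @inbounds A[i] = f(A[i])
        i += blockDim().x
    end
    sync_threads()
    return A
end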

I would say the reason we haven't implemented broadcast or map is that, while convenient, they can also be a footgun due to the SPMD model used in GPU programming.

As an example:

function kernel()
    shmem = @cuDynamicSharedMem(Int64, blockDim().x)
    shmem .= threadIdx().x
    return nothing
end

Using the semantics from interpretation 1, each thread would execute its own broadcast, but what result are you expecting? The user just created a race… Now you could fix that by putting the burden on the user: you need to ensure that each memory location is only updated by one thread, so you should have broadcast into a per-thread view (e.g. shmem[threadIdx().x:threadIdx().x] .= threadIdx().x). For my taste it is much easier to just reason locally:

function kernel()
    shmem = @cuDynamicSharedMem(Int64, blockDim().x)
    shmem[threadIdx().x] = threadIdx().x
    return nothing
end

or to use StaticArrays.jl (a sketch of that follows below).
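For illustration, a StaticArrays-based variant might look roughly like this (an untested sketch; statically sized MVector allocations inside kernels generally lower to stack/register storage rather than shared memory):

using CUDAnative, StaticArrays

function kernel()
    # Thread-local, statically sized scratch space instead of shared
    # memory; no other thread can see it, so no race is possible.
    tmp = MVector{4, Int64}(undef)
    for i in 1:4
        @inbounds tmp[i] = threadIdx().x * i
    end
    return nothing
end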


Thank you for your reply! This is very helpful - I didn't yet understand CuDeviceArray and that it subtypes AbstractArray. That is good to know.

When I mentioned these, I wasn't actually thinking that they would do any sort of automatic parallelization (though I agree that would be interesting, sort of like CUDA's CUB library). I was asking these questions with a per-thread view into shared memory in mind, not the entire shared memory of a thread block.

Does that change anything? I'm guessing not, but can you do more with a view into a CuDeviceArray than with an owned CuDeviceArray? It seems like you can do more with StaticArrays.jl than with a CuDeviceArray; is it possible to simply instantiate an SArray wrapping a view into a CuDeviceArray, or do you need to copy out of the CuDeviceArray?

Sorry, this is a bunch of questions, but what I basically want to understand is: how can I do caching without restricting the functionality supported for the cached data?

I think it would be interesting to add a StaticView type to StaticArrays, which would give you optimized implementations while imposing the limitation that you need to know the sizes of the data, and therefore a higher chance of getting the code to run on a GPU.
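Nothing like this exists yet; just to make the idea concrete, a hypothetical sketch of such a type could look like the following (names and interface are my own invention, not part of StaticArrays):

using StaticArrays

# Hypothetical: a statically sized view into an arbitrary parent array.
# The length L is a type parameter, so the size is known at compile
# time, as GPU-friendly code requires.
struct StaticView{L, T, P<:AbstractArray{T}} <: StaticVector{L, T}
    parent::P
    offset::Int
end

StaticView{L}(parent::AbstractArray{T}, offset::Int) where {L, T} =
    StaticView{L, T, typeof(parent)}(parent, offset)

Base.@propagate_inbounds Base.getindex(v::StaticView, i::Int) =
    v.parent[v.offset + i]

Base.@propagate_inbounds Base.setindex!(v::StaticView, x, i::Int) =
    (v.parent[v.offset + i] = x)

A view into shared memory would then be something like StaticView{4}(shmem, offset), and dispatch on StaticVector would pick up the size-specialized StaticArrays methods.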

Do keep in mind that CuDeviceArray is used both for shared and global memory.

But feel free to experiment; the great thing about CUDAnative is that it is all written in Julia, so one can extend it and mess with it quite nicely!

Yeah, a StaticView sounds like what I’m looking for. I’ll definitely experiment with trying to make something like that soon - thanks for the help!