I guess something that’s convenient? I don’t have much experience with sampling algorithms. But if something needs to be implemented, I could do it with the right guidance.

Is your intention to be able to call rand(Categorical(p)) from within a GPU kernel? Or does the code you posted just not work, and you want to know how to make it work?

This is not a particularly helpful answer; if it’s not difficult to implement, then maybe you could instead point to some resources that would help the OP?

This type of question comes up often.
Just a few days ago I saw and answered a very similar question on Slack.

I happen to have some CUDA.jl kernel code in a package of mine that I haven’t touched in a couple of years that could serve as a starting point. It is using a naive algorithm though.

using CUDA
using BinomialSynapses: indices!
function rand_categorical(p, n)
v = repeat(p', n ÷ length(p) + 1, 1)
idx = last(indices!(v))
return idx[1:n]
end
p = cu([0.7,0.2,0.1]) # does not need to be normalized
samples = rand_categorical(p, 1000)

should give you a CuVector of length 1000 of categorical samples.

But since that function was written for a specific application (resampling particles) where the number of samples needed was equal to the length of p, and where there were a lot of different ps, this is not going to be competitive in performance if you have short p and need lots of samples. You are (much) better off just copying your p to the CPU and calling rand there, and then copying the samples back to the GPU if needed.

My application is in Reinforcement Learning. I only need one sample from my distribution which is the action I will take in the next time step. In this situation, perhaps I’m better off just moving the array to CPU and calling rand there?

I looked around online and I guess it’s possible to implement sampling using the inverse transform method for reasonable distributions? I think CUDA already provides a cumsum for the CDF. So one might only need to implement a binary search on CuArray to use this method?

Unfortunately I am using the categorical sampling during training so parameters are changing quickly and maintaining an alias table seems like not the right approach because I won’t generate that many samples at fixed parameter values. But I could be wrong.