Sampling from a probability distribution on GPU

I’m interested in doing something like

using Distributions: Categorical
using CUDA

p = cu([0.7,0.2,0.1])
idx = rand(Categorical(p))

Is there a way to sample from a vector of probability masses on the GPU? This kind of stuff comes up a fair amount in machine learning.

You mean, easy or convenient way? In native CUDA? If not, I guess that your job is not much difficult to implement.

I guess something that’s convenient? I don’t have much experience with sampling algorithms. But if something needs to be implemented, I could do it with the right guidance.

Is your intention to be able to call rand(Categorical(p)) from within a GPU kernel? Or does the code you posted just not work, and you want to know how to make it work?

This is not a particularly helpful answer; if it’s not difficult to implement, then maybe you could instead point to some resources that would help the OP?

This type of question comes up often.
Just a few days ago I saw and answered a very similar question on Slack.

I happen to have some CUDA.jl kernel code in a package of mine that I haven’t touched in a couple of years that could serve as a starting point. It is using a naive algorithm though.

using CUDA
using BinomialSynapses: indices!

function rand_categorical(p, n)
     v = repeat(p', n ÷ length(p) + 1, 1)
     idx = last(indices!(v))
     return idx[1:n]
end

p = cu([0.7,0.2,0.1]) # does not need to be normalized
samples = rand_categorical(p, 1000)

should give you a CuVector of length 1000 of categorical samples.

But since that function was written for a specific application (resampling particles) where the number of samples needed was equal to the length of p, and where there were a lot of different ps, this is not going to be competitive in performance if you have short p and need lots of samples. You are (much) better off just copying your p to the CPU and calling rand there, and then copying the samples back to the GPU if needed.

An algorithm that is efficient for lots of repeated samples is probably going to use alias tables. See e.g. GitHub - ByteHamster/alias-table-gpu: Efficient construction of and sampling from alias tables on the GPU and the associated paper.

It would be nice to have a library for efficient/state-of-the-art sampling algorithms on GPUs using some portable approach like KernelAbstractions.jl?

1 Like

The latter. I don’t think Categorical works with CuArray.

Thanks that’s helpful.

My application is in Reinforcement Learning. I only need one sample from my distribution which is the action I will take in the next time step. In this situation, perhaps I’m better off just moving the array to CPU and calling rand there?

I looked around online and I guess it’s possible to implement sampling using the inverse transform method for reasonable distributions? I think CUDA already provides a cumsum for the CDF. So one might only need to implement a binary search on CuArray to use this method?

Yeah, unless you are running many agents in parallel you don’t need to sample on the GPU.

I was also wondering what’s the best solution to this.

Another possibility is the Gumbel trick, in which case one only needs a CUDA function that returns the index of the maximum entry in an array.

I wrote a blog on it several years ago, not sure if it still works. But should be a good starting point :wink:

https://tianjun.me/essays/Categorical_Sampling_on_GPU_with_Julia/

1 Like

Thanks. I see you use an alias table approach?

Unfortunately I am using the categorical sampling during training so parameters are changing quickly and maintaining an alias table seems like not the right approach because I won’t generate that many samples at fixed parameter values. But I could be wrong.

In that case, I’d prefer the Gumbel trick. The extra allocation is trivial.