Sampling from a probability distribution on GPU

ArjunNarayanan · February 10, 2023, 2:01am

I’m interested in doing something like

using Distributions: Categorical
using CUDA

p = cu([0.7,0.2,0.1])
idx = rand(Categorical(p))

Is there a way to sample from a vector of probability masses on the GPU? This kind of stuff comes up a fair amount in machine learning.

rmsmsgood · February 10, 2023, 2:09am

Do you mean an easy or convenient way? In native CUDA? If there isn’t one, I’d guess it is not very difficult to implement yourself.

ArjunNarayanan · February 10, 2023, 2:18am

I guess something that’s convenient? I don’t have much experience with sampling algorithms. But if something needs to be implemented, I could do it with the right guidance.

jpsamaroo · February 10, 2023, 11:30pm

Is your intention to be able to call rand(Categorical(p)) from within a GPU kernel? Or does the code you posted just not work, and you want to know how to make it work?

jpsamaroo · February 10, 2023, 11:40pm

This is not a particularly helpful answer; if it’s not difficult to implement, then maybe you could instead point to some resources that would help the OP?

simsurace · February 11, 2023, 1:13am

This type of question comes up often.
Just a few days ago I saw and answered a very similar question on Slack.

I happen to have some CUDA.jl kernel code in a package of mine that I haven’t touched in a couple of years, which could serve as a starting point. It uses a naive algorithm, though.

using CUDA
using BinomialSynapses: indices!

function rand_categorical(p, n)
    v = repeat(p', n ÷ length(p) + 1, 1)
    idx = last(indices!(v))
    return idx[1:n]
end

p = cu([0.7,0.2,0.1]) # does not need to be normalized
samples = rand_categorical(p, 1000)

should give you a CuVector of length 1000 of categorical samples.

But since that function was written for a specific application (resampling particles) where the number of samples needed was equal to the length of p, and where there were a lot of different ps, this is not going to be competitive in performance if you have short p and need lots of samples. You are (much) better off just copying your p to the CPU and calling rand there, and then copying the samples back to the GPU if needed.

An algorithm that is efficient for lots of repeated samples is probably going to use alias tables. See e.g. the ByteHamster/alias-table-gpu repository on GitHub (“Efficient construction of and sampling from alias tables on the GPU”) and the associated paper.
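
For reference, the alias-table idea can be sketched on the CPU as follows. This is a minimal, unoptimized version of Vose's alias method, not the GPU construction from the linked repository; the names `build_alias_table` and `sample_alias` are made up for illustration.

```julia
using Random

# Build the acceptance and alias tables for weights `w` (need not be normalized).
# Construction is O(n); each later sample costs O(1).
function build_alias_table(w::AbstractVector{<:Real})
    n = length(w)
    scaled = collect(Float64, w) .* (n / sum(w))   # rescale so the mean weight is 1
    accept = ones(Float64, n)                      # probability of keeping column i
    alias = collect(1:n)                           # fallback index for column i
    small = [i for i in 1:n if scaled[i] < 1.0]
    large = [i for i in 1:n if scaled[i] >= 1.0]
    while !isempty(small) && !isempty(large)
        s = pop!(small)
        l = pop!(large)
        accept[s] = scaled[s]
        alias[s] = l
        scaled[l] -= 1.0 - scaled[s]               # l donates mass to fill column s
        push!(scaled[l] < 1.0 ? small : large, l)
    end
    # Entries left over (round-off only) keep accept = 1 and alias to themselves.
    return accept, alias
end

# Draw one index: pick a column uniformly, then keep it or jump to its alias.
function sample_alias(rng::AbstractRNG, accept, alias)
    i = rand(rng, 1:length(accept))
    return rand(rng) < accept[i] ? i : alias[i]
end

accept, alias = build_alias_table([0.7, 0.2, 0.1])
samples = [sample_alias(Random.default_rng(), accept, alias) for _ in 1:1000]
```

The table only pays off when many samples are drawn from the same `p`, since it has to be rebuilt whenever the weights change.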

It would be nice to have a library for efficient/state-of-the-art sampling algorithms on GPUs using some portable approach like KernelAbstractions.jl?

ArjunNarayanan · February 13, 2023, 6:01pm

The latter. I don’t think Categorical works with CuArray.

ArjunNarayanan · February 13, 2023, 6:03pm

Thanks, that’s helpful.

My application is in Reinforcement Learning. I only need one sample from my distribution which is the action I will take in the next time step. In this situation, perhaps I’m better off just moving the array to CPU and calling rand there?

ArjunNarayanan · February 13, 2023, 6:07pm

I looked around online and I guess it’s possible to implement sampling using the inverse transform method for reasonable distributions? I think CUDA already provides a cumsum for the CDF. So one might only need to implement a binary search on CuArray to use this method?
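
A rough sketch of the inverse transform idea, written against generic arrays. Note one substitution: instead of a binary search it counts the CDF entries at or below the uniform draw, which is O(n) but needs only `cumsum`, broadcasting and a reduction. The function name `sample_inverse_cdf` is hypothetical, and the example below runs on a plain `Vector`; passing a `CuArray` should work in principle but is an untested assumption.

```julia
using Random

# Draw one index from unnormalized weights `p` by inverse-transform sampling.
# No scalar indexing into `p` is needed.
function sample_inverse_cdf(p::AbstractVector{T}) where {T<:AbstractFloat}
    cdf = cumsum(p)                 # running totals, same array type as p
    u = rand(T) * sum(p)            # uniform draw scaled to the total weight
    idx = sum(cdf .<= u) + 1        # number of CDF entries at or below u, plus one
    return min(idx, length(p))      # guard against floating-point round-off
end

p = Float32[0.7, 0.2, 0.1]          # stand-in for cu([0.7, 0.2, 0.1])
idx = sample_inverse_cdf(p)
```

For a single sample from a short `p`, the extra `cumsum` allocation and the reduction are unlikely to beat sampling on the CPU.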

simsurace · February 13, 2023, 9:36pm

Yeah, unless you are running many agents in parallel you don’t need to sample on the GPU.
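
A minimal sketch of that CPU fallback, assuming Distributions.jl is installed. The helper name `sample_on_cpu` is made up for illustration, and the example uses a plain `Vector` so that it runs without a GPU.

```julia
using Distributions: Categorical

# Copy the weights to host memory and sample there.
# `Array(p)` works for a CuArray as well as for a plain Vector.
function sample_on_cpu(p::AbstractVector{<:Real})
    p_cpu = Array(p)
    # Categorical expects probabilities that sum to 1, so normalize first.
    return rand(Categorical(p_cpu ./ sum(p_cpu)))
end

p = Float32[0.7, 0.2, 0.1]   # stand-in for cu([0.7, 0.2, 0.1])
action = sample_on_cpu(p)
```

The result is an ordinary `Int` on the host, which is usually what the environment step of an RL loop wants anyway.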

e3c6 · June 13, 2023, 12:24pm

I was also wondering what’s the best solution to this.

Another possibility is the Gumbel trick, in which case one only needs a CUDA function that returns the index of the maximum entry in an array.
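
For illustration, the Gumbel trick might look roughly like this. The name `sample_gumbel` is hypothetical; the function takes unnormalized log-probabilities (logits), and the example runs on a plain `Vector`. Using it with a `CuArray` relies on `rand!`, broadcasting and `argmax` being available for that array type, which is an assumption not tested here.

```julia
using Random

# Draw one index with the Gumbel-max trick from unnormalized log-probabilities.
function sample_gumbel(logits::AbstractVector{<:AbstractFloat})
    u = rand!(similar(logits))      # uniform noise in [0, 1), same array type as logits
    g = @. -log(-log(u))            # standard Gumbel noise
    return argmax(logits .+ g)      # index of the largest perturbed logit
end

p = Float32[0.7, 0.2, 0.1]
idx = sample_gumbel(log.(p))
```

Since it works on logits directly, there is no need to normalize or to build a CDF, which is convenient when the parameters change at every training step.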

findmyway · June 13, 2023, 12:47pm

I wrote a blog post on this several years ago. I’m not sure whether the code still works, but it should be a good starting point:

https://tianjun.me/essays/Categorical_Sampling_on_GPU_with_Julia/

e3c6 · June 13, 2023, 12:49pm

Thanks. I see you use an alias table approach?

Unfortunately, I am using the categorical sampling during training, so the parameters change quickly. Maintaining an alias table does not seem like the right approach, because I won’t generate that many samples at fixed parameter values. But I could be wrong.

findmyway · June 13, 2023, 12:52pm

In that case, I’d prefer the Gumbel trick. The extra allocation is trivial.
