Sampling with replacement on GPU

I have a CuArray with weights, i.e. each row is a weight vector (not necessarily normalized but this is easy to accomplish if needed). I want to draw number_of_column numbers in 1:number_of_columns for each of the rows i.i.d. according to the probabilities specified by the weights, or even better, the columns of another CuArray.

Is there an efficient implementation of this in the package ecosystem?


Here’s a bad solution (basically using inverse transform sampling), but at least short and only uses high-level broadcasts over the CuArrays:

Nweights = 10
Ndraws = 100
w = CUDA.rand(Nweights)

w_norm = w ./ sum(w)
x = CUDA.rand(1, Ndraws)
sum(cumsum(w_norm, dims=1) .< x, dims=1)

The result is 100 random numbers in 1:10 with weights given by w. With a little reasoning about dimensions you should be able to adapt that to mutiple weight vectors too.

Anyway, this is very innefficient since it forms an Nweights-by-Ndraws matrix. I’d be surprised if any of the CPU sampling-with-replacement codes work without triggering scalar indexing for CuArrays, so if you want to improve on this my guess is you’d need to write a custom kernel. This may be relevant too: Categorical Sampling on GPU with Julia


I would be very interested in this as well.

I spent some time trying to reimplement that blog post in the current version of CUDA.jl.

However it depends on GPUArrays.gpu_rand which is no longer available. I was trying to use the Random123.jl idea suggested by @tkf in his example in the FoldsCUDA docs, but didn’t have much success.

EDIT: this is not true, gpu_rand is still available, it’s just an API that is unstable and so not really documented

In general the ecosystem for sampling from distributions on the GPU in Julia seems really sparse still, which is a shame cause monte carlo simulations are a big application for GPU compute. I would love to help improve it but I just don’t have the expertise to start that kind of project, I would definitely contribute if someone took the lead though.

1 Like

Thanks, I also tried to play around with the code from the blog post, but had some issues. Did you manage to get at least an implementation that works (if perhaps inefficient) and would you be happy to share it?

Yes! I got a working version just now.

There were some changes to gpu_call and global_rng.

function cu_alias_sample!(a::CuArray{Ta}, wv::AbstractVector{Tw}, x::CuArray{Ta}) where {Tw<:Number, Ta}
    length(a) == length(wv) || throw(DimensionMismatch("weight vector must have the same length with label vector"))
    n = length(wv)
    # create alias table
    ap = Vector{Tw}(undef, n)
    alias = Vector{Int64}(undef, n)
    make_alias_table!(wv, sum(wv), ap, alias)
    # to device
    alias = CuArray{Int64}(alias)
    ap = CuArray{Tw}(ap)
    function kernel(state, _, (alias, ap, x, a, randstate))
        r1, r2 = GPUArrays.gpu_rand(Float32, state, randstate), GPUArrays.gpu_rand(Float32, state, randstate)
        r1 = r1 == 1.0 ? 0.0 : r1
        r2 = r2 == 1.0 ? 0.0 : r2
        i = linear_index(state)
        if i <= length(x)
            j = floor(Int, r1 * n) + 1
            @inbounds x[i] = r2 < ap[j] ? a[j] : a[alias[j]]
    gpu_call(kernel, x, (alias, ap, x, a, GPUArrays.default_rng(typeof(x)).state))

If I can find the time I would like to put together a PR that adds this functionality to CUDA.jl, but I am not sure how to contact the author of that blog.

My speedups aren’t quite as significant as the author’s, but I have a very fast CPU and an old gpu.

  26.680 μs (75 allocations: 3.73 KiB)
  1.345 ms (4 allocations: 640 bytes)

(top is CUDA implementation)

1 Like

Thanks for that. I’m currently working on a different problem (sampling binomials), but I’ll get back to this soon.

1 Like