Yes! I got a working version just now.

There were some changes to `gpu_call` and `global_rng`.

```
"""
    cu_alias_sample!(a::CuArray, wv::AbstractVector, x::CuArray)

Fill `x` in place with samples drawn from the labels `a` according to the
weights `wv`, using the alias method (Walker/Vose) on the GPU. The alias
table is built on the CPU, copied to the device, and then each GPU thread
draws one sample using two uniform random numbers. Returns `x`.

Throws `DimensionMismatch` if `a` and `wv` differ in length.
"""
function cu_alias_sample!(a::CuArray{Ta}, wv::AbstractVector{Tw}, x::CuArray{Ta}) where {Tw<:Number, Ta}
    length(a) == length(wv) || throw(DimensionMismatch("weight vector must have the same length with label vector"))
    n = length(wv)
    # Build the alias table on the host (make_alias_table! — presumably
    # StatsBase's; verify it is in scope at the call site).
    ap = Vector{Tw}(undef, n)
    alias = Vector{Int64}(undef, n)
    make_alias_table!(wv, sum(wv), ap, alias)
    # Upload the finished table to the device.
    alias = CuArray{Int64}(alias)
    ap = CuArray{Tw}(ap)
    function kernel(state, _, (alias, ap, x, a, randstate))
        r1 = GPUArrays.gpu_rand(Float32, state, randstate)
        r2 = GPUArrays.gpu_rand(Float32, state, randstate)
        # gpu_rand can return exactly 1; remap to 0 so the computed bin index
        # stays in 1:n. Use Float32 literals (1f0/0f0) — the original 1.0/0.0
        # made r1/r2 a Union{Float32,Float64}, forcing slow Float64 math in
        # the kernel.
        r1 = r1 == 1f0 ? 0f0 : r1
        r2 = r2 == 1f0 ? 0f0 : r2
        i = linear_index(state)
        if i <= length(x)
            # Pick a bin uniformly, then accept its label or its alias.
            j = floor(Int, r1 * n) + 1
            @inbounds x[i] = r2 < ap[j] ? a[j] : a[alias[j]]
        end
        return
    end
    gpu_call(kernel, x, (alias, ap, x, a, GPUArrays.default_rng(typeof(x)).state))
    return x
end
```

If I can find the time I would like to put together a PR that adds this functionality to CUDA.jl, but I am not sure how to contact the author of that blog.

My speedups aren’t quite as significant as the author’s, but I have a very fast CPU and an old GPU.

```
26.680 μs (75 allocations: 3.73 KiB)
1.345 ms (4 allocations: 640 bytes)
```

(top is CUDA implementation)