Custom random sampling kernels

I’m trying to write custom kernels for random sampling from various distributions. Most needed is a Binomial sampler that runs on the GPU. That functionality seems to be under-developed currently.

As a first step I tried to make a kernel which reproduces CUDA.rand(). However, the only way I found to sample a scalar random variable on the GPU was using GPUArrays.gpu_rand:

using CUDA, GPUArrays
import GPUArrays: gpu_rand, default_rng
import CUDA: CuKernelContext

function kernel_rand!(A, randstates)
    index1  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride1 = blockDim().x * gridDim().x
    @inbounds for i in index1:stride1:length(A)
        A[i] = gpu_rand(Float32, CuKernelContext(), randstates)
    end
    return
end

function myrand(m1, m2)
    A          = CUDA.zeros(m1, m2)
    rngstate   = default_rng(typeof(A)).state
    begin
        @cuda threads=256 blocks=ceil(Int, length(A)/256) kernel_rand!(A, rngstate)
    end
    return A
end

and thus I’m probably mixing two packages in ways they are not designed to be used. I’m getting type instabilities, in particular

%5  = Base.getproperty(%4, :state)::AbstractGPUArray{NTuple{4, UInt32}, 1}

The timing shows that my implementation is about three times slower than the one from CUDA.

So the following questions arise:

  1. What is currently the best way to sample a scalar random uniform variable on the GPU?
  2. What is the proper way to call the kernel in order to avoid type instabilities?
  3. Why is my implementation slow?
1 Like

It seems like you’re sending in a 2D array with CUDA.zeros(m1, m2), but treating it like a 1-D array in your kernel. Am I missing something there?

I have been working on this also. I don’t know enough to answer your questions, but here is an implementation with gpu_call that is almost as fast as the one in CUDA.jl. I don’t fully understand what is going on though.

function rand_2!(A)
    function kernel(state, _, (A,randstate))
        rand_sample = GPUArrays.gpu_rand(Float64, state, randstate)
        i = linear_index(state)
        if i <= length(A)
            A[i] = rand_sample
        end
        return nothing
    end
    gpu_call(kernel, A, (A, GPUArrays.default_rng(typeof(A)).state))
end
function test_rand()
    A = zeros(10_000)
    B = cu(A)
    C = cu(A)
    @btime rand!($A)
    @btime CUDA.rand!($B)
    @btime rand_2!($C)
    display(C)
end

I have tried modifying your version, and no matter what I do (I modified it to be mutating of course), it’s still slower than using gpu_call.

Here is a kernel that does geometric sampling. It is about an order of magnitude faster than the same process on a CPU. Uniform sampling doesn’t get the same speedup it seems.

function geometric_sample!(A,p)
    function kernel(A,p,randstate)
        index = threadIdx().x
        stride = blockDim().x
        @inbounds for i in index:stride:length(A)
            rand_sample = GPUArrays.gpu_rand(Float64,  CUDA.CuKernelContext(), randstate)
            X = 0
            sum = prod = p
            q = 1.0 - p
            while (rand_sample > sum) 
                prod *= q
                sum += prod
                X += 1
            end
            A[i] = X
        end
        return nothing
    end
    @cuda threads=256 blocks = ceil(Int, length(A)/256) kernel(A, p, GPUArrays.default_rng(typeof(A)).state)
end

Still working on a poisson kernel!

EDIT: I should add that the algorithm for the geometric sampler came from cupy’s source, although their binomial implementation is… intimidating.

1 Like

Yes, but here is a version with 2D indexing. It is slower on my machine!

function kernel_rand!(A, randstates)
    index1  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride1 = blockDim().x * gridDim().x
    index2  = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    stride2 = blockDim().y * gridDim().y
    for i in index1:stride1:size(A, 1)
        for j in index2:stride2:size(A, 2)
            A[i, j] += gpu_rand(Float32, CuKernelContext(), randstates)
        end
    end
    return
end

function myrand(m1, m2)
    A = CUDA.zeros(m1, m2)
    numblocks1 = ceil(Int, m1/16)
    numblocks2 = ceil(Int, m2/16)
    rng        = default_rng(typeof(A))
    begin
        @cuda threads=(16, 16) blocks=(numblocks1, numblocks2) kernel_rand!(A, rng.state)
    end
    return A
end

Aren’t you measuring just the CPU time to launch the kernel? So far I measured everything with

@btime CUDA.@sync

or

@benchmark CUDA.@sync

and the times are one order of magnitude larger.

Thanks for your version of rand_2! with gpu_call. If I define a non-allocating version of myrand above and run

function test_rand()
    A = zeros(1024, 1024)
    B = cu(A)
    C = cu(A)
    D = cu(A)
    @btime rand!($A)
    @btime CUDA.@sync CUDA.rand!($B)
    @btime CUDA.@sync myrand!($C)
    @btime CUDA.@sync rand_2!($D)
    #display(C)
end

I obtain

532.867 μs (0 allocations: 0 bytes)
34.562 μs (11 allocations: 192 bytes)
89.899 μs (18 allocations: 432 bytes)
87.430 μs (26 allocations: 640 bytes)

So myrand! and rand_2! seem to be equally slow compared to the CUDA implementation. But it is somehow reassuring that I did not do so badly with the slightly strange use of gpu_rand within a @cuda context (although I really don’t know how else one should go about this, without using gpu_call design as you did).

Maybe the type warnings are not as big of a clue as I expected, as rand_2! also produces them, but @device_code_warntype doesn’t produce any for neither rand_2! nor myrand!, and the CPU kernel launch times do not account for the large discrepancy in timings.

This post is about sampling binomial random variates on the GPU

Thanks for mentioning this reference, I hadn’t looked at it.
Yes, the binomial samplers tend to look a bit scary, this one especially.

For what it’s worth, here is a completely naive algorithm (summing bernoulli samples). I should say that for my application I want to sample a matrix whose i’th row shares the same parameter p[i], but other variants should be easy to accommodate. Using the same design as for myrand above,

function kernel_binomial_naive!(n, k, p, r)
    index1  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride1 = blockDim().x * gridDim().x
    index2  = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    stride2 = blockDim().y * gridDim().y
    
    for i in index1:stride1:size(k, 1), j in index2:stride2:size(k, 2)
            k[i, j] = 0
            for m in 1:n[i,j]                                     
                @inbounds k[i,j] += gpu_rand(Float32, CuKernelContext(), r) < p[i]
            end
    end
    return
end

function rand_binomial_naive(n, p)
    m1, m2     = size(n)
    k          = CUDA.zeros(m1, m2)
    numblocks1 = ceil(Int, m1/16)
    numblocks2 = ceil(Int, m2/16)
    r          = default_rng(CuArray{Float32, 2})
    begin
        @cuda threads=(16, 16) blocks=(numblocks1, numblocks2) kernel_binomial_naive!(n, k, p, r.state)
    end
    return k
end

It takes about 12ms to sample an 1024x1024 array. This is actually competitive with PyTorch (approx 11ms, if I’m measuring correctly),

# Python 3
import torch

shape  = [1024, 1024]
counts = torch.cuda.FloatTensor([[128.] * shape[0]] * shape[1])
probs  = torch.rand([shape[0]], device='cuda:0')

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
samples = torch.binomial(count = counts, prob = probs)
end.record()
torch.cuda.synchronize()
print(start.elapsed_time(end))

which uses the more advanced BTRS algorithm.
Note that because the n parameter may be different for each sample, table methods which are fast if you need many samples from the same distribution are not quite appropriate for my use case.
I implemented the similar BTRD algorithm (taken almost straight from tensorflow) below, which is about 30% faster and therefore faster than PyTorch. My goal is now to find out how much faster I can get.

xt1mx(x) = x*(1-x)
xd1mx(x) = x/(1-x)

function stirling_approx_tail(k)
    if k == 0
        return 0.0810614667953272
    elseif k == 1
        return 0.0413406959554092
    elseif k == 2
        return 0.0276779256849983
    elseif k == 3
        return 0.0207906721037650
    elseif k == 4
        return 0.0166446911898211
    elseif k == 5
        return 0.0138761288230707
    elseif k == 6
        return 0.0118967099458917
    elseif k == 7
        return 0.0104112652619720
    elseif k == 8
        return 0.00925546218271273
    elseif k == 9
        return 0.00833056343336287
    end
    kp1sq = (k + 1) * (k + 1);
    return (1.0 / 12 - (1.0 / 360 - 1.0 / 1260 / kp1sq) / kp1sq) / (k + 1)
end


                                        
function kernel_BTRD!(n, k, p, randstates)
    index1  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride1 = blockDim().x * gridDim().x
    index2  = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    stride2 = blockDim().y * gridDim().y
    
    @inbounds for i in index1:stride1:size(k, 1)
        # BTRD approximations work well for p <= 0.5
        pp         = min(p[i], 1-p[i])                                 
                                                
        # edge case
        if pp == 0
            for j in index2:stride2:size(k, 2)
                k[i, j] = 0
            end
            continue
        end
            
        # general case
        logp       = log(1-pp)
        r          = xd1mx(pp)
        s          = xt1mx(pp)
                                                
        @inbounds for j in index2:stride2:size(k, 2)
            # trivial case                                            
            if n[i, j] == 0
                k[i, j] = 0
                continue
            end

            # Use inversion algorithm for n*p < 10
            if n[i, j] * pp < 10
                geom_sum = 0.
                num_geom = 0
                while true
                    geom      = ceil(log(gpu_rand(Float32, CuKernelContext(), randstates)) / logp)
                    geom_sum += geom
                    if geom_sum > n[i, j]
                        break
                    end
                    num_geom += 1
                end
                k[i, j] = num_geom
                continue
            end
            
            # BTRD algorithm
            k[i,j]     = -1
            stddev     = sqrt(n[i,j] * s)
            b          = 1.15 + 2.53 * stddev
            a          = -0.0873 + 0.0248 * b + 0.01 * pp
            c          = n[i,j] * pp + 0.5
            v_r        = 0.92 - 4.2 / b
    
            alpha      = (2.83 + 5.1 / b) * stddev;
            m          = floor((n[i,j] + 1) * pp)
                                                    
            while true                                      
                if k[i, j] >= 0
                    break
                end
                                                        
                usample = gpu_rand(Float32, CuKernelContext(), randstates) - 0.5
                vsample = gpu_rand(Float32, CuKernelContext(), randstates)
                                                        
                us = 0.5 - abs(usample)
                ks = floor((2 * a / us + b) * usample + c)
                                                        
                if us >= 0.07 && vsample <= v_r
                    k[i, j] = ks
                end
                
                if ks < 0 || ks > n[i, j]
                    continue
                end
            
                v2 = log(vsample * alpha / (a / (us * us) + b))
                ub = (m + 0.5) * log((m + 1) / (r * (n[i, j] - m + 1))) +
                     (n[i, j] + 1) * log((n[i, j] - m + 1) / (n[i, j] - ks + 1)) +
                     (ks + 0.5) * log(r * (n[i, j] - ks + 1) / (ks + 1)) +
                     stirling_approx_tail(m) + stirling_approx_tail(n[i, j] - m) - stirling_approx_tail(ks) - stirling_approx_tail(n[i, j] - ks)
                if v2 <= ub
                    k[i, j] = ks
                end
            end
            k[i,j] = relu(k[i,j])
                                                    
            # if pp = 1 - p[i] was used, do k <- n - k                                        
            if pp > 0.5                                             
                k[i, j] = n[i, j] - k[i, j]
            end
        end
    end
    return
end

function rand_binomial_BTRD(n, p)
    m1, m2     = size(n)
    k          = CUDA.zeros(m1, m2)
    numblocks1 = ceil(Int, m1/16)
    numblocks2 = ceil(Int, m2/16)
    r          = default_rng(CuArray{Float32, 2})
    begin
        @cuda threads=(16, 16) blocks=(numblocks1, numblocks2) kernel_BTRD!(n, k, p, r.state)
    end
    return k
end

Maybe you already know this, but for the Poisson distribution cuRAND actually has an API, see curandGeneratePoisson in https://docs.nvidia.com/cuda/curand/host-api-overview.html, and there is a wrapper for this in CUDA.jl/lib/curand/random.jl and CUDA.jl/lib/curand/libcurand.jl .

So you can actually call rand_poisson!(A, lambda = 100).
In contrast, for the binomial, the wrapper is there in CUDA.jl/lib/curand/libcurand.jl but there is no counterpart in cuRAND.

So I wonder why you need a custom kernel for the Poisson distribution?

EDIT: on further thought, maybe because you need to broadcast over an array of parameters as I need to do for the binomial?

Woops! Yes I think you are right. My timings with CUDA.@sync are very similar to yours, then.

The developers for JuliaGPU have office hours tomorrow I think, I can ask them what is up with this.

I am trying to get the NSight profiler working but so far have not been able to on my system.

Yes, I need a poisson sampler to call inside of a more complex kernel. As far as I know, the CURAND one will just give me an array of poisson distributed values, and it will use far too much memory to precompute the poisson samples I need and send it to my custom kernel. Moreover, I don’t statically know the number of samples I need, and so I would need to overestimate, which would be slow.

Interesting, nice work!

Besides using gpu_rand, the other way that they suggest is to use a counter based RNG as described here. This method seems to use up my GPU memory very quickly though.

I did not get that to work. My sampled arrays had repeating patterns. That’s probably because I did not understand how to adapt the counter to my array indexing using the @cuda (which is different enough from using FLoops to put a for loop on the GPU to confuse me).

Using Float32's for the constants and computing the Stirling tail for k > 0 gave a more than double speedup of BTRD over the previous version. It now runs in under 4ms, which is three times as fast as PyTorch. Still, I believe that it is possible to make it faster.

function stirling_approx_tail(k)::Float32
    if k == 0
        return 0.0810614667953272f0
    end
    kp1sq = (k + 1f0) * (k + 1f0);
    return (1.0f0 / 12 - (1.0f0 / 360 - 1.0f0 / 1260 / kp1sq) / kp1sq) / (k + 1)
end


                                        
function kernel_BTRD!(n, k, p, randstates)
    index1  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride1 = blockDim().x * gridDim().x
    index2  = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    stride2 = blockDim().y * gridDim().y
    
    @inbounds for i in index1:stride1:size(k, 1)
        # BTRD approximations work well for p <= 0.5
        invert = p[i] > 0.5
        pp     = invert ? 1-p[i] : p[i]
                                                
        # edge case
        if pp == 0
            for j in index2:stride2:size(k, 2)
                k[i, j] = 0
            end
            continue
        end
            
        # general case
        logp       = log(1-pp)
        r          = pp/(1-pp)
        s          = pp*(1-pp)
                                                
        @inbounds for j in index2:stride2:size(k, 2)
            # trivial case                                            
            if n[i, j] == 0
                k[i, j] = 0
                continue
            end
            
            # Use naive algorithm for n <= 17
            if n[i, j] <= 17
                k[i, j] = 0
                for m in 1:n[i,j]                                     
                    k[i,j] += GPUArrays.gpu_rand(Float32, CUDA.CuKernelContext(), randstates) < p[i]
                end
                continue
            end
            
            # Use inversion algorithm for n*p < 10
            if n[i, j] * pp < 10f0
                geom_sum = 0f0
                num_geom = 0
                while true
                    geom      = ceil(log(GPUArrays.gpu_rand(Float32, CUDA.CuKernelContext(), randstates)) / logp)
                    geom_sum += geom
                    geom_sum > n[i, j] && break
                    num_geom += 1
                end
                k[i, j] = num_geom
                continue
            end
            
            # BTRD algorithm
            k[i,j]     = -1
            stddev     = sqrt(n[i,j] * s)
            b          = 1.15f0 + 2.53f0 * stddev
            a          = -0.0873f0 + 0.0248f0 * b + 0.01f0 * pp
            c          = n[i,j] * pp + 0.5f0
            v_r        = 0.92f0 - 4.2f0 / b
    
            alpha      = (2.83f0 + 5.1f0 / b) * stddev;
            m          = floor((n[i,j] + 1) * pp)
                                                    
            while true                                      
                k[i, j] >= 0 && break
                                                        
                usample = GPUArrays.gpu_rand(Float32, CUDA.CuKernelContext(), randstates) - 0.5f0
                vsample = GPUArrays.gpu_rand(Float32, CUDA.CuKernelContext(), randstates)
                                                        
                us = 0.5f0 - abs(usample)
                ks = floor((2 * a / us + b) * usample + c)
                                                        
                if us >= 0.07f0 && vsample <= v_r
                    k[i, j] = ks
                end
                
                if ks < 0 || ks > n[i, j]
                    continue
                end
            
                v2 = log(vsample * alpha / (a / (us * us) + b))
                ub = (m + 0.5f0) * log((m + 1) / (r * (n[i, j] - m + 1))) +
                     (n[i, j] + 1) * log((n[i, j] - m + 1) / (n[i, j] - ks + 1)) +
                     (ks + 0.5f0) * log(r * (n[i, j] - ks + 1) / (ks + 1)) +
                     stirling_approx_tail(m) + stirling_approx_tail(n[i, j] - m) - stirling_approx_tail(ks) - stirling_approx_tail(n[i, j] - ks)
                if v2 <= ub
                    k[i, j] = ks
                end
            end
            k[i,j] = max(0, k[i,j])
                                                    
            # if pp = 1 - p[i] was used, do k <- n - k                                        
            invert && (k[i, j] = n[i, j] - k[i, j])
        end
    end
    return
end

function rand_binomial_BTRD!(k, n, p)
    m1, m2     = size(n)
    numblocks1 = ceil(Int, m1/16)
    numblocks2 = ceil(Int, m2/16)
    r          = GPUArrays.default_rng(CuArray{Float32, 2})
    begin
        @cuda threads=(16, 16) blocks=(numblocks1, numblocks2) kernel_BTRD!(n, k, p, r.state)
    end
    return k
end
1 Like

Impressive! you should make this into a PR

Will do! But currently the existing samplers, e.g. CUDA.rand_poisson, do not support broadcasting over arrays of parameters. So the API for this should be carefully designed. Are there existing libraries that allow this, if only on the CPU? Of course one can use syntactic loop fusion on the CPU, i.e.

using Distributions

rand.(Binomial.(counts, probs))

but this is not supported on the GPU.

This is true. I suspect that the maintainers of CUDA.jl will probably have good suggestions on API though. It would be nice to get Distributions.jl integration but I have no idea how to go about that.

It came up on the GPU bi-weekly call yesterday, and it seems like the best path forward for now is the custom kernel by @simsurace, although I don’t think we should put it in CUDA.jl for now. It might be useful to put one of the above kernels in GPUArrays.jl under gpu_rand_binomial, but those APIs don’t have a way of passing arrays of parameters, so that wouldn’t help for your use case.

We’re slowly working towards getting rand() and friends working from within kernels (building on both contextual dispatch and some other compiler changes), which would make it possible to broadcast e.g. rand over an array of parameters. At that point, we can look into either supporting Distributions.jl, or temporarily adding e.g. rand_binomial to CUDA.jl, but that’s still some months away at the earliest.

So for now I should probably make a separate package, e.g. BinomialGPU.jl?

I put my code (with more improvements and functionality to accept various parameter array sizes, like pytorch) in the package BinomialGPU.jl