Generating Random Number from inside Kernel


#1

Is it possible to generate a random number from within a kernel?

My reading of CURAND.jl looks like curand() must be called from host.

I am implementing Genetic Algorithm. It is becoming impractical to have to pre-generate all random numbers beforehand from host and uploading them to GPU. I run out of memory very quickly.


#2

Yes there are several possibilities. The easiest is probably to use the random number generator that is part of GPUArrays should work on the device. (It isn’t documented well, but it should get you started: https://github.com/JuliaGPU/GPUArrays.jl/blob/master/src/random.jl)

@maleadt wrote this a while back and I don’t think it is part of any repository. It is a xorshift implementation.

function xorshift(x::UInt32)::UInt32
    x = xor(x, x << 13)
    x = xor(x, x >> 17)
    x = xor(x, x << 5)
    return x
end

function gpurand(state::UInt32)::Tuple{Float32,UInt32}
    state = xorshift(state)
    res = (state >> 9) | reinterpret(UInt32, 1f0)
    val = reinterpret(Float32, res) - 1f0
    return val, state
end

function gpu_walk_kernel(f, out, N, x, a, b)
    idx = (blockIdx().x-UInt32(1)) * blockDim().x + threadIdx().x

    # random number generation
    _, state = gpurand(idx)
    function rand()
        let state=state # JuliaLang/julia#15276
            val, state = gpurand(state)
            return val
        end
    end

    ...
end

The last idea is to actually use the device version of CURAND.jl and I attempted that in https://github.com/JuliaGPU/CUDAnativelib.jl


#3

I guess Tim’s version is based on https://github.com/JuliaGPU/CUDAnative.jl/issues/40#issuecomment-277597760


#4

@vchuravy thank you very much for your help.

I tried to implement the xor shift version.

using CUDAdrv, CUDAnative

@inline function xorshift(x::UInt32)::UInt32
    x = xor(x, x << 13)
    x = xor(x, x >> 17)
    x = xor(x, x << 5)
    return x
end

@inline function gpurand(state::UInt32)::Tuple{Float32,UInt32}
    state = xorshift(state)
    res = (state >> 9) | reinterpret(UInt32, 1f0)
    val = reinterpret(Float32, res) - 1f0
    return val, state
end

function cuda_random!(r)
    i = threadIdx().x
    idx = (blockIdx().x-UInt32(1)) * blockDim().x + threadIdx().x

    # random number generation
    val, state = gpurand(idx)
    r[i] = val
    return nothing
end

r = Array{Float32,1}(100)
d_r = CuArray(r)
@cuda (1, 100) cuda_random!(d_r)
r = Array(d_r)
println(r)

However getting this.

Float32[6.29425f-5, 0.000125885, 0.000188828, 0.00025177, 0.000314713, 0.000377655, 0.000440598, 0.00050354, 0.000566483, 0.000629425, 0.000692368, 0.00075531, 0.000818253, 0.000881195, 0.000944138, 0.0010072, 0.00107014, 0.00113308, 0.00119603, 0.00125897, 0.00132191, 0.00138485, 0.0014478, 0.00151074, 0.00157368, 0.00163662, 0.00169957, 0.00176251, 0.00182545, 0.00188839, 0.00195134, 0.0020144, 0.00195527, 0.00214028, 0.00208116, 0.00226617, 0.00220704, 0.00239205, 0.00233293, 0.00251794, 0.00245881, 0.00264382, 0.0025847, 0.00276971, 0.00271058, 0.00289559, 0.00283647, 0.0030216, 0.00296247, 0.00314748, 0.00308836, 0.00327337, 0.00321424, 0.00339925, 0.00334013, 0.00352514, 0.00346601, 0.00365102, 0.0035919, 0.00377691, 0.00371778, 0.00390279, 0.00384367, 0.0040288, 0.00409174, 0.00391054, 0.00397348, 0.00428057, 0.00434351, 0.00416231, 0.00422525, 0.00453234, 0.00459528, 0.00441408, 0.00447702, 0.00478411, 0.00484705, 0.00466585, 0.00472879, 0.005036, 0.00509894, 0.00491774, 0.00498068, 0.00528777, 0.00535071, 0.00516951, 0.00523245, 0.00553954, 0.00560248, 0.00542128, 0.00548422, 0.00579131, 0.00585425, 0.00567305, 0.00573599, 0.0060432, 0.00598407, 0.00592494, 0.00586581, 0.00629497]

The number appears quite random. However, not uniformly random between 0 and 1.

It will be too technical for me to implement/fix a random number generator.


#5

Just use the one in GPUArrays, it should work perfectly fine with CLArrays && CuArrays!

Here is an example of how you can use it:

I should have a look if we can integrate xorshift as well… Looks a bit simpler then what I ported!


#6

Here are the important lines:

function kernel_rand!(state, randstate)
    random_number = GPUArrays.gpu_rand(Float32, state, randstate)
    return
end
A = .... # your gpu array, CuArray / CLArray
randstate = GPUArrays.cached_state(A)
gpu_call(kernel_rand, A, (randstate,))

#7

@vchuravy and @sdanisch I may have resolved the xorshift version.

@inline function rand_float32(state::UInt32)::Tuple{Float32,UInt32}
    state = xorshift(state)
    res =  state  | reinterpret(UInt32, 1f0)
    val = reinterpret(Float32, res) - 1f0
    return val, state
end

Yields

Summary Stats:
Mean:           0.503629
Minimum:        0.001107
1st Quartile:   0.253072
Median:         0.505304
3rd Quartile:   0.755910
Maximum:        0.999141

For 1,000 numbers. I would not know how else to test for randomness or uniformity. It seems “ok” to me now.


#8

I guess the next step now is to figure out how to store the state so that it gets used beyond a kernel call. Otherwise, I would just keep getting same sequence of numbers.

I probably would need to create an array that store a state for each idx value.


#9

Well, you can look at GPUArrays, where i do exactly that…


#10

My bad,

res =  (state >> 9) | reinterpret(UInt32, 1f0)

Is the correct version. You really need state >> 9 in there.

@sdanisch Thank you for your suggestion. The reason I have not tried GPUArrays yet is because I got started with CUDA on Julia following this guide: https://julialang.org/blog/2017/03/cudanative which rather mentions CUArray.

I will try to explore GPUArrays in the future, but currently it seems to have some learning curve on top of the CUDANative. A good tutorial on how CUDANative and GPUArrays work together would be very helpful.


#11

CuArrays inherits from GPUArrays, so you’re basically using it already. For how to use it, you really just need the lines i posted…


#12

here some more documentation of gpu_call: https://juliagpu.github.io/GPUArrays.jl/latest/#GPUArrays.gpu_call.
Note, that with CuArrays gpu_call will directly dispatch to @cuda… Just with the added advantage, that it will also work with CLArrays.
There is also my blog post explaining some more things: https://techburst.io/writing-extendable-and-hardware-agnostic-gpu-libraries-b21c145a8dad


#13

@sdanisch thank you very much for your help. I will look into it. The support and enthusiasm of the Julia community is very promising. I already got enough help to get my work done with Julia on the GPU to solve my problem. From now on, it is just refinement and improving efficiency.