Implementing a basic Genetic Algorithm using GPUArrays

So I like playing with Genetic Algorithms and other Evolutionary Algorithms and have gotten nice speedups via using threads before.
So I saw GPUArrays and felt it was a good enough reason to setup a linux box again!
My basic implementation I start with is just solving the simple spherical function i.e. all inputs = 0
fitness = sum(chromosome .* chromosome ← slow version of the fitness function.

The individual is just a custom type, can be mutable or immutable with just an array of floats and a single value for fitness.
So I grab my old implementation and do the conversion and I see the error: ERROR: LoadError: ArgumentError: CuArray with non-bit element type not supported
Quickly checking then Array of Arrays and I get the same error using the CUBackend. CL backend gives different errors:
ERROR: LoadError: argument is an abstract type; size is indeterminate for an array of structs and
ERROR: LoadError: type does not have a canonical binary representation for an array of arrays.

What would then be the recommended way to initialise a population on the GPU? It would either just be a Vector{Vector{Float64}} which would then have a second Vector{Float64} containing the fitness or just Vector{Individua} for the standard way of implementing it. However, neither of those work!

So what am I missing here? Is it possible to implement this right now using GPUArrays or would I need to use CUDAnative directly (can I use it directly to do this even?)

Thanks for the assistance!

GPUArray(x::Vector{Individua}) with Individua being an immutable, concrete struct.
Your error suggests, that you don’t have a concrete type. So make sure you’re not using something like:

immutable Individua{T} end
Vector{Individua}(...) <-- Individua is abstract

If this doesn’t work, a nice short reproducible code sample would be great!
With GPUArrays, you can also use the JLBackend, to test if there is no error in the Julia code - this should give you better error messages. Make sure to be on master with GPUArrays right now, that should have a lot of bug fixes - waiting for a final fix before tagging.

I see I made a typo -_- Individual not Individua

This is the old version which uses mutable struct not struct

before I altered any logic I made a minimal example;

Which should be a concrete type? Base.isabstract returns false for what its worth.
I am on the master, have been since your massive pull request was merged!

Thanks for the help and this amazing package

1 Like

Ahh, you can’t just have a vector in there…
Can you use a tuple with a fixed size instead?
Otherwise, you’d need to go through some lengths to get variable length vectors in there.
E.g. upload it to another GPUArray and then just store ranges in Individual.

Well, that was a great tip!
I ended up using staticarrays SVector as it also happened to work in place of a tuple (I haven’t used tuples much) I wonder though why it is not possible to use an MVector? I have large structures that would be more suitable to mutate than continuously copy. That being said - SVector does work!

The documentation for gpu_call is out of date. So are the examples. I used the tests as examples. The switch of function and array threw me for a loop as the error was very different from what I needed.
State was also confusing until I realised that state was the count of the array this function was being launched at. So instead of an internal loop - just assign with state as the position (I think, JLBackend gave good results).

Still haven’t gotten it running on the GPU. I need CUDAnativelibs for random numbers gpu side it seems. Or it may be something else - I am getting told kernel functions may not return Union{} where the code would suggest Void in its place. However, further reading and editing the offending line makes me think it could be the random number generator which is causing issues.

If you have a chance - what exactly do you mean by upload it to another GPU array then just store ranges?
Initialise the arrays separately device side and store the device pointers only in Individual? I’m not too sure what you meant by that.

Just repeating for visibility: What stops a MVector from being used in the struct where SVector has no issues? Is this a permanent limitation or something that may be possible with a more mature library?

Thanks for being a great help to a newcomer to GPU stuff!
I can’t say thanks enough really. After hours of frustration - a solution has been found! Progress has been made!

This is the code up to where I am at:

I don’t know much about GPUs in Julia, but I know enough to answer part of your question:

The reason that SVector works where MVector does not is that an SVector is a “bits” type: A bits type is one that is immutable and is composed entirely of primitive values, tuples, or other bits types. A static, immutable vector of, for example, Float64s is a bits type, but a mutable vector is not.

Bits types are somewhat special in Julia: in particular, Julia guarantees that if you have an Array of bits types, those types will be stored directly inline in the array, and not as pointers. That’s probably why the bits distinction is important for GPU arrays.

Bits types are also super important for performance, because they can generally be stack-allocated. As a result, creating a new instance of a bits type has very little performance cost. That’s why SVectors are so useful for code in which you’re creating new vectors frequently within a loop.

So, that’s all to say that there’s not much that StaticArrays.jl could do to make an MVector the same as an SVector. It would have to be a change in Julia itself.

1 Like

Thanks! I was reading about primitive types instead of bits types it seems. That massively clears up my confusion on that part. On the rest of it though… When you get a different erroor 3 runs in a row that’s when you call it a night!

@rdeits is right :wink:

Another issue with that is, that those pointers point to heap memory. So even if you’d have a heap on the GPU (which you don’t) it would need to get transferred to the GPU heap - which isn’t impossible but someone would need to implement it.

@Mikewl, I’m sorry about the documentation, you’re a few days to early :wink:
Btw, Transpiler and even CUDAnative are not as stable as I like them to be. Even I run frequently into errors when I try new code with both of them. That’s why the Julia backend is very helpful as a sanity check.

The state is a new change, to make it easier to implement the same behaviour for different backends, e.g. the Julia threaded backend.
OpenCL/CUDA work a lot with globals to get e.g. the thread index / block size, which is more elegant to implement in a julia backend by passing the state object.

If you need random numbers inside a kernel running on the GPU, CURAND won’t work for you yet…
Which is likely why you get a kernel functions may not return Union{}, which is indicative of an error in the kernel.

For random numbers, see: CuRAND.jl wrapper · Issue #32 · JuliaGPU/GPUArrays.jl · GitHub, Random numbers in Kernel Function · Issue #40 · JuliaGPU/CUDAnative.jl · GitHub
In the last issue, there are some alternatives mentioned. If you happen to port any of these to work on the GPU, let me know. Would be nice to have them in GPUArrays as a stop gap solution.

Some algorithms in GitHub - JuliaRandom/RandomNumbers.jl: Random Number Generators for the Julia Language. might also compile for the GPU (might need some changes).

I just did some experiments with some random generator code:

struct RandomResult{T}
    state::NTuple{4, Cuint}

function TausStep(z::Unsigned, S1::Integer, S2::Integer, S3::Integer, M::Unsigned)
    b = (((z << S1) ⊻ z) >> S2)
    return (((z & M) << S3) ⊻ b)

LCGStep(z::Unsigned, A::Unsigned, C::Unsigned) = A * z + C

function Random(state::NTuple{4, T}) where T <: Unsigned
    state = (
        TausStep(state[1], Cint(13), Cint(19), Cint(12), T(4294967294)),
        TausStep(state[2], Cint(2), Cint(25), Cint(4), T(4294967288)),
        TausStep(state[3], Cint(3), Cint(11), Cint(17), T(4294967280)),
        LCGStep(state[4], T(1664525), T(1013904223))

    return RandomResult(
        Float32(2.3283064f-10 * (state[1] ⊻ state[2] ⊻ state[3] ⊻ state[4]))

This seems to work relatively well… You just need to figure out how to store the state, without getting into synchronization issues between threads. Maybe store it per Individual, and make sure that you only ever have one thread work on an individual.

You do, actually: CUDA C++ Programming Guide
We just don’t hook into that yet. There’s also a bunch of “competing” GPU allocators which supposedly offer better performance on some allocation patterns.

That said, to support CUDAnative you could use cudaconvert to convert a CPU-struct to a GPU-compatible one (eg. using CuDeviceArray instead of CuArrays as fields). But having GPUArrays do that conversions automatically is quite… magical, and I’m not sure whether that’s on @sdanisch’s TODO list.

You could also decouple the the state from the RandomResult and have something like:

states = GPUArray(Vector{NTuple{4, Cuint}}(number_of_threads))

function gpu_rand(states)
    threadid = ...# however you decided to get your threadid. Maybe GPUArrays.linear_index(states, index_state)
    stateful_rand = Random(states[threadid])
    states[threadid] = stateful_rand.state
    return stateful_rand.value

Oh funny!

It’s not too high on the list, but would be nice. I first need to figure out how to combine DeviceArrays in structs with OpenCL first -.-. But I think I’m slowly closing in on a solution for that :slight_smile:

@sdanisch that random function works perfectly it seems.
The JLBackend was amusing in that just the normal threads it added were a speedup over the standard version already. I can see myself abusing that backend for the gpu_call function. I like having state around!

I have, thanks to the assistance here, gotten much further in my understanding as well as actually getting close to getting the problem solved.

The below code is my current issue. I assume the reason it is failing is because i allocate an array. I’m not too sure how to do this without allocating a temporary array to store the new values. I do this a second time in mutate as I need to update the values individually.

I also owe you guys a beer if I am ever in that side of the world!

function crossover(parent1::Individual, parent2::Individual, states, tstate)
    xover_chance = XR
    if gpu_rand(states, tstate) < xover_chance
        newChromosome = zeros(Float32, D)
        for i = 1:length(newChromosome)
            @inbounds newChromosome[i] = (gpu_rand(states, tstate) < 0.5f0)? parent1.chromosome[i]:parent2.chromosome[i]
        return Individual(SVector{D}(newChromosome), 0.0f0)
        return parent1
newChromosome = ntuple(Val{D}) do i
    gpu_rand(states, tstate) < 0.5f0)? parent1.chromosome[i]:parent2.chromosome[i]

Should work!
Just make sure D comes from a type parameter!

That’s odd… Maybe you “accidentally” fixed some performance problems while porting it? Threads are not added automatically, you need to actually start julia with JULIA_NUM_THREADS=8 julia

Wait, you can use the do syntax like that?!

That’s going to make using immutables so much easier. Though interestingly enough - the array version allocated far far less. Unfortunately it still didn’t work

ERROR: LoadError: error compiling genIndividuals: error compiling crossover: error compiling ntuple: generic call to _ntuple requires the runtime language feature from CUBackend

   in: 30
   in: Tuple{30}
   in: SVector{30,Float32}
   in: Individual
   in: Transpiler.CLIntrinsics.CLArray{Individual,1}
   in: genIndividuals(Float32, Transpiler.CLIntrinsics.CLArray{Individual,1}, Transpiler.CLIntrinsics.CLArray{Individual,1}, Transpiler.CLIntrinsics.CLArray{NTuple{4,UInt32},1})
ERROR: LoadError: Found non concrete type: 30

from CLBackend.

With the original version from my previous post which used arrays I get the same error from CLBackend. CUBackend however has the following error:

ERROR: LoadError: error compiling genIndividuals: error compiling crossover: emit_ccall for /home/mikewl/Dropbox/Julia Code/GAs/GPUArraysGA.jl:88 requires the runtime language feature,which is disabled

The emit_ccall is being called for both return statements. I assumed earlier it was the array allocation which was giving out a different error similar to with rand. I may have been mistaken though
The D I used with the do block was a constant which from my understanding should not have caused any issues.

Immutables allowed for the parallel portion to scale better than before which caused the nice boost.

That indicates it’s not actually inferred statically as we like it… I’ll try this and add it as a test to GPUArrays, since this should work!

Of course, this creates a closure capturing the GPUArrays, which isn’t supported yet.
This works for the CU + JL backend:

using GPUArrays


parent1 = GPUArray(fill((2f0, 2f0, 3f0), 100))
parent2 = GPUArray(fill((1f0, 4f0, 3f0), 100))
gpu_rand(a, b) = 1f0
@generated function ntup_rand(::Val{N}, idx_state, states) where N
    expr = Expr(:tuple)
    for i = 1:N
        push!(expr.args, :(gpu_rand(idx_state, states)))

function crossover(parent1::NTuple{N, T}, parent2, idx_state, states) where {N, T}
    rands = ntup_rand(Val{N}(), idx_state, states) .< 0.5f0
    ifelse.(rands, parent1, parent2)

result = crossover.(parent1, parent2, 0f0, 0f0)

It works!

It looks like there is a limitation on ntuples which gives the largest ntuple allowed to be length 14. It wants to allocate otherwise.

Pretty damn happy to have gotten this working though!
Learned a lot of new things too!