GPU Genetic Algorithm

fender9723 · April 8, 2024, 10:02pm

Hello everyone!

I’m fairly new to Julia and am attempting to enhance the performance of my code.

I have a straightforward optimization problem involving finding the maximum of a function. The variables in my problem are binary variables (0-1), and the objective function is a linear combination of these variables.

The challenge arises from having potentially 10^5-10^6 independent problems to solve. While they all share the same objective function and number of variables, each problem has its set of linear combination coefficients.

To tackle this, I’ve implemented a genetic algorithm using the CUDA.jl package and have endeavored to utilize my NVIDIA GeForce 840M to solve these problems in parallel. However, my 4 GB GPU is insufficient to accommodate 10^6 problems, so I’ve started with 100-1000 problems.

Here are my questions:

In the generation loop, I anticipate that each kernel execution completes before proceeding to the next kernel. For instance, does EvaluateFitnessGPU complete before EvaluateBestIndividualsGPU? Is this true?
I’m not particularly fond of preallocating the crossover indices idx1 and idx2 for each generation. I’ve also attempted to generate the indices within the generation loop, but I couldn’t find an efficient way to generate random integers directly using CUDA functions. The conversion from Array to CuArray is time-consuming.
Do you have any suggestions to enhance the performance of my GA?

Context:
I’m currently working on a structural analysis project related to finite element analysis. My objective is to maximize the von Mises stress (or other stress measures like the failure index or maximum principal stress) in a structure by optimizing the scaling coefficients of unit loads applied to it.

Basically, I’m searching the worst load combination for each finite element in my model.

In my analysis, the von Mises stress is computed based on a linear combination of stresses resulting from unit loads on the structure. Each unit load contributes three stress components: sxx, syy, and sxy. By scaling the unit loads with certain coefficients, we can effectively scale the resulting stresses.

To tackle this optimization problem, I’ve reformulated the scaling coefficients into binary variables. These binary variables represent discrete values that the scaling coefficients can assume, simplifying the optimization process.

ThreadsGA_3.jl (9.9 KB)

gdalle · April 9, 2024, 6:08am

Sorry for the slightly off-topic response but have you considered modeling this as an Integer Linear Program with JuMP.jl, and solving it with HiGHS.jl? If the 10^5 problems are similar and reasonably small, I think you could take advantage of a warm start mechanism to speed up resolution by a lot.

gdalle · April 9, 2024, 6:27am

Also what do the constraints look like on your binary variables?

fender9723 · April 9, 2024, 6:33am

Thank you @gdalle, but actually this is not the only function that I have to maximize. This is a test case with a linear function (it allows me to monitor the quality of the results) but in my real life problem the functions are non-linear and probably non-convex.

I’ve looked at JuMP at the beginning but I didn’t find a GPU implementation. Not looked at HiGHS.jl actually, and I’m doing right now.

To be clear: I’m not forced to use a GPU, but the high numbers of small and fast independent problems suggest a high possibility of vectorization.

My binary variables have no other constraints than being binary. That’s why in the attached file I’ve implemented my population with Bool type.

gdalle · April 9, 2024, 6:35am

Indeed JuMP doesn’t natively support GPUs for handling models, and only some of its solvers support them for finding solutions.

In that case I agree that the integer programming approach becomes much more tricky, and (meta)heuristics start to make sense. Does your problem have any physical meaning you could exploit?

gdalle · April 9, 2024, 6:37am

In this test case, you can easily find the optimum by just setting the binary variables to the sign of the associated cost coefficient, so it might be slightly too easy as a benchmark?

ChrisRackauckas · April 9, 2024, 7:06am

This is a library for nonconvex nonlinear optimizers which live as GPU kernels.

fender9723 · April 9, 2024, 7:28am

Thank you very much for your help @gdalle .

I’ll try with more complicated function later on. The computational cost of evaluating the function will anyway remain the same.

Yes, it does. I reply here, then I will modify my initial post to make it clearer.

I’m currently working on a structural analysis project related to finite element analysis. My objective is to maximize the von Mises stress (or other stress measures like the failure index or maximum principal stress) in a structure by optimizing the scaling coefficients of unit loads applied to it.

Basically, I’m searching the worst load combination.

In my analysis, the von Mises stress is computed based on a linear combination of stresses resulting from unit loads on the structure. Each unit load contributes three stress components: sxx, syy, and sxy. By scaling the unit loads with certain coefficients, we can effectively scale the resulting stresses.

To tackle this optimization problem, I’ve reformulated the scaling coefficients into binary variables. These binary variables represent discrete values that the scaling coefficients can assume, simplifying the optimization process.

fender9723 · April 9, 2024, 9:18am

Thank you very much @ChrisRackauckas , I took a look to PSOGPU.jl.

Unfortunately, it does not have an explicit way to optimize multiple independent problems in parallel, which is the main reason I developed a GA myself.

But this is interesting anyway. Do you know if there is a ready-to-go way to specify binary/integer variables?

ChrisRackauckas · April 9, 2024, 9:20am

It does not do mixed integer.

What do you mean by this? Maybe we could incorporate it.

fender9723 · April 9, 2024, 9:32am

Maybe you missed the context of my application (I’ve edited the original post):

fender9723:

Context:
I’m currently working on a structural analysis project related to finite element analysis. My objective is to maximize the von Mises stress (or other stress measures like the failure index or maximum principal stress) in a structure by optimizing the scaling coefficients of unit loads applied to it.

Basically, I’m searching the worst load combination for each finite element in my model.

In my analysis, the von Mises stress is computed based on a linear combination of stresses resulting from unit loads on the structure. Each unit load contributes three stress components: sxx, syy, and sxy. By scaling the unit loads with certain coefficients, we can effectively scale the resulting stresses.

To tackle this optimization problem, I’ve reformulated the scaling coefficients into binary variables. These binary variables represent discrete values that the scaling coefficients can assume, simplifying the optimization process.

Imagine you have thousands of elements, and you want to find the worst load combination for each of them. Those are independent optimization problem.

ChrisRackauckas · April 9, 2024, 9:44am

I see, I missed the binary nature of the problem in a first read. You can adapt PSO other other evolutionary approaches to such problems, but it’s generally not so efficient. But there may not be a better option here, at least for now.

But yes, GPUPSO is currently setup a little differently. It solves A global optimization problem on the GPU with a particle swarm (an algorithm quite similar in some respects to a genetic algorithm), distributing the steps of the swarm across the GPU. However, its asynchronous form uses the fact that the PSO steps itself are compilable as GPU kernels. If that was exposed a bit differently that could be an API for running a GPU parallel solve where each solve is a particle swarm itself. Then you’d just need to do one of the many approaches to make a PSO handle mixed integer, like https://www.sciencedirect.com/science/article/abs/pii/S0951832008002251, and you’d be good.

We can open an issue on GPUPSO and see how hard it would be to expose that.

fender9723 · April 9, 2024, 10:07am

Fortunately, evolutionary methods have demonstrated good performance (they got almost always to global optimum) in the test cases I experimented with in both MATLAB and Python (with standard CPU implementations). Before transitioning to Julia, I utilized MATLAB (of which I have 5 years of intense programming activities) and then had a try with Python.

If you’re wondering why I didn’t use MATLAB GPU computing, it’s because it requires the Parallel Computing Toolbox which is very expensive.

Anyway, genetic algorithms or particle swarm optimization appear to be promising options for addressing my current problem.

I have attached some code with a basic implementation.

Does it works? It looks like it does, but I’m unexperienced in Julia and I’m sure that it’s inefficient and probably contains some rookie mistake in handling kernels and memory. Maybe it can inspire a more skilled programmer to come with a more efficient solution.