We want to start a new bachelor project soon, with the goal of using Genetic Programming to somehow predict equations of motion from time series. There has been quite some work in this field, notably the publications:

Distilling Free-Form Natural Laws from Experimental Data, M. Schmidt and H. Lipson, Science 2009

Automated reverse engineering of nonlinear dynamical systems, J. Bongard, H. Lipson, PNAS 2007

The book by Koza: “Genetic Programming: On the Programming of Computers by Means of Natural Selection”

I have found that this package exists: ExprOptimization.jl, which is part of the Stanford Intelligent Systems Laboratory and seems to be developed solely by @rcnlee.

I was wondering whether the community has more suggestions, or if the author (Richie Lee) has some comments on whether this package is a good place to start. My only fear is that ExprOptimization.jl is under-documented, which will make things hard for a bachelor student, or anyone who would like to expand the package, really.

In addition, the speed seems to be on the slow side as well: I have been running the example given in the notebook for more than 10 minutes now and it still hasn’t finished computing. Of course, I have to say that I have never used Genetic Programming before, so I have no clue how fast these things should be!

In general we would like to create something that can be incorporated into the DynamicalSystems.jl ecosystem and the two goals we have are:

Predict system conserved quantities using Julia

Predict or at least indicate form that equations of motion might have, again using Julia.

P.S.: If this topic is better suited in another category, please do not hesitate to move it.

The senior author is a big proponent of Julia as well. If you’re interested, we can discuss building something in Julia based on this.

Instead of optimizing over expressions, what you can do is put together a more generic equation of motion which has all of the terms you think you may need, and then do an optimization with a regularization which is chosen to sparsify the solution.
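As a rough sketch of that idea (the candidate terms, the threshold, and the `sparse_fit` helper are all invented for illustration; this is the sequential thresholded least-squares style of sparsification, not any particular package’s API):

```julia
using LinearAlgebra

# Suppose we measured a state x(t) and its derivative dx/dt at n time points.
# Build a library of candidate terms we *think* might appear in the equation
# of motion, then find a sparse combination of them that reproduces dx/dt.
function sparse_fit(Theta, dxdt; threshold = 0.1, iters = 10)
    xi = Theta \ dxdt                       # initial least-squares fit
    for _ in 1:iters
        small = abs.(xi) .< threshold
        xi[small] .= 0.0                    # zero out negligible terms (sparsification)
        big = .!small
        xi[big] = Theta[:, big] \ dxdt      # refit only the surviving terms
    end
    return xi
end

# Toy data generated from dx/dt = -2x + 0.5x^3:
x = range(-1.0, 1.0, length = 100)
dxdt = -2 .* x .+ 0.5 .* x.^3
Theta = hcat(ones(length(x)), x, x.^2, x.^3)   # candidate terms: 1, x, x², x³
xi = sparse_fit(Theta, dxdt)
# xi should be approximately [0, -2, 0, 0.5], recovering the true equation
```

The regularization here is the hard threshold; an L1 penalty would play the same role in a proper optimization formulation.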

This reminded me of an article I saw about deducing the Maxwell Equations from experimental data (with Julia).
I am not sure how relevant this is, since from what I understand it is more related to AI.
You can find more info here: [1708.04927] TheoSea: Marching Theory to Light, also

Thanks for your kind replies! @SebastianM-C I have also seen this talk, but they don’t seem to have any code anywhere.

@mohamed82008 Thank you for your suggestion. Unfortunately I simply could not understand what this package is about. The first paragraph of the readme left me utterly confused:

using BlackBoxOptim

function rosenbrock2d(x)
    return (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
end

We can now call the bboptimize() function, specifying the function to be optimized (here: rosenbrock2d()) and the range of values allowed for each of the dimensions of the input:

res = bboptimize(rosenbrock2d; SearchRange = (-5.0, 5.0), NumDimensions = 2)

We get back an optimization result object that we can query to, for example, get the best fitness and solution candidate:

best_fitness(res) < 0.001 # Fitness is typically very close to zero here...
length(best_candidate(res)) == 2 # We get back a Float64 vector of dimension 2

What are they even optimizing?.. I see no definition of a loss function or anything, so I truly have no idea what the package does. Reading their README even further, I realized that this is not really related to what I am looking for. But I am really happy you tried to help!

It’s an optimization package: algorithms to find the minimum of functions, in this case the Rosenbrock function, which is a classical function for testing optimization algorithms. In your project you probably want to write down some sort of objective function and optimize it, so that’s where an optimization package comes in.

That said, Python’s implementation of CMA-ES is probably better than BlackBoxOptim’s algorithms for most problems.

As @jonathanBieler said, the loss function is the Rosenbrock function. “Loss function” is just a code name for any function you are trying to minimize; it doesn’t have to be some sum of squared errors, it can be any function you define, as long as it is bounded from below (i.e. has some minimum). Think of the input x to the function above as the parameters of a regression model, and of the function’s return value as the equivalent of the sum of squared errors. Most of the unconstrained single-objective evolutionary algorithms are actually pretty simple, so it should be relatively easy to dive deeper into the package, if you want, and then maybe make it faster. Or you can use the Python alternatives instead.
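To make that concrete, here is a minimal sketch of plugging a regression-style loss into `bboptimize` (the linear model and data are invented for illustration; the keywords are the ones from the README quoted above):

```julia
using BlackBoxOptim

# Fake "measurements" from a model y = a*x + b with a = 2, b = -1.
xs = collect(0.0:0.1:1.0)
ys = 2.0 .* xs .- 1.0

# The loss takes candidate parameters p = [a, b] and returns the sum of
# squared errors -- it plays the same role as rosenbrock2d above.
loss(p) = sum((p[1] .* xs .+ p[2] .- ys).^2)

res = bboptimize(loss; SearchRange = (-5.0, 5.0), NumDimensions = 2)
best_candidate(res)  # should be close to [2.0, -1.0]
```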

Speed depends on the number of variables and the computational cost of evaluating the objective. I think tens of variables is probably a practical limit; hundreds is really stretching it, but if your objective is very cheap to evaluate, then it is possible. Anyway, the algorithm can give you the best solution found after N iterations (or function evaluations), which you can specify, so you can control the time yourself, just not the quality of the solution.
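If I remember the BlackBoxOptim options correctly (please double-check its README), the search budget can be capped with keywords such as `MaxFuncEvals` or `MaxTime`, e.g.:

```julia
using BlackBoxOptim

rosenbrock2d(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2

# Cap the number of objective evaluations so the run time is bounded:
res = bboptimize(rosenbrock2d; SearchRange = (-5.0, 5.0), NumDimensions = 2,
                 MaxFuncEvals = 10_000)
```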

Ok, thanks. Unfortunately I do not have enough experience on the topic.

You both suggested Python, but the project has to be done in Julia. If we cannot proceed with genetic programming, we will find another project to implement in Julia.

EDIT:

Well, that is of course understandable. But the example I ran had only a single variable and two multiplications and additions, so I would say it is pretty small. So do you have any estimate of “time to solve per iteration / per population size”? On my laptop it took 10 minutes, which I thought was very slow, given that crazily complicated ANN models can take only a couple of minutes.

Note that DiffEqParamEstim.jl is already set up with testing for NLopt.jl, BlackBoxOptim.jl, and Evolutionary.jl (its master branch… it really needs a tag). DiffEqParamEstim is documented here:

Regularization can be used to enforce sparsity constraints, in which case it’s then not just about predicting parameters, but predicting which parameters are zero as well, and thus constraining the possible equations of motion from a starting set.
Note that these convenience functions autodiff through the solver and the like to give gradients to the optimization algorithm.

Note we are building very similar Bayesian tools which may be helpful here too.

10 minutes for 1 variable and 2 multiplications and additions is pretty darn slow. If the problem is this simple, you can literally try every value the variable could take (with some small step size) and then just choose the best solution!

I never used ExprOptimization before, but the example seems to be doing symbolic regression with an expression tree of maximum depth 10, and that’s not one variable! In the worst case this could be up to 2^10 − 1 = 1023 function nodes! I never did this before, though, so I am not sure of the details of the package or of the technique used.

Also, if you can get a cheap estimate of the derivative, then genetic algorithms are not the right choice; you should go for gradient-based methods like the ones in NLopt. I think ForwardDiff should make gradient-based methods almost as “black-box” as genetic algorithms (as long as the loss function is compatible with ForwardDiff), and if the loss function is sufficiently smooth, gradient-based methods could easily beat genetic algorithms, I think.
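A minimal sketch of what that looks like, assuming the loss is ForwardDiff-compatible (plain gradient descent stands in here for a proper NLopt solver; the step size and iteration count are made up):

```julia
using ForwardDiff

loss(p) = (1.0 - p[1])^2 + 100.0 * (p[2] - p[1]^2)^2  # Rosenbrock again

p = [-1.0, 1.0]
for _ in 1:50_000
    g = ForwardDiff.gradient(loss, p)  # exact gradient via forward-mode AD
    p .-= 1e-3 .* g                    # plain gradient-descent step
end
# p drifts toward the minimizer [1.0, 1.0]
# (slowly -- Rosenbrock's curved valley is hard for plain gradient descent)
```

A real quasi-Newton method (e.g. L-BFGS from NLopt) would use the same gradient and converge far faster.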

I am the main developer of ExprOptimization.jl. The package provides a number of algorithms to solve GP-style expression optimization problems for Julia expressions, so it is particularly well-suited for symbolic regression problems.

As you can see from the example notebook, it’s pretty quick to get started. You specify a grammar and a loss function and you’re good to go to try different algorithms. I think it’s quite suitable for a student project.

In addition to the example notebook, there is documentation in the source files. I am also still actively maintaining and working on the package, so you can always reach out to me or file an issue on GitHub if anything comes up.

The algorithms actually spend the majority of their time evaluating the fitness function (as they should). You can verify this by timing the fitness function. The symbolic regression example in the notebook had some poorly chosen parameters. I’ve made some adjustments that make the example run a lot faster now, so please take another look at it. It should take less than half a minute to run now. In particular, (1) I sped up the fitness function by adjusting the discretization; (2) I reduced the max_depth of the derivation trees searched; and (3) I reduced the number of evaluations of the fitness function by using smaller population sizes and fewer iterations.

Feel free to contact me directly if you would like to discuss further. My email is on the repo’s wiki.

I really appreciate your reply. Reading your answer in detail, I feel much more confident in suggesting the mentioned project as one of the available projects to the student.

We also have a lot of ideas on how one can add more functionality than what already exists, for example the partial-derivatives method used in the Science paper I mentioned, as well as a completely novel application.

For more info, I will contact you as you suggested. I am also very happy that you are willing to help with issues on using the package!