[ANN] SymbolicRegression.jl - distributed symbolic regression

Excited to announce SymbolicRegression.jl, a high-performance package for learning equations via regularized evolution! It supports distributed computing, allows user-defined operators (including discontinuous ones), and can export to SymbolicUtils.jl.

Here’s a demo:


You can install it with Pkg.add("SymbolicRegression")
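For a quick feel of the API, something along these lines (a rough sketch; see the documentation for the exact keywords in your installed version):

using SymbolicRegression

# Toy data: 2 features × 100 points, with a known generating equation.
X = randn(Float32, 2, 100)
y = 2 .* cos.(X[2, :]) .+ X[1, :] .^ 2 .- 2

options = Options(;
    binary_operators=(+, *, /, -),
    unary_operators=(cos, exp),
    npopulations=20,
)

hall_of_fame = EquationSearch(X, y; options=options, niterations=40)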

Check out the repo and documentation. There’s also a Python frontend: PySR.

The package is built on lightweight binary trees that allow for quick allocation of new equations, and pre-compiled, memory-efficient evaluation of arbitrary equations with degree-1 and degree-2 operators. It uses a combination of regularized evolution, simulated annealing, and gradient-free optimization (via Optim.jl). Importantly, no gradients are used in the package, which makes it possible to work with discontinuous functions such as logical operators, and with operators that have singularities in general.
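For example, a discontinuous user-defined operator can be passed just like a built-in one (a sketch; heaviside is only an illustrative name):

# A step function: discontinuous at zero, which would break gradient-based
# methods but is fine for this search.
heaviside(x) = x > zero(x) ? one(x) : zero(x)

options = Options(;
    binary_operators=(+, *, -),
    unary_operators=(heaviside, cos),
)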

SymbolicRegression.jl is built for regression problems: you want to find an analytic function f:\mathbb{R}^m\rightarrow\mathbb{R} such that f(X_{i, :}) \approx y_i \; \forall i, given inputs X and targets y. For finding differential equations via sparse regression over a basis, you should check out SINDy, which is implemented in DataDrivenDiffEq.jl.

Symbolic regression via evolution like in SymbolicRegression.jl, in general, scales quite badly with the number of input features—this is the drawback of avoiding gradients. To get around this you can either estimate feature importances and only include the most important, or, to work with a very large number of input features (such as an N-Body dataset), you can try this method: [2006.11287] Discovering Symbolic Models from Deep Learning with Inductive Biases, which has code here (blogpost here). Essentially you first fit a neural network to your data with a sparsity regularization, and then fit equations to the trained sparse network. This also allows one to apply other (differentiable) constraints to a learned model - e.g., one can learn Lagrangians by training a Lagrangian Neural Network, and then fitting equations to the Lagrangian term in the learned model.
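For concreteness, a rough sketch of that two-stage idea (using Flux.jl here; the layer sizes, L1 weight, and training loop are illustrative placeholders rather than the paper’s actual setup):

using Flux

# Synthetic stand-in data: 10 features, only the first two matter.
X = randn(Float32, 10, 500)
y = reshape(X[1, :] .* X[2, :], 1, :)

model = Chain(Dense(10 => 64, relu), Dense(64 => 64, relu), Dense(64 => 1))
opt_state = Flux.setup(Adam(1f-3), model)
λ = 1f-3  # L1 penalty on the first layer, pushing it to drop unused inputs

for epoch in 1:2_000
    grads = Flux.gradient(model) do m
        Flux.mse(m(X), y) + λ * sum(abs, m[1].weight)
    end
    Flux.update!(opt_state, model, grads[1])
end

# Rank input features by how strongly the (now sparse) first layer uses them,
# then run the symbolic search on the strongest ones (or on the network's
# internal functions, as done in the paper).
importance = vec(sum(abs, model[1].weight; dims=1))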

Thanks to @patrick-kidger and @ChrisRackauckas for helping make the code more generic and easy-to-use, @shashi @mason and @yingboma for building SymbolicUtils.jl (used for internal simplification and export), and @marius311 @Skoffer @Henrique_Becker for advice on this forum which led to big improvements in speed!

Would love to hear any feedback, tips, ideas, etc.
Cheers!
Miles

40 Likes

Hi @MilesCranmer and thanks for this package!

  1. In what kinds of cases would I prefer this over traditional supervised ML methods such as gradient boosting?
  2. Is it possible to include function composition as a binary operator?
    binary_operators=(+, *, /, -, ∘) doesn’t work
1 Like

Whoa…I didn’t even know about symbolic regression. I can’t wait to try this out!

4 Likes

Thanks!

Re: @Albert_Zevelev, Great questions!

1

Pros: (i) easy to interpret the trained model and relate to existing theory, (ii) compact, (iii) fast evaluation, (iv) potentially significantly better generalization (see page 9) for physical systems, maybe due to analytic functions being a good basis for representing physics (look up any physics 101 cheat sheet—it’s entirely analytic expressions!). Cons: (i) slow to train, (ii) scales very badly with respect to number of input features, (iii) can be very difficult to train and tune, (iv) underdeveloped field.

(interestingly, modern ML has nearly the opposite pros/cons! I guess you can take your pick depending on the problem. Or you can mix them as done in that paper.)

2

The only operators allowed (at least for now) are f:\mathbb{R}^n \rightarrow \mathbb{R} for n\in \{1, 2\}. This rigidity allows better performance since I can pre-compile the evaluation of arbitrary trees, and fuse operators together to allocate fewer temporary arrays. You could compose some operators beforehand if you have prior knowledge that some particular composition will be good for your domain, but unfortunately I don’t have it set up to do that automatically yet.
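For example, you could bake a composition in by hand as a single unary operator (a sketch; expsin is only an illustrative name):

# A pre-composed operator counts as one unary node in the tree, so the search
# does not have to rediscover the composition. Pass it like any other operator,
# e.g. Options(; unary_operators=(expsin, cos), binary_operators=(+, *)).
expsin(x) = exp(sin(x))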

But this is a really interesting idea! It seems like allowing function composition would allow for repeating patterns in an equation without the search needing to find the terms separately… I will think about how this could be implemented.

Cheers,
Miles

3 Likes

That’s cool. Some other (similar) interpretation (I think). Keep in mind this might sound like a lot of gibberish, but it’s just based on my intuition. It depends on data ↔ data ↔ data ↔ data (Bell’s inequalities; I think this is best left as a probabilistic interpretation, though depending on the ambient program you could guess the discrete df and give it stochastic support). Things can normalize to 1, but then you reach some curse-of-dimensionality problems and you almost have to picture a workgroup of FPUs and MMUs “communicating with each other”. In practice, some things commute and some things anti-commute; you can easily form algebras with those.
https://theworld.com/~reinhold/bellsinequalities.html
http://gee.cs.oswego.edu/dl/aliasing/aliasing.html ← you are an x. If you generalize, for example, a k-CFA algorithm like Julia’s multimethod architecture, how is y affected by x? How would one decide on a data flow to communicate and prevent any data-race problems? (I’m thinking from many different perspectives, but primarily about macro/micro task optimization of recurrent stuff on a modular cross-platform web platform, e.g. https://servo.org/ (max code automation, min time, with a bunch of match statements in Rust and Cargo hauling JSON group ‘blobs’ and differentiating ‘svg’/‘css’ ‘state’ as a driver perturbs it relative to Deno/V8). At each level I’m trying to jump between a bunch of ‘test cases’ that implement something similar to V8 without giving it too much ‘free’ cross-entropy, but implement cross-platform, cross-language transpilation so that it can run on new hardware like WebXR, while trying to band-limit the digital = ‘perfect finite matching timeslices of some Fourier-like number’ perspective.)

Also, I think Tropical Geometry (think higher order than Hessians, but quicker shortest paths to change from dynamic state to dynamic state; it will get you in the right ballpark if you have to backtrack. What invariances are broken?) and Morse Theory (if you had to measure a bunch of information and a bunch of other communication, how are they related? Usually enumerative quantification is important because there will almost always be some discretization problems) help in practice improve things on modern systems (in the cloud, but the addresses probably need to be sanitized relative to other similar things that differentiate between states).

Electrodynamics/high energy physics perspective (maybe better from a quantum information science perspective):
I like the interpretation of “coupled electrodynamics” because it seemed to ‘explain’ transistor/integrated-circuit miniaturization since about 1959, as well as a lot of the superconductor stuff that turned out to be too brittle. It turns out that if you decouple the idea of the electric field and the magnetic field, in a post-Moore’s-law era you might want to start thinking of other ways of storing ‘communication flow’ besides electrons (I don’t even know the name of the Noether current, but maybe graphene? Wild stab in the dark. How about more integrated (ensembles of) models where statistical-physics models apply? How is error tracked and matched, and under what type of invariance management? Maybe in the future we’d need to talk in the language of Bose–Einstein condensates; some of the fractional quantum Hall effects seem like they are more “wild”, “exotic” states of matter than you think if you look at new archs). For example, in theory (but keep in mind different companies are chasing these)

Computer Engineering (the physics of computer science or computer architecture) Perspective:
Keep this in mind: speculative execution was/is great to parallelize a bunch of distributed work in some pipeline, but even in the late ’90s people were trying to figure out, from the hardware side, do we run both ways of a joint/marginal ‘branch’ and execute both speculatively, then ‘differentiate’ them? What would be the witnesses of the differentiation and integration? How coupled and dense is the compute? How quick is it, and how does it vary with voltage and heat? These days, there is some software trying to simulate similar things; I think SIL (Swift’s IR) and MLIR seem to be designed with this in mind to exploit it (imho there are deep connections I see in group theory, Galois theory, etc.):

  • keep in mind some of these are good and bad (it depends on the design and what else you’re doing). I remember ‘accidentally’ encountering this in the late ’90s with maximum-likelihood (squarefree) inference on even older ‘archs’:
    Meltdown (security vulnerability) - Wikipedia

@sashmit, it is not by improving an absurdity that we prove a certain intelligence: it is by suppressing it (Jean de Lattre de Tassigny)

Wow! That’s really cool. Coming from a more classic stats background, symbolic regression (at least conceptually) seems intuitive. Anyone have any recommended reading on theory? I would absolutely love to try this package out.

Nice package! I quickly played around with it, trying to find the equation for a rather simple function of a single variable. I had to run it with quite a high population and iteration count, though, to have it find the correct structure of the equation.

Three ideas for potential improvement:

  • How easily could local optimisation of coefficients / constants be included? (At least for cases with only continuous operators.) This could largely improve the convergence behaviour.
  • Intermediate structural simplification. I often saw terms like (x + x) or (x * 2.072...) + x, which is then penalised for complexity which it does not actually have.
  • Termination criterion: always hard with those heuristic gradient-free methods, but could help. Or just some measure of improvement over the iterations, to be able to manually stop when there was no further progress for many iterations.
1 Like

Thanks for trying it out!

What equation did you try it with?

Note that there was recently a (silent) bug in it, which was live over the past week - it hurt the performance quality significantly. This was fixed in 0.12.6: Fix search performance problem #148 by MilesCranmer · Pull Request #149 · MilesCranmer/SymbolicRegression.jl · GitHub. If you try again with 0.12.6 it will do much much better.

How easily could local optimisation of coefficients / constants be included? (At least for cases with only continuous operators.) This could largely improve the convergence behaviour.

This is done using Optim.jl: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/master/src/ConstantOptimization.jl
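The knobs for it are exposed through Options; roughly like this (the keyword names here are from memory and may differ between versions, so treat it as a sketch):

options = Options(;
    binary_operators=(+, *, -, /),
    optimizer_algorithm="BFGS",  # Optim.jl method used on the constants
    optimize_probability=0.14,   # chance a population member's constants get optimized
    optimizer_iterations=8,      # Optim.jl iterations per optimization call
)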

Intermediate structural simplification. I often saw terms like (x + x) or (x * 2.072...) + x, which is then penalised for complexity which it does not actually have.

These tend to go away over time as it prioritizes simpler equations to game the loss function. You can play with the parsimony parameter to encourage this more or less. Again, you might just be seeing this due to the bug I mentioned above.

I did try including SymbolicUtils.jl for simplification at one point but it ended up slowing the search down by quite a bit, compared to just letting the evolutionary algorithm go to town on expressions.

Termination criterion: always hard with those heuristic gradient-free methods, but could help. Or just some measure of improvement over the iterations, to be able to manually stop when there was no further progress for many iterations.

Right now there are a few stopping criteria:

  1. you can provide max_evals to terminate after a certain number of evaluations (this is more to test algorithmic efficiency than anything else),
  2. you can provide timeout_in_seconds to terminate after a certain amount of time has passed,
  3. you can set a maximum number of generations of equations (using niterations),
  4. you can pass earlyStopCondition a function like f(loss, complexity) = (loss < 1e-6) && (complexity < 10), which stops as soon as the condition is hit on an equation, and
  5. you can manually stop it by hitting q<enter> during the search once you are satisfied.
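Put together, that looks roughly like this (a sketch; early_stop_condition is the newer spelling of earlyStopCondition, so adjust for your version):

stop_early(loss, complexity) = (loss < 1e-6) && (complexity < 10)

options = Options(;
    binary_operators=(+, *),
    unary_operators=(sin, exp),
    max_evals=1_000_000,             # (1) cap on total equation evaluations
    timeout_in_seconds=3600,         # (2) wall-clock limit
    early_stop_condition=stop_early, # (4) stop as soon as this returns true
)

# (3) is the generation budget, passed to the search itself; (5) is just q<enter>:
# EquationSearch(X, y; options=options, niterations=100)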

Convergence-based criteria are really tricky for evolution-based algorithms. They don’t really converge at all; if you let them keep evolving, eventually they may find a new branch of the evolutionary tree, and essentially start an entirely new search from that point. Just consider the evolution of living species as an example - will evolution really ever converge to a global optimum?

2 Likes

Here’s what I use to test. It’s just some arbitrary function of one variable I came up with. The shape (non-monotonic, high order required for a polynomial to represent it) and also the structure of the equation are typical for many real-world applications I have.

f_act(x) = exp(0.4 + 3x + sin(1.5*π*x))
y_noise = 0.15
N_sample = 100

x_sample = rand(N_sample)
y_sample = f_act.(x_sample) + y_noise*randn(N_sample)

[Plot: true function and noisy samples]

I updated my project and now used SymbolicRegression 0.13.2. Still, it seems to have a hard time finding the correct solution if I set (+, *) as binary and (sin, exp) as unary operators. For the implemented algorithm, is a larger population or more iterations more promising? How many overall evaluations (population × iterations) should it take for such a function to be found?

Concerning the evolution of living species: they can increase complexity arbitrarily; there’s no extra penalty for this. If the added complexity serves as a resilience or redundancy feature, it even makes the species more robust (as compared to the usual case of a more complex system being more fragile). Further, evolution in biology has to adapt to a constantly changing environment, which is not the case with symbolic regression. Once the correct underlying relation is found, evolution cannot improve any further. So there will always be convergence, either to some local minimum (and yes, in this case it might happen that, by luck, a better branch is found after many unsuccessful iterations), or to the global one.

1 Like

Sorry for jumping in but I’m really interested in this, so I’m wondering about the solution that SymbolicRegression.jl did find: Did it fit the data as well and what was the resulting equation?

It did find the exact equation, although somewhat “hidden”, i.e. not fully simplified:

exp((sin(x * 4.7209109567059935) + 0.39632173439843266) + ((x1 * 2.011697653059265) + x1))

What do you mean by:

Did it fit the data as well […]?

I don’t think there’s any means to explicitly account for measurement noise, is there? @MilesCranmer

I don’t think there’s any means to explicitly account for measurement noise, is there?

Generally you don’t need to worry much about noise for symbolic regression, as it is not flexible enough to overfit to outlier points. However, you could use Options(; loss=L1DistLoss()) if you want to be more robust against outliers (since L1 finds the median function of the data, whereas the default L2 finds the mean function)

Also, for measurement noise that is different at different points (though it is not the case here), you could either:

  1. Define a custom loss function as the log-likelihood of some noise model (see the sketch after this list), or
  2. Use EquationSearch(X, y; weights=1 ./ sigma .^2, options=options), where sigma is the (Gaussian) noise at each point.
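For option 1, here’s a hedged sketch of a Gaussian negative log-likelihood as the custom loss, assuming the three-argument loss form that is used when weights are passed (the weight carries 1/\sigma^2 per point); double-check the loss signature in the docs for your version:

# Negative log-likelihood of a Gaussian with known per-point variance (up to a
# constant); inv_var = 1/sigma^2 arrives through the weights argument.
gaussian_nll(prediction, target, inv_var) =
    0.5 * (inv_var * (prediction - target)^2 - log(inv_var))

options = Options(;
    binary_operators=(+, *, -, /),
    loss=gaussian_nll,
)
# EquationSearch(X, y; weights=1 ./ sigma .^ 2, options=options)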
2 Likes

Thanks for the details! I played around with the hyperparams a bit and was able to find the expression exactly with the following settings. I’m not sure if all of them were necessary, and also not sure if I just got a lucky seed, but I usually do things like this to get functions in a specific form.

options = Options(;
    unary_operators=(sin, exp),
    binary_operators=(+, *), 
    turbo=true, 
    nested_constraints=[sin => [sin => 0], exp => [exp => 0]], 
    maxsize=30, 
    npopulations=40, 
    parsimony=0.01,
)
EquationSearch(
    x_sample', 
    y_sample; 
    options=options, 
    parallelism=:multithreading, 
    niterations=10000
)

Let’s walk through these:

  • turbo=true: this uses LoopVectorization.jl for faster evaluation (added in v0.14).
  • nested_constraints=[sin => [sin => 0], exp => [exp => 0]]: this states that sin may never appear inside a sin, and exp may never appear inside an exp. This is a little bit of cheating, but I actually use constraints like this normally, since sin(sin(...)) seems super unusual. In other words, this is like a prior on what I think the expression should look like. It is surprising how many accurate expressions there are which are very nested, so you need to artificially punish them to keep things interpretable.
  • maxsize=30: while your true equation does not need this complexity, it can actually help a lot if you give the search “more space” to move around. It’s often easier for it to simplify from a complex and redundant expression than it is to miraculously construct the exact expression from simpler ones.
  • npopulations=40: this basically just does a “broader” search. Probably not necessary, though.
  • parsimony=0.01: this is honestly the parameter I have the least intuition for. When computing the fitness of an individual, this is the factor in front of the complexity penalty. It seems like you should play around with this parameter over a few smaller runs.
  • niterations=10000: I typically like to just let it run as long as it takes. Sometimes it will seem like it has converged, but then it randomly happens upon a different branch of expression space and starts exploring from scratch from that tree.

In practice, I typically take the deep learner’s approach and just throw like 800 cores at the search for a few days (see this example), and even if my hyperparameters are non-optimal, it eventually finds whatever complex expression I am looking for 🙂

1 Like

One other thing to try is the warmup_maxsize_by=0.5 parameter. This slowly increases the maxsize over the course of the search (here, at 50% of the way through the search, it will reach maxsize), which can prevent the search from falling into a bad local minimum early on.
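Added to the earlier options, that would look like this (a sketch):

options = Options(;
    unary_operators=(sin, exp),
    binary_operators=(+, *),
    maxsize=30,
    warmup_maxsize_by=0.5,  # ramp maxsize up over the first half of the search
)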

Great package. I am wondering if it is possible to use this package to search for basis functions and their coefficients for a fitting problem.

Say I have a function f(x; p), where p is a control parameter. I have a list of basis functions \{\phi_i\} and associated coefficients \{c_i\}, where \{c_i\} is independent of p. I want to find \{c_i\} to fit a range of p values, such that

f(x; p) = \sum_i c_i \phi_i(x; p).

For symbolic regression over a basis, SINDy is what you are looking for. Note that you can’t fit internal coefficients though; standard SINDy optimizes a sparse set of multiplicative coefficients in a sum over a basis.
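To make that concrete, here is a bare-bones sketch of the SINDy-style fit (sequentially thresholded least squares); DataDrivenDiffEq.jl provides the real, far more capable implementation:

using LinearAlgebra

# Θ: library matrix (n_samples × n_basis), columns are the basis functions
# ϕ_i(x; p) evaluated on the data; y: targets.
# Returns a sparse coefficient vector ξ with y ≈ Θ * ξ.
function stlsq(Θ, y; threshold=0.1, iters=10)
    ξ = Θ \ y
    for _ in 1:iters
        small = abs.(ξ) .< threshold
        ξ[small] .= 0
        active = .!small
        ξ[active] .= Θ[:, active] \ y
    end
    return ξ
end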

There’s a flexibility-speed tradeoff to keep in mind. SINDy is faster but not expressive in terms of what equations are possible; SymbolicRegression.jl is slower but very expressive.

If you do want something in the middle, you could probably hack your own evaluator scheme, as described in this comment.

1 Like

Thanks for your suggestions. They are really helpful!

1 Like

Awesome, thanks. With the constraints, the correct equation is quickly found. I do not think it’s “cheating”, but rather restricting the search to a more reasonable subspace. Quite cool that this is possible.

I really think I can use SymbolicRegression in the future, for “real” problems!

2 Likes