[ANN] SymbolicRegression.jl - distributed symbolic regression

Excited to announce SymbolicRegression.jl, a high-performance package for learning equations via regularized evolution! It supports distributed computing, allows user-defined operators (including discontinuous ones), and can export to SymbolicUtils.jl.

Here’s a demo:


You can install it with Pkg.add("SymbolicRegression")

Check out the repo and documentation. There’s also a Python frontend: PySR.

The package is built using lightweight binary tree that allow for quick allocation of new equations, and pre-compiled memory-efficient evaluation of arbitrary equations with degree=1,2 operators. It uses a combination of regularized evolution, simulated annealing, and gradient-free optimization (via Optim.jl). Importantly there are no gradients used in the package, which allows one to work with discontinuous functions such as logical operators, and operators with singularities in general.

SymbolicRegression.jl is built for regression problems - you want to find an analytic function f:\mathbb{R}^m\rightarrow\mathbb{R} such that f(X_{i, :}) \approx y_i \forall i for an input X and y. For finding differential equations via gradient descent, you should check out SINDy which is implemented in DataDrivenDiffEq.jl.

Symbolic regression via evolution like in SymbolicRegression.jl, in general, scales quite badly with the number of input features—this is the drawback of avoiding gradients. To get around this you can either estimate feature importances and only include the most important, or, to work with a very large number of input features (such as an N-Body dataset), you can try this method: [2006.11287] Discovering Symbolic Models from Deep Learning with Inductive Biases, which has code here (blogpost here). Essentially you first fit a neural network to your data with a sparsity regularization, and then fit equations to the trained sparse network. This also allows one to apply other (differentiable) constraints to a learned model - e.g., one can learn Lagrangians by training a Lagrangian Neural Network, and then fitting equations to the Lagrangian term in the learned model.

Thanks to @patrick-kidger and @ChrisRackauckas for helping make the code more generic and easy-to-use, @shashi @mason and @yingboma for building SymbolicUtils.jl (used for internal simplification and export), and @marius311 @Skoffer @Henrique_Becker for advice on this forum which led to big improvements in speed!

Would love to hear any feedback, tips, ideas, etc.
Cheers!
Miles

32 Likes

Hi @MilesCranmer and thanks for this package!

  1. what kinds of cases would I prefer this over traditional supervised ML methods such as Gradient Boosting?
  2. Is it possible to include function composition as a binary operator?
    binary_operators=(+, *, /, -,∘) doesn’t work
1 Like

Whoa…I didn’t even know about symbolic regression. I can’t wait to try this out!

4 Likes

Thanks!

Re: @Albert_Zevelev, Great questions!

1

Pros: (i) easy to interpret the trained model and relate to existing theory, (ii) compact, (iii) fast evaluation, (iv) potentially significantly better generalization (see page 9) for physical systems, maybe due to analytic functions being a good basis for representing physics (look up any physics 101 cheat sheet—it’s entirely analytic expressions!). Cons: (i) slow to train, (ii) scales very badly with respect to number of input features, (iii) can be very difficult to train and tune, (iv) underdeveloped field.

(interestingly, modern ML has nearly the opposite pros/cons! I guess you can take your pick depending on the problem. Or you can mix them as done in that paper.)

2

The only operators allowed (at least for now) are f:\mathbb{R}^n \rightarrow \mathbb{R} for n\in \{1, 2\}. This rigidity allows better performance since I can pre-compile the evaluation of arbitrary trees, and fuse operators together to allocate fewer temporary arrays. You could compose some operators beforehand if you have prior knowledge that some particular composition will be good for your domain, but unfortunately I don’t have it set up to do that automatically yet.

But this is a really interesting idea! It seems like allowing function composition would allow for repeating patterns in an equation without the search needing to find the terms separately… I will think about how this could be implemented.

Cheers,
Miles

3 Likes

That’s cool. Some other (similar) interpretation (I think). Keep in mind this might sound like a lot of jibberish but it’s just based on my intuition. It depends on data <-> data <-> data <-> data (bell’s inequalities, I think best left as a probabilistic interpretation (but depending on the ambient program you could guess the discrete df and give it stochastic support.). things can normalize to 1, but then you reach some curse of dimensionality problems and you almost have to picture a (workgroup of fpus and mmus are "communicating to each other*. In practice, some things are commutative and some things anti-commute*, you can easily form algebras with those.
https://theworld.com/~reinhold/bellsinequalities.html
The Geneva Convention On The Treatment of Object Aliasing ← you are an x. if you generalize for example a k-cfa algo like julia’s multimethod arch, how is y effected by x. how were one to decide on a data flow to communicate and prevent any data race problems (I’m thinking from many different per spectives, but primarily from a macro/micro task optimization of recurrent stuff on a modular cross platform web platform - e,g https://servo.org/ (max code automation, min time with a bunch of match statements in rust and cargo hauling json group ‘blobs’ and differentiation ‘svg’/‘css’ ‘state’ as a driver peturbs it relative to deno/v8) each level I’m trying to jump between a bunch of ‘test cases’ that implement something similar ti v8 without giving it too much ‘free’ cross-entropy but implement cross platform cross language transpilation so that it can run on new hardware like WebXR but try to band limit digital = ‘perfect finite matching timeslices of some fourier-like number’ perspective )

Also I think Tropical Geometry (think about higher order than hessians, but quicker shortest paths to change from dynamic state to dynamic state. it will get you in the right ballpark if you have to backtrack. what invariances are broken?) and Morse Theory (if you had to measure a bunch of information and a bunch of other communication how are they related? usually enumerative quantification is important because there will almost always be some discretization problems) help in practice improve things on modern systems (in cloud, but the addresses probably need to be sanitized relative to other similar things that differentiate between states).

Electrodynamics/high energy physics perspective (maybe better from a quantum information science perspective):
I like the interpretation of “coupled electrodynamics” because it seemed to ‘explain’ transistor/integrated circuit miniaturization since about 1959, as well as a lot of the superconductor stuff that turned out to be too brittle. It turns out if you decouple the idea of the electric field and the magnetic field, in a post moores law era you might want to start thinking of other types of storing ‘communication flow’ besides elections (I don’t even know what the name of the netherian current but maybe graphene? wild stab in the dark. how about a more integrated (ensembles of) models where statistical physics models apply? how is error tracked matched in what type of invariance management? maybe in the future we’d need to talk in the language of bose-einstein condensates, some of the quantum fractional hall effects seem like they are more “wild” “exotic” states of matter than you think if you look at new archs). For example, in theory (but keep in mind different companies are chasing these)

Computer Engineering (the physics of computer science or computer architecture) Perspective:
keep this in mind, speculative execution was/is great to parallelize a bunch of distributed work in some pipeline, but even late '90s were trying to figure out from the hardware side, do we run both ways of a joint/marginal ‘branch’ and execute both ways (speculative and ‘differentiate’ them. what would be the witnesses of the differentiation and integration? how coupled and dense is the compute? how quick is it and how does it vary with the voltage and heat. These days, there is some software that is trying to simulate similar things I think SIR (swift IR) and MLIR (seem to be designed in mind to exploit. imho there are deep connections I see in group theory, galois theory, etc):

  • keep in mind some of these are good and bad (it depends on the design and what else you’re doing). I remember ‘accidently’ encountering this in the late 90s with maximum likelyhood (squarefree) inference on even older ‘archs’:
    Meltdown (security vulnerability) - Wikipedia

@sashmit, it is not by improving an absurdity that we prove a certain intelligence: it is by suppressing it (Jean de Lattre de Tassigny)

Wow! That’s really cool. Coming from a more classic stats background, symbolic regression (at least conceptually) seems intuitive. Anyone have any recommended reading on theory? I would absolutely love to try this package out.