# Understanding Symbolic Regression?

I’m trying to understand how to use the SymbolicRegression package in Julia. I’ve read through the thesis by Minnebo and Stijven from 2011 (referenced in Wikipedia), where they talk about “operators” vs. “terminals”.

In the Quickstart to SymbolicRegression, the following snippet occurs:

``````using SymbolicRegression

X = randn(Float32, 5, 100)
y = 2 * cos.(X[4, :]) + X[1, :] .^ 2 .- 2

options = SymbolicRegression.Options(
binary_operators=[+, *, /, -],
unary_operators=[cos, exp],
npopulations=20
)

hall_of_fame = EquationSearch(
X, y, niterations=40, options=options,
)
``````
• I assume that what Minnebo and Stijven refer to as operators is the union of `binary_operators` and `unary_operators`??
• I assume that what Minnebo and Stijvens refer to as terminals is the union of columns in `X` and a constant??, i.e., \{1, x_1,x_2,\ldots\} ??
• When building up the tree of operators, is there a way to limit the length of the branches?
• What is the meaning of `npopulations` in the options?
• When building the tree of operators, does the algorithm freely mutate on the used operators in each node? I.e., can it switch (e.g.) `*` to `+`, etc., in the various attempts? [I’m trying to understand what possible basis functions are generated…]
• How is the model from the tree of operators parameterized?
• In the `EquationSearch` method, what does `niterations` refer to?
• In the Quickstart example, the data is without noise. If the data have noise, will the method do something equivalent to balancing complexity to give good validation of the model? Is this what the Pareto-front does?

If anyone is aware of a tutorial intro to `SymbolicRegression`, please let me know :-).

Hey @BLI,

Great questions. In approximately three weeks my thesis will be up and that will include a lot of discussion about the internals of PySR/SymbolicRegress. Since that is still under review, right now I will say a couple of things. (Although if you DM me, I can send you the [draft] SymbolicRegression.jl chapter)

The symbolic regression literature is not very standardized. People approach symbolic regression from many different fields – computer science, engineering, control, natural sciences, etc – and each refer to each of these concepts in symbolic regression by many different names. In fact I wasn’t even aware of that paper you mentioned until now (I should read it!). I would say that the most well-known publication is Schmidt & Lipson, 2009 (https://science.sciencemag.org/content/324/5923/81) – which is a must-read if you have interest in this area. A classic read on genetic algorithms is Goldberg & Deb, 1991 (https://www.sciencedirect.com/science/article/pii/B9780080506845500082) – which was the first to generalize a formalized tournament selection algorithm, which is what SymbolicRegression.jl uses. The first to use genetic algorithms for symbolic regression was John Koza, 1994 (Genetic programming as a means for programming computers by natural selection | SpringerLink). Another must-read paper is Brunton, Proctor, and Kutz, 2016 (https://www.pnas.org/content/113/15/3932), although this is a linear basis method rather than a genetic algorithm (significantly faster, but less flexible). Then of course there is a lot of ongoing research as well.

For right now, I can DM you the draft chapter, and also point you to this talk: ML JC: Discovering Symbolic Models from Deep Learning with Inductive Biases (23 January 2023) · Indico, although it is very high-level and I don’t describe the internals much.

I assume that what Minnebo and Stijven refer to as operators is the union of `binary_operators` and `unary_operators`??

Again, I haven’t read their paper, and there isn’t really a standard terminology. But yes that sounds correct to me.

I assume that what Minnebo and Stijvens refer to as terminals is the union of columns in `X` and a constant??, i.e., {1,x1,x2,…} ??

Also sounds correct to me. I refer those as “leaf nodes” but again that is specific to the implementation in SymbolicRegression. For example, linear basis approach like SINDy doesn’t even have these concepts at all!

When building up the tree of operators, is there a way to limit the length of the branches?

Yes, please see API · SymbolicRegression.jl – particularly the `maxsize`, `maxdepth`, `constraints`, and `nested_constraints` options. If I understand you correctly, `maxdepth` controls the length of a tree.

What is the meaning of `npopulations` in the options?

Also listed on the docs page. This is the number of populations of equations. Since symbolic regression uses tournament selection, this is the number of separate populations undergoing evolution. There is then “migration” between populations every iteration.

When building the tree of operators, does the algorithm freely mutate on the used operators in each node? I.e., can it switch (e.g.) `*` to `+`, etc., in the various attempts? [I’m trying to understand what possible basis functions are generated…]

Yes. The mutations can be weighted by the `MutationWeights` struct, shown here: API · SymbolicRegression.jl. `mutate_operator` will resample a binary operator from the set `binary_operators`. But it cannot change binary operators into unary operators.

How is the model from the tree of operators parameterized?

Equations are parameterized using the `Node` type: Types · DynamicExpressions.jl. On the `DynamicExpressions.jl` README there is some docs on manually constructing them.

In the `EquationSearch` method, what does `niterations` refer to?

More `niterations` is like more “steps” in an optimization algorithm.

Each iteration has `ncyclesperiteration * npopulations * population_size` total mutations. There’s a migration between each iteration. And `niterations` is the total number of iterations.

In the Quickstart example, the data is without noise. If the data have noise, will the method do something equivalent to balancing complexity to give good validation of the model? Is this what the Pareto-front does?

Yes, the best equation at each complexity is returned. So if you think the high-complexity equation has overfit to the noise, then you could take a low-complexity solution instead.

However, in general, noise isn’t as much of an issue for symbolic regression compared to other ML algorithms because the search space is very small. Note that you can also pass `weights` if you want to weight data according to a per-datapoint signal-to-noise. Your loss function also determines sensitivity to noise. e.g., `L1DistLoss()` is less sensitive to anomalies than `L2DistLoss()`.

Cheers,
Miles

(Thanks for the cc Chris!)

8 Likes

Just for posterity, the methods paper for PySR & SymbolicRegression.jl how now been put online:

2 Likes