Hey @BLI,
Great questions. In approximately three weeks my thesis will be up, and it will include a lot of discussion of the internals of PySR/SymbolicRegression.jl. Since that is still under review, I'll just say a couple of things here for now. (Although if you DM me, I can send you the [draft] SymbolicRegression.jl chapter.)
The symbolic regression literature is not very standardized. People approach symbolic regression from many different fields – computer science, engineering, control, the natural sciences, etc. – and each field refers to the same concepts by different names. In fact, I wasn't even aware of the paper you mentioned until now (I should read it!). I would say the most well-known publication is Schmidt & Lipson, 2009 (https://science.sciencemag.org/content/324/5923/81), which is a must-read if you are interested in this area. A classic read on genetic algorithms is Goldberg & Deb, 1991 (https://www.sciencedirect.com/science/article/pii/B9780080506845500082), which was the first to formalize and generalize the tournament selection algorithm that SymbolicRegression.jl uses. The first to use genetic algorithms for symbolic regression was John Koza, 1994 ("Genetic programming as a means for programming computers by natural selection"). Another must-read paper is Brunton, Proctor, and Kutz, 2016 (https://www.pnas.org/content/113/15/3932), although this is a linear basis method rather than a genetic algorithm (significantly faster, but less flexible). And of course there is a lot of ongoing research as well.
For now, I can DM you the draft chapter, and also point you to this talk on Indico: "ML JC: Discovering Symbolic Models from Deep Learning with Inductive Biases" (23 January 2023), although it is very high-level and I don't describe the internals much.
Here are some specific answers:
> I assume that what Minnebo and Stijven refer to as operators is the union of `binary_operators` and `unary_operators`??
Again, I haven’t read their paper, and there isn’t really a standard terminology. But yes, that sounds correct to me.
> I assume that what Minnebo and Stijven refer to as terminals is the union of columns in X and a constant??, i.e., {1, x1, x2, …}??
Also sounds correct to me. I refer to those as “leaf nodes”, but again that is specific to the implementation in SymbolicRegression.jl. For example, a linear basis approach like SINDy doesn’t even have these concepts at all!
> When building up the tree of operators, is there a way to limit the length of the branches?
Yes, please see API · SymbolicRegression.jl – particularly the `maxsize`, `maxdepth`, `constraints`, and `nested_constraints` options. If I understand you correctly, `maxdepth` controls the depth (i.e., the branch length) of a tree.
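For example, here is a minimal sketch of setting those limits (keyword names follow the API page linked above; defaults and exact behaviour may differ between versions):

```julia
using SymbolicRegression

# Sketch: cap expression size and tree depth via the Options struct.
options = Options(
    binary_operators=[+, *, -, /],
    unary_operators=[cos, exp],
    maxsize=20,   # maximum number of nodes in an expression
    maxdepth=5,   # maximum depth of any branch of the tree
)
```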
> What is the meaning of `npopulations` in the options?
Also listed on the docs page. This is the number of populations of equations. Since symbolic regression uses tournament selection, this is the number of separate populations undergoing evolution. There is then “migration” between populations every iteration.
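As a small illustrative sketch (same `Options` keyword as on the docs page):

```julia
using SymbolicRegression

# Sketch: 32 separate populations evolve independently; individuals
# migrate between populations after each iteration.
options = Options(
    binary_operators=[+, *],
    unary_operators=[cos],
    npopulations=32,
)
```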
> When building the tree of operators, does the algorithm freely mutate on the used operators in each node? I.e., can it switch (e.g.) `*` to `+`, etc., in the various attempts? [I’m trying to understand what possible basis functions are generated…]
Yes. The mutations can be weighted by the `MutationWeights` struct, shown here: API · SymbolicRegression.jl. `mutate_operator` will resample a binary operator from the set `binary_operators`. But it cannot change binary operators into unary operators.
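A sketch of adjusting those weights (I am assuming here that `mutation_weights` is the corresponding `Options` keyword and that `mutate_operator` is a field of `MutationWeights`; check the API page for the exact fields in your installed version):

```julia
using SymbolicRegression

# Sketch: make operator-swapping mutations (e.g. * -> +) twice as likely
# relative to the default weighting of the other mutation types.
mutation_weights = MutationWeights(mutate_operator=2.0)

options = Options(
    binary_operators=[+, *, -, /],
    unary_operators=[cos],
    mutation_weights=mutation_weights,
)
```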
> How is the model from the tree of operators parameterized?
Equations are parameterized using the `Node` type: Types · DynamicExpressions.jl. The DynamicExpressions.jl README has some documentation on constructing them manually.
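Roughly following the README example (the exact constructors may differ between versions of DynamicExpressions.jl):

```julia
using DynamicExpressions

# Define the operator set; this also sets up the operator overloads used below.
operators = OperatorEnum(binary_operators=[+, -, *], unary_operators=[cos])

# Leaf nodes referring to feature columns of the data matrix:
x1 = Node{Float64}(feature=1)
x2 = Node{Float64}(feature=2)

# Trees can then be built with ordinary Julia syntax:
tree = x1 * cos(x2 - 3.2)

# Evaluate the expression tree over a 2 x 100 dataset:
X = randn(Float64, 2, 100)
y = tree(X, operators)
```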
> In the `EquationSearch` method, what does `niterations` refer to?
More `niterations` is like more “steps” in an optimization algorithm. Each iteration has `ncyclesperiteration * npopulations * population_size` total mutations. There’s a migration between each iteration. And `niterations` is the total number of iterations.
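Putting it together, a minimal search call might look like the following (the toy data and the specific values are just for illustration):

```julia
using SymbolicRegression

# Toy dataset: 5 features, 100 samples.
X = randn(Float32, 5, 100)
y = 2 .* cos.(X[4, :]) .+ X[1, :] .^ 2

options = Options(
    binary_operators=[+, *, -, /],
    unary_operators=[cos],
    npopulations=20,
)

# niterations = number of rounds of evolution (plus migration) over all populations.
hall_of_fame = EquationSearch(X, y; niterations=40, options=options)
```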
> In the Quickstart example, the data is without noise. If the data have noise, will the method do something equivalent to balancing complexity to give good validation of the model? Is this what the Pareto-front does?
Yes, the best equation at each complexity is returned. So if you think the high-complexity equation has overfit to the noise, then you could take a low-complexity solution instead.
However, in general, noise isn’t as much of an issue for symbolic regression compared to other ML algorithms, because the search space is very small. Note that you can also pass `weights` if you want to weight data according to a per-datapoint signal-to-noise ratio. Your loss function also determines the sensitivity to noise; e.g., `L1DistLoss()` is less sensitive to anomalies than `L2DistLoss()`.
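For example (I am assuming here that the elementwise loss is set via the `loss` keyword of `Options` and that `weights` is a keyword of `EquationSearch`; the exact keyword for the loss has changed between versions, so check the docs):

```julia
using SymbolicRegression

# Sketch: use an L1 loss (less sensitive to anomalies than L2) together
# with per-datapoint weights, e.g. an inverse variance for each measurement.
options = Options(
    binary_operators=[+, *, -, /],
    unary_operators=[cos],
    loss=L1DistLoss(),
)

X = randn(Float32, 5, 100)
y = 2 .* cos.(X[4, :]) .+ 0.1f0 .* randn(Float32, 100)
w = fill(1.0f0, 100)  # hypothetical per-datapoint weights

hall_of_fame = EquationSearch(X, y; weights=w, niterations=40, options=options)
```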
Cheers,
Miles
(Thanks for the cc Chris!)