Thanks to its ability to dynamically compile expression and execute it efficiently, Julia positions itself very strongly in the symbolic regression landscape.
Symbolic regression offers a way to discover formulas with fast inference time, is resistant to overfitting, and is very interpretable.
However, the power of symbolic regression is yet to be fully explored, for example…
For OCR, you can symbolically regress the way to preprocess the data with simple filters and operations in order to make the OCR work correctly.
For time series data, you can symbolically regress in the SSA (single static assignment) form and keep some variables as the state for the next time step.
And yes, I do have a symbolic regression code for my private project, but it’s private and very ad-hoc, so it’s not suitable to be posted here.
I wonder why isn’t symbolic regression in various forms used more. Am I missing something?
Sure, that covers basic symbolic regression, but there are still recurrent symbolic regression tricks or combining it with image filters for image preprocessing and so on.
It sounds appealing, but I find it a bit hard to conceptualize the space of functions that we’re sampling from. What are the biggest success stories of symbolic regression?
I wouldn’t downplay how big of an effect this really is. Being able to easily choose between compilation and interpretation of expressions is pretty critical for many symbolic-numeric methods, and I would put a lot of modern symbolic regression in the symbolic-numeric space since there are numerical pieces like parameter optimization mixed into the symbolic search. In one of the earlier AI phases, people tried to build specialized and dedicated hardware for Lisp solely for this purpose. Nowadays, Julia handles the JIT and the interpretation, and so it’s a pretty good time to be working on such algorithms as you’re no longer in charge of maintaining your own JIT and compiler optimizations .
I do think symbolic regression will see a lot more usage in the near future. But the key is likely not in isolation but as a process connected to other learning tools, mixing with machine learning processes and mixing with things like acausal modeling.
I completely agree with using Symbolic Regression more .
Actually, I am working in a research project with one of the bigger companies in my country (Spain). We apply SR for an interpretable model for their industrial interests.
We try with other software (mainly because they wanted Python as language), but finally I could convince them to use PySR. The results and performance are great, and we have designed an improved methodology combining expert knowledge with SR to improve the results (we will write a research paper about that).
In my opinion, SR has several strong advantages:
The interpretability, crucial for scientific topics, and even engineering problems.
They require not many data (we tackle other ML techniques, and they required a lot more data). In our case, we have many data, but the SR works nicely with a lot less data.
The simplicity, it can obtain good results easily, and in the case you have information about the equations, you can use that.
I think it is mainly the lack of knowledge about that topic. Actually, although I am an expert in optimization with Metaheuristics/Evolutionary Algorithms and I had read Genetic Programming works, until recently I did not see their advantages.
The option of PySR have many advantages over other options:
The performance, we re-evaluate several solutions for more error measures, and the performance was a lot better than using other libraries like SymPy.
With PySR you can combine with Python (I personally will be happy working with DataFrames and SymbolicRegression, but I am the last one with julia knowledment in the project).
Flexible, you can select the complexity of operations, and constants. In our case, this allows us to use a reduced number constants increasing their penality. Even more flexible, custom operations and loss functions. Even it is better than other flexible libraries in Python (gplearn).
Your information about the future of the library seems promising, @MilesCranmer. By the way, could be great to allow us getting the population between runs in PySR (you can pause and then continue from previous state). In a previous version we could recover it and even update it, but recent internal changes make that impossible. Of course, we were using undocumented implementation details , but it could be nice for applying custom optimizations (local search mechanisms, constraints, …). My Phd student (and myself) will be grateful in that case. We could also help with that.
Like learning parametric expressions/basis functions:
using SymbolicRegression
using Random: MersenneTwister
using Zygote
using MLJBase: machine, fit!, predict
rng = MersenneTwister(0)
X = NamedTuple{(:x1, :x2, :x3, :x4, :x5)}(ntuple(_ -> randn(rng, Float32, 30), Val(5)))
X = (; X..., classes=rand(rng, 1:2, 30))
p1 = rand(rng, Float32, 2)
p2 = rand(rng, Float32, 2)
y = [
2 * cos(X.x4[i] + p1[X.classes[i]]) + X.x1[i]^2 - p2[X.classes[i]] for
i in eachindex(X.classes)
]
model = SRRegressor(;
niterations=10,
binary_operators=[+, *, /, -],
unary_operators=[cos, exp],
populations=10,
expression_type=ParametricExpression, # Subtype of `AbstractExpression`
expression_options=(; max_parameters=2),
autodiff_backend=:Zygote,
parallelism=:multithreading,
)
mach = machine(model, X, y)
fit!(mach)
ypred = predict(mach, X)
so it basically learns y= 2.0 \cos(x_4 + \alpha) + x_1^2 - \beta for \alpha and \beta parameters. These can be different according to the classes parameter – here there are two classes/types of behavior. Which is different from the usual global constants, like 2.0 here.
This ParametricExpression is just a single implementation of AbstractExpression but you can see how you can do pretty custom things now.
Please submit a bug report! It should have gotten easier after the PythonCall refactor so I am surprised to hear the opposite.
Actually, it is not a problem with PythonCall, it is an internal change inside PySR, I will submit a suggestion in the github repository as Issue.
In a few days, I will write you with ideas to open that openness in doing SymbolicRegression more and more generic. One useful idea is to be able to recover the population through a public API, in previous version of PySR I could access, but in recent version, with the julia_state_stream_ variable, it is not possible anymore. I will ask my Phd student to write an example code as a Github Issue to give you a clear idea.