Why isn’t Symbolic Regression used more?

Tarny_GG_Channie · June 25, 2024, 6:18pm

Thanks to its ability to dynamically compile expressions and execute them efficiently, Julia positions itself very strongly in the symbolic regression landscape.

Symbolic regression offers a way to discover formulas that are fast at inference time, resistant to overfitting, and very interpretable.

However, the power of symbolic regression is yet to be fully explored, for example…

For OCR, you can symbolically regress the way to preprocess the data with simple filters and operations in order to make the OCR work correctly.
For time series data, you can symbolically regress in SSA (static single assignment) form and keep some variables as the state for the next time step.
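
To make the time-series idea concrete, here is a minimal sketch in plain Julia of what one SSA-form candidate with carried state could look like. The function names (`candidate_step`, `run_series`) and the coefficients are hypothetical, chosen only for illustration; a real search would evolve the assignments themselves rather than have them written by hand.

```julia
# Hypothetical SSA-form candidate: every temporary is assigned exactly once.
# `x` is the current input, `s` is the state carried over from the previous step.
function candidate_step(x::Float64, s::Float64)
    t1 = 0.9 * s    # decay the previous state
    t2 = 0.1 * x    # weight the new observation
    t3 = t1 + t2    # new state (an exponential moving average)
    y = x - t3      # output: deviation of the input from the smoothed state
    return y, t3
end

# Run the candidate over a whole series, threading the state through time.
function run_series(xs::Vector{Float64}; s0::Float64=0.0)
    ys = similar(xs)
    s = s0
    for (i, x) in enumerate(xs)
        ys[i], s = candidate_step(x, s)
    end
    return ys, s
end

xs = [1.0, 1.0, 1.0, 1.0]
ys, final_state = run_series(xs)
```

The point of the sketch is only that the state `t3` produced at one step is fed back in as `s` at the next, so the regressed formula can express recurrent behavior.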

And yes, I do have a symbolic regression code for my private project, but it’s private and very ad-hoc, so it’s not suitable to be posted here.

I wonder why symbolic regression in its various forms isn’t used more. Am I missing something?

Krastanov · June 25, 2024, 8:20pm

This project seems healthy and well used: GitHub - MilesCranmer/SymbolicRegression.jl: Distributed High-Performance Symbolic Regression in Julia

Tarny_GG_Channie · June 26, 2024, 12:05am

Sure, that covers basic symbolic regression, but there are still recurrent symbolic regression tricks or combining it with image filters for image preprocessing and so on.

cstjean · June 26, 2024, 12:56am

It sounds appealing, but I find it a bit hard to conceptualize the space of functions that we’re sampling from. What are the biggest success stories of symbolic regression?

MilesCranmer · June 26, 2024, 2:28am

Good questions and I agree with everything you said, SR should be used more!

The package that my group works on, SymbolicRegression.jl, is built to be hackable and modular so you can plug it into other libraries. On the package’s forum there are a variety of custom loss functions that people have built: MilesCranmer/PySR · Discussions · GitHub. One of my favorite examples is rediscovering the recursive relation for the Mandelbrot set from data (i.e., members of the Mandelbrot set): Can PySR handle recursive functions? · MilesCranmer/PySR · Discussion #540 · GitHub. Other examples include optimising an expression for a particular asymptotic behavior: Constraining asymptotic behavior · MilesCranmer/PySR · Discussion #324 · GitHub which uses Richardson.jl in the loss function, or symbolically “solving” integrals by optimizing the Zygote.jl-computed derivatives of an expression: Symbolic integrals with SymbolicRegression.jl · MilesCranmer/PySR · Discussion #401 · GitHub. You can even use it for finding fast numerical approximations to specific functions: How can I re-discover the Fast Inverse Square Root function? · MilesCranmer/PySR · Discussion #469 · GitHub. Or designing fast FPGA programs for doing anomaly detection at the LHC [2305.04099] Symbolic Regression on FPGAs for Fast Machine Learning Inference — using Julia’s fixed-point number libraries to simulate the FPGA numerics! (Instead of “complexity”, for that project one optimizes for number of FPGA cycles)

There is even an ophthalmologist on the forum who is trying to better predict surgery outcomes with it: Possible more efficient search strategy · MilesCranmer/PySR · Discussion #577 · GitHub

In addition I have a very incomplete list of user-submitted papers using SymbolicRegression.jl here: Research - PySR

Regarding extending SymbolicRegression.jl to other non-numeric data types, I’m really interested in this and would be happy to chat about joining efforts. The backend has some support for this: GitHub - SymbolicML/DynamicExpressions.jl: Ridiculously fast symbolic expressions although it’s not yet fast. One of my students at Cambridge, George-Cristian Ardeleanu, is working on making faster versions for scalar-vector-matrix SR algorithms, which he just made a PR for today: Updated OperatorEnum to use any data type (not just Numbers) by gca30 · Pull Request #85 · SymbolicML/DynamicExpressions.jl · GitHub. By the end of the summer I think this and other extensions will be production ready which should be exciting. I think it will be possible to evolve algorithms for entire imaging pipelines which will be super cool

ChrisRackauckas · June 26, 2024, 2:47am

I wouldn’t downplay how big of an effect this really is. Being able to easily choose between compilation and interpretation of expressions is pretty critical for many symbolic-numeric methods, and I would put a lot of modern symbolic regression in the symbolic-numeric space since there are numerical pieces like parameter optimization mixed into the symbolic search. In one of the earlier AI phases, people tried to build specialized and dedicated hardware for Lisp solely for this purpose. Nowadays, Julia handles the JIT and the interpretation, and so it’s a pretty good time to be working on such algorithms as you’re no longer in charge of maintaining your own JIT and compiler optimizations.
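
As a rough illustration of that choice, the sketch below evaluates the same expression tree two ways using only Base Julia: once with a small hand-written tree-walking interpreter, and once by handing the tree to Julia's compiler via `eval`. The `interpret` function and the `ALLOWED_OPS` whitelist are hypothetical helpers written for this example; this is not how any particular symbolic regression package does it.

```julia
# A candidate expression as a plain Julia `Expr`, with `x` as the only variable.
candidate = :(2 * cos(x) + x^2)

# Operators the toy interpreter is willing to look up in Base.
const ALLOWED_OPS = (:+, :*, :^, :cos)

# Option 1: interpret the tree by walking it recursively (no compilation step).
interpret(ex::Number, x) = ex
interpret(ex::Symbol, x) = ex === :x ? x : error("unknown variable $ex")
function interpret(ex::Expr, x)
    ex.head === :call || error("unsupported expression head $(ex.head)")
    op = ex.args[1]
    op in ALLOWED_OPS || error("unsupported operator $op")
    f = getfield(Base, op)
    args = [interpret(arg, x) for arg in ex.args[2:end]]
    return f(args...)
end

# Option 2: let Julia JIT-compile the same tree into a native function.
compiled = eval(:(x -> $candidate))

x0 = 1.5
interpreted_value = interpret(candidate, x0)
compiled_value = compiled(x0)
```

Interpretation has no compile latency, which matters when millions of short-lived candidates are each evaluated a few times; compilation pays off when one expression is evaluated many times.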

I do think symbolic regression will see a lot more usage in the near future. But the key is likely not in isolation but as a process connected to other learning tools, mixing with machine learning processes and mixing with things like acausal modeling.

dmolina · June 26, 2024, 7:30pm

I completely agree with using Symbolic Regression more.

Actually, I am working on a research project with one of the bigger companies in my country (Spain). We apply SR to build an interpretable model for their industrial interests.
We tried other software first (mainly because they wanted Python as the language), but in the end I managed to convince them to use PySR. The results and performance are great, and we have designed an improved methodology that combines expert knowledge with SR to improve the results (we will write a research paper about that).

In my opinion, SR has several strong advantages:

Interpretability, which is crucial for scientific topics and even for engineering problems.
It does not require much data (we tried other ML techniques, and they required a lot more). In our case we have plenty of data, but SR works nicely with a lot less.
Simplicity: it can obtain good results easily, and if you have prior information about the equations, you can use it.

I think the main reason it is not used more is a lack of knowledge about the topic. Actually, although I am an expert in optimization with Metaheuristics/Evolutionary Algorithms and I had read Genetic Programming works, until recently I did not see its advantages.

PySR has many advantages over other options:

Performance: we re-evaluated several solutions against additional error measures, and the performance was a lot better than with other libraries like SymPy.
With PySR you can combine it with Python (I personally would be happy working with DataFrames and SymbolicRegression, but I am the last one with Julia knowledge in the project).
Flexibility: you can set the complexity of operations and constants. In our case, this allows us to use a reduced number of constants by increasing their penalty. It is even more flexible thanks to custom operations and loss functions. It is even better than other flexible Python libraries (gplearn).
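
To show what penalizing constants means in practice, here is a small sketch in plain Julia that scores an expression tree with per-node complexity weights. The `complexity` function and its default weights are hypothetical and written only for this illustration; they are not the implementation used by PySR or SymbolicRegression.jl.

```julia
# Hypothetical weights: constants cost 3, variables cost 1,
# `exp` costs 2, and every other operator costs 1.
function complexity(ex; constant_cost=3, variable_cost=1,
                    operator_costs=Dict(:exp => 2))
    if ex isa Number
        return constant_cost
    elseif ex isa Symbol
        return variable_cost
    elseif ex isa Expr && ex.head === :call
        op_cost = get(operator_costs, ex.args[1], 1)
        child_cost = sum(
            arg -> complexity(arg; constant_cost, variable_cost, operator_costs),
            ex.args[2:end],
        )
        return op_cost + child_cost
    else
        error("unsupported expression: $ex")
    end
end

with_constants = :(2.5 * x1 + 1.3)
without_constants = :(x1 * x2 + x1)

penalized = complexity(with_constants)                    # constants cost 3 each
unpenalized = complexity(with_constants; constant_cost=1) # constants cost 1 each
baseline = complexity(without_constants)
```

Both trees have five nodes, so with a constant cost of 1 they tie; raising the constant cost makes the search prefer the constant-free form unless the constants buy a real reduction in error.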

Your information about the future of the library seems promising, @MilesCranmer. By the way, it would be great to be able to get the population between runs in PySR (you can pause and then continue from the previous state). In a previous version we could recover it and even update it, but recent internal changes make that impossible. Of course, we were using undocumented implementation details, but it would be nice for applying custom optimizations (local search mechanisms, constraints, …). My PhD student (and myself) would be grateful for that. We could also help with it.

dmolina · June 26, 2024, 7:36pm

I think you are right, we are using SR in conjunction with ML and the possibilities increase.

MilesCranmer · June 26, 2024, 9:18pm

Happy to hear it! And regarding this –

Please submit a bug report! It should have gotten easier after the PythonCall refactor so I am surprised to hear the opposite.

I’m very interested in this too. I’m gradually making SymbolicRegression.jl and DynamicExpressions.jl more and more generic. Eventually I’d like it to be possible to overload any part of the pipeline with custom behavior. The new Expression interface should make it possible to do a lot of cool stuff: BREAKING: Change expression types to `DynamicExpressions.Expression` (from `DynamicExpressions.Node`) by MilesCranmer · Pull Request #326 · MilesCranmer/SymbolicRegression.jl · GitHub

Like learning parametric expressions/basis functions:

using SymbolicRegression
using Random: MersenneTwister
using Zygote
using MLJBase: machine, fit!, predict

rng = MersenneTwister(0)
X = NamedTuple{(:x1, :x2, :x3, :x4, :x5)}(ntuple(_ -> randn(rng, Float32, 30), Val(5)))
X = (; X..., classes=rand(rng, 1:2, 30))
p1 = rand(rng, Float32, 2)
p2 = rand(rng, Float32, 2)

y = [
    2 * cos(X.x4[i] + p1[X.classes[i]]) + X.x1[i]^2 - p2[X.classes[i]] for
    i in eachindex(X.classes)
]

model = SRRegressor(;
    niterations=10,
    binary_operators=[+, *, /, -],
    unary_operators=[cos, exp],
    populations=10,
    expression_type=ParametricExpression,  # Subtype of `AbstractExpression`
    expression_options=(; max_parameters=2),
    autodiff_backend=:Zygote,
    parallelism=:multithreading,
)

mach = machine(model, X, y)
fit!(mach)
ypred = predict(mach, X)

so it basically learns $y = 2.0 \cos(x_4 + \alpha) + x_1^2 - \beta$ for parameters $\alpha$ and $\beta$. These can be different according to the classes parameter – here there are two classes/types of behavior. This is different from the usual global constants, like 2.0 here.

This ParametricExpression is just a single implementation of AbstractExpression but you can see how you can do pretty custom things now.

dmolina · June 27, 2024, 6:07pm

Please submit a bug report! It should have gotten easier after the PythonCall refactor so I am surprised to hear the opposite.

Actually, it is not a problem with PythonCall; it is an internal change inside PySR. I will submit a suggestion as an issue in the GitHub repository.

In a few days, I will write to you with ideas on making SymbolicRegression more and more generic. One useful idea is to be able to recover the population through a public API. In a previous version of PySR I could access it, but in the recent version, with the julia_state_stream_ variable, it is not possible anymore. I will ask my PhD student to write example code as a GitHub issue to give you a clear idea.

Thank you again for your work.

MilesCranmer · June 27, 2024, 6:36pm

Thanks, I will look forward to the report.

For what it’s worth that variable is defined here: PySR/pysr/sr.py at 1327c581648adc0adad185ffd30fa9a864c63819 · MilesCranmer/PySR · GitHub. The stream_ is just a numpy array of uint8 produced by the Julia serialization (so that Python pickle can store Julia objects). But the .julia_state_ should work fine, I even have some unit tests on it.
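
For anyone curious what a serialized byte stream looks like on the Julia side, here is a minimal sketch using the `Serialization` standard library. The `state` named tuple is a made-up stand-in, not the actual search state object, and this is not PySR's own code; it only illustrates the round trip from a Julia object to a `Vector{UInt8}` and back.

```julia
using Serialization

# Made-up stand-in for a search state (the real object is different).
state = (generation=10, best_loss=0.125, equations=["x1 * x1", "cos(x4)"])

# Serialize into an in-memory buffer and take the raw bytes.
io = IOBuffer()
serialize(io, state)
bytes = take!(io)  # a Vector{UInt8}, which is what a uint8 array can hold

# Restore the object from the bytes.
restored = deserialize(IOBuffer(bytes))
```

Because the bytes are opaque to Python, anything that wants to inspect or modify the population has to deserialize them back into Julia objects first.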
