Why isn’t Symbolic Regression used more?

Thanks to its ability to dynamically compile expressions and execute them efficiently, Julia is very strongly positioned in the symbolic regression landscape.

Symbolic regression offers a way to discover formulas that are fast at inference time, resistant to overfitting, and very interpretable.

However, the power of symbolic regression is yet to be fully explored, for example…

  • For OCR, you can symbolically regress the way to preprocess the data with simple filters and operations in order to make the OCR work correctly.
  • For time series data, you can symbolically regress in SSA (static single assignment) form and keep some variables as the state for the next time step.
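
To make the time-series idea concrete, here is a minimal sketch in plain Julia of what a candidate program in SSA form with carried state could look like. The `Instr` type, the `run_step` and `run_series` functions, and the convention of picking certain registers as the next state are all hypothetical, invented for this illustration; a real search would generate and mutate the instruction list rather than hard-code it.

```julia
# Hypothetical SSA-style program: each instruction writes one fresh register.
struct Instr
    op::Function   # binary operator, e.g. + or *
    a::Int         # index of the first operand register
    b::Int         # index of the second operand register
end

# Registers start as [inputs; state]. Each instruction appends a new register,
# so no register is ever overwritten (the SSA property).
function run_step(prog::Vector{Instr}, inputs::Vector{Float64},
                  state::Vector{Float64}, state_regs::Vector{Int})
    regs = vcat(inputs, state)
    for ins in prog
        push!(regs, ins.op(regs[ins.a], regs[ins.b]))
    end
    output = regs[end]              # the last register is the prediction
    new_state = regs[state_regs]    # the selected registers become the next state
    return output, new_state
end

# Example program: moving average s' = 0.5 * x + 0.5 * s
# Registers: r1 = x, r2 = 0.5 (constant input), r3 = s (state)
prog = [
    Instr(*, 1, 2),   # r4 = x * 0.5
    Instr(*, 3, 2),   # r5 = s * 0.5
    Instr(+, 4, 5),   # r6 = r4 + r5
]

function run_series(prog, xs)
    state = [0.0]
    outputs = Float64[]
    for x in xs
        y, state = run_step(prog, [x, 0.5], state, [6])
        push!(outputs, y)
    end
    return outputs
end

ema = run_series(prog, [1.0, 1.0, 1.0])
```

Here register 6 is both the output and the state carried into the next time step, which is what lets a fixed-length program express a recurrence.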

And yes, I do have symbolic regression code for a private project, but it’s private and very ad hoc, so it’s not suitable to be posted here.

I wonder why symbolic regression in its various forms isn’t used more. Am I missing something?

4 Likes

This project seems healthy and actively used: GitHub - MilesCranmer/SymbolicRegression.jl: Distributed High-Performance Symbolic Regression in Julia

5 Likes

Sure, that covers basic symbolic regression, but there are still tricks such as recurrent symbolic regression, or combining it with image filters for image preprocessing, and so on.

It sounds appealing, but I find it a bit hard to conceptualize the space of functions that we’re sampling from. What are the biggest success stories of symbolic regression?

Good questions and I agree with everything you said, SR should be used more! :slight_smile:

The package that my group works on, SymbolicRegression.jl, is built to be hackable and modular so you can plug it into other libraries. On the package’s forum there are a variety of custom loss functions that people have built: MilesCranmer/PySR · Discussions · GitHub. One of my favorite examples is rediscovering the recursive relation for the Mandelbrot set from data (i.e., members of the Mandelbrot set): Can PySR handle recursive functions? · MilesCranmer/PySR · Discussion #540 · GitHub. Other examples include optimising an expression for a particular asymptotic behavior: Constraining asymptotic behavior · MilesCranmer/PySR · Discussion #324 · GitHub which uses Richardson.jl in the loss function, or symbolically “solving” integrals by optimizing the Zygote.jl-computed derivatives of an expression: Symbolic integrals with SymbolicRegression.jl · MilesCranmer/PySR · Discussion #401 · GitHub. You can even use it for finding fast numerical approximations to specific functions: How can I re-discover the Fast Inverse Square Root function? · MilesCranmer/PySR · Discussion #469 · GitHub. Or designing fast FPGA programs for doing anomaly detection at the LHC [2305.04099] Symbolic Regression on FPGAs for Fast Machine Learning Inference — using Julia’s fixed-point number libraries to simulate the FPGA numerics! (Instead of “complexity”, for that project one optimizes for number of FPGA cycles)
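
For reference, the target of the fast inverse square root problem mentioned above can be written in a few lines of Julia. This is the well-known bit trick followed by one Newton step, written out by hand; it is not output from PySR, and the function name is only for illustration.

```julia
# Classic fast inverse square root for Float32 (bit trick + one Newton step).
function fast_inv_sqrt(x::Float32)
    i = reinterpret(UInt32, x)
    i = 0x5f3759df - (i >> 1)                # magic-constant initial guess
    y = reinterpret(Float32, i)
    return y * (1.5f0 - 0.5f0 * x * y * y)   # one Newton-Raphson refinement
end

approx = fast_inv_sqrt(4.0f0)
exact = 1 / sqrt(4.0f0)
rel_err = abs(approx - exact) / exact
```

The search problem is then to recover an expression of this shape, mixing integer and floating-point operators, from input/output pairs alone.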

There is even an ophthalmologist on the forum who is trying to better predict surgery outcomes with it: Possible more efficient search strategy · MilesCranmer/PySR · Discussion #577 · GitHub

In addition I have a very incomplete list of user-submitted papers using SymbolicRegression.jl here: Research - PySR

Regarding extending SymbolicRegression.jl to other non-numeric data types, I’m really interested in this and would be happy to chat about joining efforts. The backend has some support for this: GitHub - SymbolicML/DynamicExpressions.jl: Ridiculously fast symbolic expressions although it’s not yet fast. One of my students at Cambridge, George-Cristian Ardeleanu, is working on making faster versions for scalar-vector-matrix SR algorithms, which he just made a PR for today: Updated OperatorEnum to use any data type (not just Numbers) by gca30 · Pull Request #85 · SymbolicML/DynamicExpressions.jl · GitHub. By the end of the summer I think this and other extensions will be production ready which should be exciting. I think it will be possible to evolve algorithms for entire imaging pipelines which will be super cool :smiley:

19 Likes

I wouldn’t downplay how big of an effect this really is. Being able to easily choose between compilation and interpretation of expressions is pretty critical for many symbolic-numeric methods, and I would put a lot of modern symbolic regression in the symbolic-numeric space since there are numerical pieces like parameter optimization mixed into the symbolic search. In one of the earlier AI phases, people tried to build specialized and dedicated hardware for Lisp solely for this purpose. Nowadays, Julia handles the JIT and the interpretation, and so it’s a pretty good time to be working on such algorithms as you’re no longer in charge of maintaining your own JIT and compiler optimizations :sweat_smile:.
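
As a rough illustration of that choice, the sketch below evaluates the same expression in two ways using only Base Julia: a small recursive interpreter over an `Expr` tree, and a function compiled from the same tree with `eval`. The helper names `interpret` and `compile_expr` are made up for this example, and this is not how SymbolicRegression.jl evaluates expressions internally.

```julia
# Interpret an expression tree by walking it recursively.
function interpret(ex, env::Dict{Symbol,Float64})
    ex isa Number && return Float64(ex)
    ex isa Symbol && return env[ex]
    # Only plain calls such as f(args...) are handled; operators are looked up in Base.
    ex.head === :call || error("unsupported expression: $ex")
    f = getfield(Base, ex.args[1])
    return f((interpret(a, env) for a in ex.args[2:end])...)
end

# Compile the same tree into an anonymous function; the JIT specializes it on first call.
compile_expr(ex, vars::Vector{Symbol}) = eval(:( ($(vars...),) -> $ex ))

ex = :(2.0 * cos(x) + y^2)

interp_val = interpret(ex, Dict(:x => 0.0, :y => 3.0))

f = compile_expr(ex, [:x, :y])
# invokelatest avoids world-age issues when calling a freshly eval'ed function.
compiled_val = Base.invokelatest(f, 0.0, 3.0)
```

Interpretation has no compile latency, which suits evaluating many short-lived candidates, while compilation pays off once a single expression is evaluated many times.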

I do think symbolic regression will see a lot more usage in the near future. But the key is likely not in isolation but as a process connected to other learning tools, mixing with machine learning processes and mixing with things like acausal modeling.

4 Likes

I completely agree with using Symbolic Regression more :slight_smile:.

Actually, I am working on a research project with one of the bigger companies in my country (Spain). We apply SR to build an interpretable model for their industrial interests.
We tried other software first (mainly because they wanted Python as the language), but in the end I convinced them to use PySR. The results and performance are great, and we have designed an improved methodology that combines expert knowledge with SR to improve the results (we will write a research paper about that).

In my opinion, SR has several strong advantages:

  • The interpretability, crucial for scientific topics, and even engineering problems.
  • It does not require much data (we tried other ML techniques, and they required a lot more). In our case we have plenty of data, but SR works nicely with far less.
  • The simplicity: it can obtain good results easily, and if you have prior information about the equations, you can use it.

I think the main reason it is not used more is a lack of knowledge about the topic. Actually, although I am an expert in optimization with Metaheuristics/Evolutionary Algorithms and had read Genetic Programming works, until recently I did not see its advantages.

PySR has many advantages over other options:

  • The performance: we re-evaluated several solutions against additional error measures, and the performance was a lot better than with other libraries like SymPy.
  • With PySR you can combine it with Python (I personally would be happy working with DataFrames and SymbolicRegression, but I am the only one with Julia knowledge in the project).
  • Flexibility: you can set the complexity of operators and constants. In our case, this allows us to reduce the number of constants by increasing their penalty. It is even more flexible with custom operators and loss functions, and it is better than other flexible Python libraries (gplearn).
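
To illustrate what such complexity weighting means, here is a small sketch in plain Julia that scores an expression tree with per-operator and per-constant costs. The `complexity` function and its cost table are hypothetical and are not PySR's implementation; they only show how raising the cost of constants changes the score that the search trades off against accuracy.

```julia
# Hypothetical cost model: every node adds its cost to the total complexity.
function complexity(ex; op_cost=Dict{Symbol,Int}(), const_cost=1, var_cost=1)
    ex isa Number && return const_cost
    ex isa Symbol && return var_cost
    op = ex.args[1]
    # Operators cost 1 unless overridden in op_cost.
    return get(op_cost, op, 1) +
           sum(complexity(a; op_cost, const_cost, var_cost) for a in ex.args[2:end])
end

ex = :(2.5 * exp(x) + 1.0)

default_score = complexity(ex)
# Penalize constants and exp more heavily to discourage them during the search.
penalized_score = complexity(ex; op_cost=Dict(:exp => 3), const_cost=2)
```

With the default costs this expression scores 6; with the heavier penalties it scores 10, so an equally accurate expression with fewer constants would be preferred.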

Your information about the future of the library sounds promising, @MilesCranmer. By the way, it would be great to be able to get the population between runs in PySR (so that you can pause and then continue from the previous state). In a previous version we could recover it and even update it, but recent internal changes make that impossible. Of course, we were using undocumented implementation details :innocent:, but it would be nice for applying custom optimizations (local search mechanisms, constraints, …). My PhD student and I would be grateful for that, and we could also help with it.

5 Likes

I think you are right, we are using SR in conjunction with ML and the possibilities increase.

1 Like

Happy to hear it! And regarding this –

Please submit a bug report! It should have gotten easier after the PythonCall refactor so I am surprised to hear the opposite.

I’m very interested in this too. I’m gradually making SymbolicRegression.jl and DynamicExpressions.jl more and more generic. Eventually I’d like it to be possible to overload any part of the pipeline with custom behavior. The new Expression interface should make it possible to do a lot of cool stuff: BREAKING: Change expression types to `DynamicExpressions.Expression` (from `DynamicExpressions.Node`) by MilesCranmer · Pull Request #326 · MilesCranmer/SymbolicRegression.jl · GitHub

Like learning parametric expressions/basis functions:

```julia
using SymbolicRegression
using Random: MersenneTwister
using Zygote
using MLJBase: machine, fit!, predict

rng = MersenneTwister(0)
X = NamedTuple{(:x1, :x2, :x3, :x4, :x5)}(ntuple(_ -> randn(rng, Float32, 30), Val(5)))
X = (; X..., classes=rand(rng, 1:2, 30))
p1 = rand(rng, Float32, 2)
p2 = rand(rng, Float32, 2)

y = [
    2 * cos(X.x4[i] + p1[X.classes[i]]) + X.x1[i]^2 - p2[X.classes[i]] for
    i in eachindex(X.classes)
]

model = SRRegressor(;
    niterations=10,
    binary_operators=[+, *, /, -],
    unary_operators=[cos, exp],
    populations=10,
    expression_type=ParametricExpression,  # Subtype of `AbstractExpression`
    expression_options=(; max_parameters=2),
    autodiff_backend=:Zygote,
    parallelism=:multithreading,
)

mach = machine(model, X, y)
fit!(mach)
ypred = predict(mach, X)
```

So it basically learns $y = 2.0 \cos(x_4 + \alpha) + x_1^2 - \beta$ for parameters $\alpha$ and $\beta$. These can differ according to the classes parameter; here there are two classes/types of behavior. This is different from the usual global constants, like 2.0 here.
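
For clarity, the learned functional form can be written out by hand in plain Julia. This is only a sketch of what such a parametric expression computes, with made-up parameter values and a hypothetical function name; it is not the `ParametricExpression` implementation.

```julia
# Per-class parameters: entry k holds the value for class k.
alpha = [0.1, 0.7]   # made-up values for illustration
beta  = [0.3, 0.9]

# Same functional form for every row; only alpha and beta depend on the class.
parametric_form(x1, x4, class) = 2.0 * cos(x4 + alpha[class]) + x1^2 - beta[class]

x1 = [1.0, 1.0]
x4 = [0.0, 0.0]
classes = [1, 2]

# Identical features, different classes, so the two predictions differ.
y = parametric_form.(x1, x4, classes)
```

The constant 2.0 is shared by every row, whereas `alpha` and `beta` are looked up per row from the class label.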

This ParametricExpression is just a single implementation of AbstractExpression but you can see how you can do pretty custom things now.

2 Likes

Please submit a bug report! It should have gotten easier after the PythonCall refactor so I am surprised to hear the opposite.

Actually, it is not a problem with PythonCall; it is an internal change inside PySR. I will submit a suggestion as an issue in the GitHub repository.

In a few days, I will write to you with ideas for making SymbolicRegression more and more generic. One useful idea is to be able to recover the population through a public API. In previous versions of PySR I could access it, but in recent versions, with the julia_state_stream_ variable, that is no longer possible. I will ask my PhD student to write example code in a GitHub issue to give you a clear idea.

Thank you again for your work.

Thanks, I will look forward to the report.

For what it’s worth, that variable is defined here: PySR/pysr/sr.py at 1327c581648adc0adad185ffd30fa9a864c63819 · MilesCranmer/PySR · GitHub. The stream_ is just a numpy array of uint8 produced by the Julia serialization (so that Python pickle can store Julia objects). But the .julia_state_ should work fine; I even have some unit tests on it.
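
The mechanism can be illustrated on the Julia side with the `Serialization` standard library: any Julia object can be written to an in-memory buffer, which yields a `Vector{UInt8}` that another system can store as opaque bytes. This sketch only shows the idea; the `state` tuple is a stand-in for the real search state, and this is not the code PySR uses.

```julia
using Serialization

# Stand-in for a search state; the real object would be much larger.
state = (populations=[[1.0, 2.0], [3.0]], iteration=7)

# Serialize to an in-memory buffer and take the raw bytes.
io = IOBuffer()
serialize(io, state)
bytes = take!(io)            # Vector{UInt8}, storable as opaque data

# Later: rebuild the Julia object from the stored bytes.
restored = deserialize(IOBuffer(bytes))
```

Because the bytes are opaque to Python, inspecting or modifying the population requires deserializing back into a Julia object first.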