[ANN] SymbolicRegression.jl 1.0.0 - Distributed High-Performance Symbolic Regression in Julia

SymbolicRegression.jl

SymbolicRegression.jl is a Julia library for discovering symbolic models to solve optimization problems, especially for regression. It is highly optimized and permits multi-node distributed searches. The library is now very modular and extensible, supporting expressions with custom operators, and regressing on custom types. See the docs for some examples: Home · SymbolicRegression.jl.

It’s hard to believe it’s been nearly 4 years since my original post, so I wanted to give an update to mark the 1.0.0 release! As a summary of the most recent changes, I made this video:

Here’s a bit of a quickstart on using SymbolicRegression. I’ll show the release notes at the end.

The easiest way to use SymbolicRegression.jl is with MLJ.jl. Here’s an example:

import SymbolicRegression: SRRegressor
import MLJ: machine, fit!, predict, report

# Dataset with two named features:
X = (a = rand(500), b = rand(500))

# and one target:
y = @. 2 * cos(X.a * 23.5) - X.b ^ 2

# with some noise:
y = y .+ randn(500) .* 1e-3

model = SRRegressor(
    niterations=50,
    binary_operators=[+, -, *],
    unary_operators=[cos],
)

Now, let’s create and train this model on our data:

mach = machine(model, X, y)

fit!(mach)

Let’s look at the expressions discovered:

report(mach)

Finally, we can make predictions with the expressions on new data:

predict(mach, X)

This will make predictions using the expression selected by model.selection_method, which by default balances accuracy and complexity.

You can override this selection and select an equation from the Pareto front manually with:

predict(mach, (data=X, idx=2))

where here we choose to evaluate the second equation.
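For example, to re-select the equation that the default criterion picked, you can pass the best_idx field from the report (a small sketch):

r = report(mach)
predict(mach, (data=X, idx=r.best_idx))  # should match the default selection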

For fitting multiple outputs, one can use MultitargetSRRegressor (and pass an array of indices to idx in predict for selecting specific equations). For a full list of options available to each regressor, see the API page.
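As a minimal sketch of the multi-output case (reusing the data above, with a made-up second target y2):

import SymbolicRegression: MultitargetSRRegressor
import MLJ: machine, fit!, predict

y1 = @. 2 * cos(X.a * 23.5) - X.b ^ 2
y2 = @. X.a ^ 2 + X.b  # hypothetical second target
Y = (; y1, y2)

model = MultitargetSRRegressor(
    niterations=50,
    binary_operators=[+, -, *],
    unary_operators=[cos],
)
mach = machine(model, X, Y)
fit!(mach)

# Select, e.g., the 2nd equation for y1 and the 3rd for y2:
predict(mach, (data=X, idx=[2, 3]))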

1.0.0 Release Notes

Now, in 1.0.0 there are a few major changes compared with the initial release. I give an overview of these here, and then add more detail below.

  1. Changed the core expression type from Node{T} → Expression{T,Node{T},Metadata{...}}
    • This gives us new features, improves user hackability, and greatly improves ergonomics!
  2. Created “Template Expressions”, for fitting expressions under a user-specified functional form (TemplateExpression <: AbstractExpression)
    • Template expressions are quite flexible: they are a meta-expression that wraps multiple other expressions, and combines them using a user-specified function.
    • This enables vector expressions - in other words, you can learn multiple components of a vector, simultaneously, with a single expression! Or more generally, you can learn expressions onto any Julia struct.
      • (Note that this still does not permit learning using non-scalar operators, though we are working on that!)
    • Template expressions also make use of colored strings to represent each part in the printout, to improve readability.
  3. Created “Parametric Expressions”, for custom functional forms with per-class parameters: (ParametricExpression <: AbstractExpression)
    • This lets you fit expressions that act as models of multiple systems, with per-system parameters!
  4. Introduced a variety of new abstractions for user extensibility (and to support new research on symbolic regression!)
    • AbstractExpression, for increased flexibility in custom expression types.
    • mutate! and AbstractMutationWeights, for user-defined mutation operators.
    • AbstractSearchState, for holding custom metadata during searches.
    • AbstractOptions and AbstractRuntimeOptions, for customizing pretty much everything else in the library via multiple dispatch. Please make an issue/PR if you would like any particular internal functions to be declared public to enable stability across versions for your tool.
    • Many of these were motivated to modularize the implementation of LaSR, an LLM-guided version of SymbolicRegression.jl, so it can sit as a modular layer on top of SymbolicRegression.jl.
  5. Added TensorBoardLogger.jl and other logging integrations via SRLogger
  6. Support for Zygote.jl and Enzyme.jl within the constant optimizer, specified using the autodiff_backend option
  7. Other changes:
    • Fundamental improvements to the underlying evolutionary algorithm
      • New mutation operators introduced, swap_operands and rotate_tree – both of which seem to help kick the evolution out of local optima.
      • New hyperparameter defaults created, based on a Pareto front volume calculation, rather than simply accuracy of the best expression.
    • Changed output file handling
    • Major refactoring of the codebase to improve readability and modularity
    • Identified and fixed a major internal bug involving unexpected aliasing produced by the crossover operator
      • Segmentation faults caused by this are a likely culprit for some crashes reported during multi-day multi-node searches.
      • Introduced a new test for aliasing throughout the entire search state to prevent this from happening again.
    • Improved progress bar and StyledStrings integration.
    • Julia 1.10 is now the minimum supported Julia version.
    • Other small features
    • Also see the “Update Guide” below for more details on upgrading.
    • New URL: https://ai.damtp.cam.ac.uk/symbolicregression

Note that some of these features were recently introduced in patch releases since they were backwards compatible. I am noting them here for visibility.

1. Changed the core expression type from Node{T} → Expression{T,Node{T},...}

This is a breaking change in the format of expressions returned by SymbolicRegression. Now, instead of returning a Node{T}, SymbolicRegression will return an Expression{T,Node{T},...} (both from equation_search and from report(mach).equations). This type is much more convenient and high-level than the Node type, as it includes metadata relevant to the expression, such as the operators and variable names.

This means you can reliably do things like:

using SymbolicRegression: Options, Expression, Node

options = Options(binary_operators=[+, -, *, /], unary_operators=[cos, exp, sin])
operators = options.operators
variable_names = ["x1", "x2", "x3"]
x1, x2, x3 = [Expression(Node(Float64; feature=i); operators, variable_names) for i=1:3]

## Use the operators directly!
tree = cos(x1 - 3.2 * x2) - x1 * x1

You can then do operations with this tree, without needing to track operators:

println(tree)  # Looks up the right operators based on internal metadata

X = randn(3, 100)

tree(X)  # Call directly!
tree'(X)  # gradients of expression

Each time you apply an operator to one or two Expressions whose operator lists include it, the library looks up the corresponding operator index, creates the correct Node, and returns a new Expression.

You can access the tree with get_tree (guaranteed to return a Node), or get_contents – which returns the full info of an AbstractExpression, which might contain multiple expressions (which get stitched together when calling get_tree).
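For example, continuing from the tree built above (a small sketch):

using SymbolicRegression: get_tree, get_contents, get_metadata

get_tree(tree)                     # the underlying Node{Float64} binary tree
get_contents(tree)                 # for a plain Expression, also the stored tree
get_metadata(tree).variable_names  # ["x1", "x2", "x3"]
get_metadata(tree).operators       # the OperatorEnum used for operator lookups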

2. Created “Template Expressions”, for fitting expressions under a user-specified functional form (TemplateExpression <: AbstractExpression)

Template Expressions allow users to define symbolic expressions with a fixed structure, combining multiple sub-expressions under user-specified constraints. This is particularly useful for symbolic regression tasks where domain-specific knowledge or constraints must be imposed on the model’s structure.

This also lets you fit vector expressions using SymbolicRegression.jl, where vector components can also be shared!

A TemplateExpression is constructed by specifying:

  • A named tuple of sub-expressions (e.g., (; f=x1 - x2 * x2, g=1.5 * x3)).
  • A structure function that defines how these sub-expressions are combined in different contexts.

For example, you can create a TemplateExpression that enforces the constraint: sin(f(x1, x2)) + g(x3)^2 - where we evolve f and g simultaneously.

To do this, we first describe the structure using a TemplateStructure, which takes a single closure that maps a named tuple of ComposableExpression objects and a tuple of features to the output:

using SymbolicRegression

structure = TemplateStructure{(:f, :g)}(
  ((; f, g), (x1, x2, x3)) -> sin(f(x1, x2)) + g(x3)^2
)

This defines how the TemplateExpression should be evaluated numerically on a given input.

The number of arguments allowed by each expression object is inferred using this closure, though it can also be passed explicitly with the num_features kwarg.
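For example, a hedged sketch of passing the counts explicitly (assuming num_features takes a named tuple of argument counts):

my_structure = ((; f, g), (x1, x2, x3)) -> sin(f(x1, x2)) + g(x3)^2
structure = TemplateStructure{(:f, :g)}(my_structure; num_features=(; f=2, g=1))  # f takes 2 arguments, g takes 1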

operators = Options(binary_operators=(+, -, *, /)).operators
variable_names = ["x1", "x2", "x3"]
x1 = ComposableExpression(Node{Float64}(; feature=1); operators, variable_names)
x2 = ComposableExpression(Node{Float64}(; feature=2); operators, variable_names)
x3 = ComposableExpression(Node{Float64}(; feature=3); operators, variable_names)

Note that using x1 here refers to the relative argument to the expression. So the node with feature equal to 1 will reference the first argument, regardless of what it is.

st_expr = TemplateExpression(
    (; f=x1 - x2 * x2, g=1.5 * x1);
    structure,
    operators,
    variable_names
) # Prints as: f = #1 - (#2 * #2); g = 1.5 * #1

# Evaluation combines evaluation of `f` and `g`, and combines them
# with the structure function:
st_expr([0.0; 1.0; 2.0;;])

This also works with hierarchical expressions! For example,

structure = TemplateStructure{(:f, :g)}(
  ((; f, g), (x1, x2, x3)) -> f(x1, g(x2), x3^2) - g(x3)
)

this is a valid structure!

We can also use this TemplateExpression in SymbolicRegression.jl searches!

For example, say that we want to fit vector expressions:

using SymbolicRegression
using MLJBase: machine, fit!, report

We first define our structure. This also has our variable mapping, which says we are fitting f(x1, x2), g1(x3), and g2(x3):

function my_structure((; f, g1, g2), (x1, x2, x3))
    _f = f(x1, x2)
    _g1 = g1(x3)
    _g2 = g2(x3)

    # We use `.x` to get the underlying vector
    out = map((fi, g1i, g2i) -> (fi + g1i, fi + g2i), _f.x, _g1.x, _g2.x)
    # And `.valid` to check whether the evaluations are valid
    return ValidVector(out, _f.valid && _g1.valid && _g2.valid)
end
structure = TemplateStructure{(:f, :g1, :g2)}(my_structure)

Now, our dataset is a regular 2D array of inputs for X. But our y is actually a vector of 2-tuples!

X = rand(100, 3) .* 10

y = [
    (sin(X[i, 1]) + X[i, 3]^2, sin(X[i, 1]) + X[i, 3])
    for i in axes(X, 1)
]

Now, since this is a vector-valued expression, we need to specify a custom elementwise_loss function:

elementwise_loss = ((x1, x2), (y1, y2)) -> (y1 - x1)^2 + (y2 - x2)^2

This reduces each element of y (a 2-tuple) and the corresponding 2-tuple predicted by the structure function to a scalar loss.

Our regressor is then:

model = SRRegressor(;
    binary_operators=(+, *),
    unary_operators=(sin,),
    maxsize=15,
    elementwise_loss=elementwise_loss,
    expression_type=TemplateExpression,
    # Note - this is where we pass custom options to the expression type:
    expression_options=(; structure),
)

mach = machine(model, X, y)
fit!(mach)

Let’s see the performance of the model:

report(mach)

We can also check that the expression is split up correctly:

r = report(mach)
idx = r.best_idx
best_expr = r.equations[idx]
best_f = get_contents(best_expr).f
best_g1 = get_contents(best_expr).g1
best_g2 = get_contents(best_expr).g2

3. Created “Parametric Expressions”, for custom functional forms with per-class parameters: (ParametricExpression <: AbstractExpression)

Parametric Expressions are another example of an AbstractExpression with more features than a normal Expression. This type allows SymbolicRegression.jl to fit a parametric functional form, rather than an expression with fixed constants. This is particularly useful when modeling multiple systems or categories where each may have unique parameters but share a common functional form and certain constants.

A parametric expression is constructed with a tree represented as a ParametricNode <: AbstractExpressionNode – this is an alternative type to the usual Node type which includes extra fields: is_parameter::Bool, and parameter::UInt16. A ParametricExpression wraps this type and stores the actual parameter matrix (under .metadata.parameters) as well as the parameter names (under .metadata.parameter_names).

Various internal functions have been overloaded for custom behavior when fitting parametric expressions. For example, mutate_constant will perturb a row of the parameter matrix, rather than a single parameter.

When fitting a ParametricExpression, the expression_options parameter in Options/SRRegressor should include a max_parameters keyword, which specifies the maximum number of separate parameters in the functional form.

Let's see an example of fitting a parametric expression:

using SymbolicRegression
using Random: MersenneTwister
using Zygote
using MLJBase: machine, fit!, predict, report

Let’s generate data from two classes of a model and try to recover it:

rng = MersenneTwister(0)
X = NamedTuple{(:x1, :x2, :x3, :x4, :x5)}(ntuple(_ -> randn(rng, Float32, 30), Val(5)))
X = (; X..., classes=rand(rng, 1:2, 30))  # Add class labels (1 or 2)

# Define per-class parameters
p1 = [0.0f0, 3.2f0]
p2 = [1.5f0, 0.5f0]

# Generate target variable y with class-specific parameters
y = [
    2 * cos(X.x4[i] + p1[X.classes[i]]) + X.x1[i]^2 - p2[X.classes[i]]
    for i in eachindex(X.classes)
]

When fitting a ParametricExpression, it tends to be more important to set up an autodiff_backend such as :Zygote or :Enzyme, as the default backend (finite differences) can be too slow for the high-dimensional parameter spaces.

model = SRRegressor(
    niterations=100,
    binary_operators=[+, *, /, -],
    unary_operators=[cos, exp],
    populations=30,
    expression_type=ParametricExpression,
    expression_options=(; max_parameters=2),  # Allow up to 2 parameters
    autodiff_backend=:Zygote,  # Use Zygote for automatic differentiation
    parallelism=:multithreading,
)

mach = machine(model, X, y)

fit!(mach)

The expressions are returned with the parameters:

r = report(mach);
best_expr = r.equations[r.best_idx]
@show best_expr
@show get_metadata(best_expr).parameters

4. Introduced a variety of new abstractions for user extensibility

v1 introduces several new abstract types to improve extensibility. These allow you to define custom behaviors by leveraging Julia’s multiple dispatch system.

Expression types: AbstractExpression: As explained above, SymbolicRegression now works on Expression rather than Node, by default. Actually, most internal functions in SymbolicRegression.jl are now defined on AbstractExpression, which allows overloading behavior. The expression type used can be modified with the expression_type and node_type options in Options.

  • expression_type: By default, this is Expression, which simply stores a binary tree (Node) as well as the variable_names::Vector{String} and operators::DynamicExpressions.OperatorEnum. See the implementation of TemplateExpression and ParametricExpression for examples of what needs to be overloaded.
  • node_type: By default, this will be DynamicExpressions.default_node_type(expression_type), which allows ParametricExpression to default to ParametricNode as the underlying node type.
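For example, a sketch of selecting a different expression type directly through Options (node_type is left at its default):

using SymbolicRegression

options = Options(;
    binary_operators=(+, -, *),
    unary_operators=(cos,),
    expression_type=ParametricExpression,
    expression_options=(; max_parameters=2),
    # node_type defaults to DynamicExpressions.default_node_type(ParametricExpression),
    # i.e., ParametricNode.
)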

Mutation types: mutate!(tree::N, member::P, ::Val{S}, mutation_weights::AbstractMutationWeights, options::AbstractOptions; kws...) where {N<:AbstractExpression,P<:PopMember,S}, where S is a symbol representing the type of mutation to perform (where the symbols are taken from the mutation_weights fields). This allows you to define new mutation types by subtyping AbstractMutationWeights and creating new mutate! methods (simply pass the mutation_weights instance to Options or SRRegressor).

Search states: AbstractSearchState: this is the abstract type for SearchState which stores the search process’s state (such as the populations and halls of fame). For advanced users, you may wish to overload this to store additional state details. (For example, LaSR stores some history of the search process to feed the language model.)

Global options and full customization: AbstractOptions and AbstractRuntimeOptions: Many functions throughout SymbolicRegression.jl take AbstractOptions as an input. The default assumed implementation is Options. However, you can subtype AbstractOptions to overload certain behavior.

For example, if we have new options that we want to add to Options:

Base.@kwdef struct MyNewOptions
    a::Float64 = 1.0
    b::Int = 3
end

we can create a combined options type that forwards properties to each corresponding type:

struct MyOptions{O<:SymbolicRegression.Options} <: SymbolicRegression.AbstractOptions
    new_options::MyNewOptions
    sr_options::O
end
const NEW_OPTIONS_KEYS = fieldnames(MyNewOptions)

# Constructor with both sets of parameters:
function MyOptions(; kws...)
    new_options_keys = filter(k -> k in NEW_OPTIONS_KEYS, keys(kws))
    new_options = MyNewOptions(; NamedTuple(new_options_keys .=> Tuple(kws[k] for k in new_options_keys))...)
    sr_options_keys = filter(k -> !(k in NEW_OPTIONS_KEYS), keys(kws))
    sr_options = SymbolicRegression.Options(; NamedTuple(sr_options_keys .=> Tuple(kws[k] for k in sr_options_keys))...)
    return MyOptions(new_options, sr_options)
end

# Make all `Options` available while also making `new_options` accessible
function Base.getproperty(options::MyOptions, k::Symbol)
    if k in NEW_OPTIONS_KEYS
        return getproperty(getfield(options, :new_options), k)
    else
        return getproperty(getfield(options, :sr_options), k)
    end
end

Base.propertynames(options::MyOptions) = (NEW_OPTIONS_KEYS..., fieldnames(SymbolicRegression.Options)...)

These new abstractions provide users with greater flexibility in defining the structure and behavior of expressions, nodes, and the search process itself. These are also of course used as the basis for alternate behavior such as ParametricExpression and TemplateExpression.

5. Added TensorBoardLogger.jl and other logging integrations via SRLogger

You can now track the progress of symbolic regression searches using TensorBoardLogger.jl, Wandb.jl, or other logging backends.

This is done by wrapping any AbstractLogger with the new SRLogger type, and passing it to the logger option in SRRegressor or equation_search:

using SymbolicRegression
using TensorBoardLogger

logger = SRLogger(
    TBLogger("logs/run"),
    log_interval=2,  # Log every 2 steps
)

model = SRRegressor(;
    binary_operators=[+, -, *],
    logger=logger,
)

The logger will track:

  • Loss curves over time at each complexity level
  • Population statistics (distribution of complexities)
  • Pareto frontier volume (can be used as an overall metric of search performance)
  • Full equations at each complexity level

This works with any logger that implements the Julia logging interface.

6. Support for Zygote.jl and Enzyme.jl within the constant optimizer, specified using the autodiff_backend option

Historically, SymbolicRegression has mostly relied on finite differences to estimate derivatives – which actually works well for small numbers of parameters. This is what Optim.jl selects unless you can provide it with gradients.

However, with the introduction of ParametricExpressions, full support for autodiff-within-Optim.jl was needed. v1 includes support for some parts of DifferentiationInterface.jl, allowing you to actually turn on various automatic differentiation backends when optimizing constants. For example, you can use

Options(
    autodiff_backend=:Zygote,
)

to use Zygote.jl for autodiff during BFGS optimization, or even

Options(
    autodiff_backend=:Enzyme,
)

for Enzyme.jl (though Enzyme support is highly experimental).

7. Other Changes

Changed output file handling

Instead of writing to a single file like hall_of_fame_<timestamp>.csv, outputs are now organized in a directory structure. Each run gets a unique ID (containing a timestamp and random string, e.g., 20240315_120000_x7k92p), and outputs are saved to outputs/<run_id>/. Currently, only hall_of_fame.csv (and hall_of_fame.csv.bak) is saved, with plans to add more logs and diagnostics to this folder in future releases.

The output directory can be customized via the output_directory option (defaults to ./outputs). A custom run ID can be specified via the new run_id parameter passed to equation_search (or SRRegressor).
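For example (a small sketch using the option names above):

model = SRRegressor(;
    binary_operators=[+, -, *],
    output_directory="my_outputs",  # defaults to "./outputs"
    run_id="experiment_1",          # defaults to <timestamp>_<random string>
)
# Results would then land in my_outputs/experiment_1/hall_of_fame.csv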

Other Small Features in v1.0.0

  • Support for per-variable complexity, via the complexity_of_variables option.
  • Option to force dimensionless constants when fitting with dimensional constraints, via the dimensionless_constants_only option. (Both options are sketched after this list.)
  • Default maxsize increased from 20 to 30.
  • Default niterations increased from 10 to 50, as many users seem to be unaware that the old default was small (and meant for testing), even in publications. I think 50 is still low, but it should be a better default for those who don’t tune it.
  • MLJ.fit!(mach) now records the number of iterations used, and, should mach.model.niterations be changed after the fit, the number of iterations passed to equation_search will be reduced accordingly.
  • Fundamental improvements to the underlying evolutionary algorithm
    • New mutation operators introduced, swap_operands and rotate_tree – both of which seem to help kick the evolution out of local optima.
    • New hyperparameter defaults created, based on a Pareto front volume calculation, rather than simply accuracy of the best expression.
  • Major refactoring of the codebase to improve readability and modularity
  • Identified and fixed a major internal bug involving unexpected aliasing produced by the crossover operator
    • Segmentation faults caused by this are a likely culprit for some crashes reported during multi-day multi-node searches.
    • Introduced a new test for aliasing throughout the entire search state to prevent this from happening again.
  • Improved progress bar and StyledStrings integration.
  • Julia 1.10 is now the minimum supported Julia version.
  • Also see the “Update Guide” below for more details on upgrading.
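As a sketch of the first two options in the list above (assuming complexity_of_variables accepts one value per input variable, and that dimensional constraints are set up separately):

options = Options(;
    binary_operators=(+, -, *, /),
    complexity_of_variables=[1, 1, 3],   # the third variable counts 3 towards maxsize
    dimensionless_constants_only=true,   # constants may not carry physical units
)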

Update Guide

Note that most code should work without changes! Only if you are interacting with the return types of equation_search or report(mach), or if you have modified any internals, should you need to make some changes.

Also note that the “hall of fame” CSV file is now stored in a directory structure, of the form outputs/<run_id>/hall_of_fame.csv. This is to accommodate additional log files without polluting the current working directory. Multi-output runs are now stored in the format .../hall_of_fame_output1.csv, rather than the old format hall_of_fame_{timestamp}.csv.out1.

So, the key change is, as discussed above, the switch from Node to Expression as the default type for representing expressions. This includes the hall of fame object returned by equation_search, as well as the vector of expressions stored in report(mach).equations for the MLJ interface. If you need to interact with the internal tree structure, you can use get_contents(expression) (which returns the tree of an Expression, or the named tuple of a ParametricExpression - use get_tree to map it to a single tree format).

To access other info stored in expressions, such as the operators or variable names, use get_metadata(expression).

This also means that expressions are now basically self-contained. Functions like eval_tree_array no longer require options as arguments (though you can pass it to override the expression’s stored options). This means you can also simply call the expression directly with input data (in [n_features, n_rows] format).

Before this change, you might have written something like this:

using SymbolicRegression

x1 = Node{Float64}(; feature=1)
options = Options(; binary_operators=(+, *))
tree = x1 * x1

This worked, but only because of some spooky action-at-a-distance behavior involving a global store of the last-used operators! (Noting that Node simply stores an index to the operator, to stay lightweight.)

After this change, things are much cleaner:

options = Options(; binary_operators=(+, *))
operators = options.operators
variable_names = ["x1"]
x1 = Expression(Node{Float64}(; feature=1); operators, variable_names)

tree = x1 * x1

This is now a safe and explicit construction, since * can look up what operators each expression uses and infer the right indices! This operators::OperatorEnum is a tuple of functions, so it does not incur dispatch costs at runtime. (The variable_names is optional, and gets stripped during the evolution process, but is embedded when returned to the user.)

We can now use this directly:

println(tree)       # Uses the `variable_names`, if stored
tree(randn(1, 50))  # Evaluates the expression using the stored operators

Also note that the minimum supported version of Julia is now 1.10. This is because Julia 1.9 and earlier have now reached end-of-life status, and 1.10 is the new LTS release.

Additional Notes

  • Custom Loss Functions: Continue to define these on AbstractExpressionNode (see the sketch after this list).
  • General Usage: Most existing code should work with minimal changes.
  • CI Updates: Tests are now split into parts for faster runs, and use TestItems.jl for better scoping of test variables.
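As a hedged sketch of such a loss (my_loss is a hypothetical name; this assumes the loss_function option and the Dataset fields shown below):

using SymbolicRegression
using SymbolicRegression: AbstractExpressionNode, Dataset, eval_tree_array

function my_loss(tree::AbstractExpressionNode, dataset::Dataset{T,L}, options) where {T,L}
    prediction, completed = eval_tree_array(tree, dataset.X, options)
    !completed && return L(Inf)
    return sum(abs2, prediction .- dataset.y) / dataset.n
end

# model = SRRegressor(; binary_operators=[+, -, *], loss_function=my_loss)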

Be sure to check out the documentation here: Home · SymbolicRegression.jl

You can also see the full CHANGELOG.md for a history of the library.


In general I’m very excited to see what people can do with these new features. They make SymbolicRegression.jl a lot more modular and extensible, so heavy customizations are now possible.

New features seem very useful! Thanks for hard work! - I look forward to test it.

I did install new SymbolicRegression earlier today, but the installation seemed to cause ModelingToolkit to revert to an older version (going from 9.50 → 9.40 or so). I didn’t think MTK depended on SymbolicRegression?? Or perhaps there is a problem if SymbolicRegression contains (parts of) DynamicQuantities?

Are you able to see which dependency is holding things back? SR itself doesn’t depend on ModelingToolkit, but they have a lot of mutual dependencies

I’ll check “soon”. I’m in the midst of preparing exam problems.

Here is what happens when I add SymbolicRegression:

(@v1.11) pkg> add SymbolicRegression
   Resolving package versions...
   Installed LuxCore ──────────────────────── v1.1.0
   Installed DynamicExpressions ───────────── v1.5.0
   Installed SymbolicRegression ───────────── v1.0.0
   Installed MLDataDevices ────────────────── v1.5.3
   Installed LuxLib ───────────────────────── v1.3.7
   Installed ConstructionBase ─────────────── v1.5.6
   Installed Interfaces ───────────────────── v0.3.2
   Installed Symbolics ────────────────────── v6.11.0
   Installed SymbolicUtils ────────────────── v3.4.1
   Installed ModelingToolkitStandardLibrary ─ v2.15.1
   Installed Lux ──────────────────────────── v1.2.3
    Updating `C:\Users\Bernt\.julia\environments\v1.11\Project.toml`
⌃ [b2108857] ↓ Lux v1.3.3 ⇒ v1.2.3
⌃ [961ee093] ↓ ModelingToolkit v9.51.0 ⇒ v9.40.0
⌃ [16a59e39] ↓ ModelingToolkitStandardLibrary v2.17.0 ⇒ v2.15.1
⌅ [429524aa] ↓ Optim v1.10.0 ⇒ v1.9.4
  [8254be44] + SymbolicRegression v1.0.0
    Updating `C:\Users\Bernt\.julia\environments\v1.11\Manifest.toml`
⌅ [187b0558] ↓ ConstructionBase v1.5.8 ⇒ v1.5.6
⌅ [459566f4] ↓ DiffEqCallbacks v4.2.1 ⇒ v3.9.1
  [a40a106e] + DynamicExpressions v1.5.0
⌅ [d9f16b24] ↓ Functors v0.5.1 ⇒ v0.4.12
  [85a1e053] + Interfaces v0.3.2
⌅ [984bce1d] + LambertW v0.4.6
  [30fc2ffe] + LossFunctions v0.11.2
⌃ [b2108857] ↓ Lux v1.3.3 ⇒ v1.2.3
⌃ [bb33d45b] ↓ LuxCore v1.2.0 ⇒ v1.1.0
⌃ [82251201] ↓ LuxLib v1.3.8 ⇒ v1.3.7
⌃ [7e8f7934] ↓ MLDataDevices v1.6.2 ⇒ v1.5.3
⌃ [961ee093] ↓ ModelingToolkit v9.51.0 ⇒ v9.40.0
⌃ [16a59e39] ↓ ModelingToolkitStandardLibrary v2.17.0 ⇒ v2.15.1
⌅ [429524aa] ↓ Optim v1.10.0 ⇒ v1.9.4
  [8254be44] + SymbolicRegression v1.0.0
⌃ [d1185830] ↓ SymbolicUtils v3.7.2 ⇒ v3.4.1
⌃ [0c5d862f] ↓ Symbolics v6.21.0 ⇒ v6.11.0

Should be fixed on 1.0.1 (to-be-registered in 15min), thanks for the report!

    Updating `/private/var/folders/1h/xyppkvx52cl6w3_h8bw_gdqh0000gr/T/jl_aLZbNp/Project.toml`
  [b2108857] + Lux v1.3.3
  [961ee093] + ModelingToolkit v9.52.0
  [8254be44] + SymbolicRegression v1.0.1 `https://github.com/MilesCranmer/SymbolicRegression.jl.git#master`

OK – v1.0.1 seems to have fixed the problem. At least it didn’t downgrade any packages on my home PC. I look forward to try to understand and test the new features!

A short comment from a bystander who has yet to use your package beyond a toy example: this looks fantastic! :heart_eyes:

To get a more concrete idea of the new features… consider the following “experimental” relationship between current density (i'' in my weird notation… i is current, and I use one prime per length dimension, i.e., two primes to make it per area) and voltage (u).

Data
data_oer = [0 1.173

0.017936832458011545 1.4102683482096974

0.016479024323948994 1.4064959951097524

0.020056884433730368 1.4126686433437579

0.02356171637723444 1.4169492506277876

0.021766585307157173 1.4143862383727461

0.02600016723241648 1.4195445534702076

0.030805401601058532 1.421311366853073

0.0335549897981243 1.4246330555142535

0.027998065354027282 1.422847348066021

0.03734260984357481 1.4271516524108123

0.04206060304188324 1.4289961788986647

0.04623965033581807 1.4315879386426131

0.051084505034082454 1.4341449720763708

0.05758306350078065 1.4404714425848681

0.05666557075253883 1.4369360490452994

0.06559662023187465 1.4393740011993852

0.07227334680446507 1.4415694368283214

0.07970885726948984 1.4435655093549191

0.08557577632254196 1.4453786024904824

0.09198081936953889 1.4471119556249883

0.10040715001045586 1.449116594842907

0.10857224985512065 1.4511464706854122

0.11984376871048752 1.4529096501465464

0.1333697803206845 1.4546991192155792

0.14594609660198501 1.4564556777044653

0.15777894618536925 1.4580937491862205

0.1706090772757253 1.4602808869378285

0.18311367937390402 1.4617801905852565

0.1964536135233227 1.4633117376039322

0.20996811620158431 1.4653040143828335

0.22677711845991494 1.4669618485748035

0.24381298468916923 1.4683311278206275

0.26084670473064997 1.4696791683922448

0.27788328635582876 1.4710555271961374

0.2949205833769321 1.472438965558099

0.3119464340632434 1.4737091309909591

0.3289379457451789 1.4746394776365155

0.3459215880719448 1.4754919491433145

0.3629052303987107 1.4763444206501137

0.37988958812140117 1.4772039717149816

0.39687179965631814 1.478042284105643

0.4138561573790086 1.4789018351705112

0.4308397997057745 1.4797543066773102

0.44782415742846493 1.4806138577421784

0.4648063689633819 1.4814521701328398

0.48179072668607226 1.4823117211977077

0.4987743690128382 1.483164192704507

0.5157444188170388 1.4838821526079982

0.5326879989720328 1.4843381688629427

0.5496337253148004 1.4848154237920939

0.566578020865719 1.4852785196051073

0.583523031812562 1.4857486949761896

0.6004658965716316 1.4861976316730652

0.6174080459347766 1.4866394888118721

0.6343530568816197 1.4871096641829544

0.6512959216406892 1.48755860087983

0.6682423633793813 1.48804293536705

0.6851859435343753 1.4884989516219946

0.7021309544812184 1.4889691269930767

0.719073819240288 1.4894180636899526

0.7360181147912066 1.489881159502966

0.7529631257380496 1.4903513348740483

0.769905990497119 1.4908002715709239

0.7868517168398866 1.491277526500075

0.8037952969948807 1.4917335427550196

0.8207381617539503 1.4921824794518952

0.8376824573048689 1.4926455752649086

0.8546281836476365 1.4931228301940598

0.8715717638026305 1.4935788464490043

0.888516059353549 1.4940419422620177

0.9054596395085431 1.4944979585169622

0.9224039350594616 1.4949610543299756

0.9393467998185312 1.4954099910268515

0.9562918107653743 1.4958801663979338

0.9732353909203684 1.4963361826528783

0.9901811172631358 1.4968134375820292

1.007121835834432 1.4972411356046986

1.0240518234668605 1.4975626402563353

1.0409768033278177 1.4978345880014903

1.057901783188775 1.4981065357466452

1.074826763049732 1.4983784834918001

1.0917495967229156 1.4986291925627486

1.1086760073757218 1.4989152994240411

1.1255981256529808 1.499158928936921

1.142523105513938 1.4994308766820759

1.159448085374895 1.4997028244272308

1.176373065235852 1.4999747721723857

1.1932973297008846 1.5002396403594718

1.2102215941659173 1.5005045085465578

1.2271465740268743 1.5007764562917127

1.244069407700058 1.5010271653626612

1.2609951029569397 1.501306192665885

1.2779179366301232 1.5015569017368335

1.294843631887005 1.5018359290400574

1.311768611747962 1.5021078767852123

1.328693591608919 1.5023798245303672

1.3456157098861783 1.5026234540432468

1.36254140514306 1.5029024813464706

1.3794656696080925 1.5031673495335567

1.3963906494690497 1.5034392972787116

1.4133156293300069 1.5037112450238665

1.4302398937950394 1.5039761132109528

1.447164158260072 1.5042409813980389

1.464089138121029 1.5045129291431938

1.481014117981986 1.5047848768883487

1.4979369516551697 1.5050355859592972

1.51485763914058 1.5052650563560392

1.531775465042292 1.5054662085205057

1.5486911447562306 1.505646122010766

1.5656082552620183 1.5058401946171638

1.5825246503718815 1.5060271876654927

1.5994424762735935 1.5062283398299594

1.6163567251956832 1.5063940942040819

1.6332752664933199 1.5066023259266172

1.650191661603183 1.5067893189749462

1.6671102029008198 1.5069975506974818

1.6840258826147583 1.5071774641877418

1.7009394161409235 1.5073361390037956

1.7178572420426357 1.507537291168262

1.7347729217565744 1.5077172046585223

1.7516921784501354 1.5079325159391266

1.7686071427681496 1.508105349871318

1.7855263994617105 1.5083206611519222

1.8024392175919512 1.508472256409907

1.8193570434936632 1.5086734085743736

1.8362734386035264 1.5088604016227025

1.8531905491093141 1.5090544742291003

1.8701047980314038 1.5092202286032228

1.8870262009127383 1.5094567785580337

1.9039383036470545 1.5096012942579495

1.9208589911324647 1.5098307646546916

1.9377703784708564 1.5099682007965387

1.9546896351644176 1.510183512077143

1.9716074610661296 1.5103846642416097

1.988526002363766 1.510592895964145

2.005438105098082 1.510737411664061

2.022356646395719 1.5109456433865966

2.0392637413585635 1.5110406021800307

2.0561765594888044 1.5111921974380156

2.073091523806818 1.511365031370207

2.089998618769663 1.511459990163641

2.106913583087677 1.5116328240958323

2.1238256858219935 1.5117773397957484

2.1407363577644603 1.5119076963795266

2.157652037478399 1.5120876098697866

2.174562709420866 1.512217966453565

2.191476958342956 1.5123837208276876

2.208389061077272 1.5125282365276036

2.225297586831966 1.5126373544371754

2.2422111203581307 1.512796029253229

2.2591239384883712 1.512947624511214

2.2760331796389894 1.5130638219788544

2.2929510055407016 1.513264974143321

2.30986024669132 1.5133811716109618

2.3267687724460133 1.5134902895205335

2.343685167555877 1.5136772825688625

2.360597985686117 1.5138288778268474

2.377509373024509 1.5139663139686945

2.3944221911547494 1.5141179092266792

2.411331432305368 1.51423410669432

2.428247827415231 1.5144210997426488

2.445158499357698 1.5145514563264273

2.4620677405083162 1.514667653794068

2.4789855664100284 1.5148688059585345

2.4958933767687976 1.5149708443100376

2.512805479503114 1.5151153600099534

2.5297154360496563 1.515238637035663

2.546625392596199 1.5153619140613726

2.563536064538666 1.5154922706451508

2.580448167272982 1.5156367863450668

2.597352400652129 1.5157034269062255

2.614260211010898 1.5158054652577286

2.6311751753289125 1.5159782991899198

2.6480801241039833 1.5160520193091476

2.6649900806505262 1.516175296334857

2.681902183384842 1.516319812034773

2.6988085629517626 1.5164076912701383

2.715715657914607 1.5165026500635725

2.7326263298570743 1.5166330066473508

2.7495355710076925 1.5167492041149915

2.7664455275542355 1.5168724811407008

2.7833554841007775 1.5169957581664104

2.800267586835094 1.5171402738663264

2.817172535610165 1.517213993985554

2.83407963057301 1.5173089527789883

2.8509938794950993 1.5174747071531107

2.8679002590620195 1.517562586388476

2.8848080694207887 1.517664624739979

2.9017201721551045 1.517809140439895

2.918630844097572 1.5179394970236733

2.935537223664492 1.5180273762590386

2.952447180211035 1.5181506532847482

2.970001350787588 1.5183150703454225

2.986281806613634 1.5185428102471161

3.0022000737991545 1.5184533429231641

3.0148647019958816 1.518649293401191]

data_oer[:,2] = data_oer[:,2] .- 1.173

sort!(data_oer,dims=1)

[In the included data set, the first column is i'', while the second is u. ]

A. If one is clueless about a good model, then the “old” SymbolicRegression.jl package (prior to v1.0) is fine for building a model of either u(i'') or i''(u).

B. A common model for this relationship looks like this

i'' = i''_0\cdot \left(\exp\left(k_1\cdot u\right) - \exp\left(-k_2\cdot u\right) \right)

Here, i''_0 is a constant (depending on temperature, but the plotted data are at constant temperature). Sometimes, it is assumed that k_1 = k_2, which makes it possible to introduce the \sinh() function, and thus find an explicit function u(i'').

To improve the fit, some publications assume that k_j = k_j(u) or k_j = k_j(i'').
→ Is this what “Template Expressions” is designed to handle, i.e., enforce the above structure of the relationship between u and i'', but let SymbolicRegression.jl find the particular form of k_j(i'') (or k_j(u))?

C. If I simply want to enforce the function form of i'' above, while enforcing i''_0, k_1, and k_2 to be constants…
→ Is this what the “Parametric Expressions” is designed to handle?

D. What if I want to force i''_0 to be a constant (and independent of u or i''), but will allow k_j to depend on, say, i''?

Exactly! This is what TemplateExpression is built to handle. You could do this with:

i_0 = 1
structure = TemplateStructure{(:k1,)}(
    ((; k1), (u,)) -> i_0 * (exp(k1(u) * u) - exp(-k1(u) * u))
)

You are allowed to evaluate an expression twice like the above, though to speed things up you could just do:

i_0 = 1
function form((; k1), (u,))
    k1_u = k1(u) * u
    i_0 * (exp(k1_u) - exp(-k1_u))
end
structure = TemplateStructure{(:k1,)}(form)

It is just a regular function after all.

Then it will learn a function k1 of one argument that optimises your loss function.

I think this sort of thing would be better off with Optim.jl directly; SymbolicRegression is for when you also want to optimise over the symbolic space.

ParametricExpressions let you jointly solve for a symbolic form AND parameters that go into that symbolic form. There’s an example here: Parameterized Expressions · SymbolicRegression.jl that hopefully helps clarify.

Right now there is no way to combine these in a ParametricTemplateExpression (maybe there should be), but in theory there is nothing stopping you. You can create your own AbstractExpression which has a full interface described: Customization · SymbolicRegression.jl using Interfaces.jl. It is pretty flexible. ParametricExpression, TemplateExpression, and regular Expression are all just implementations of this interface!

OK – I have run the code, and SymbolicRegression has found an expression for k1:

julia> r = report(mach)
julia> r.equation_strings[r.best_idx]
k1 = (((0.4165421404587887 - ((-4.524408239848075 - (#1 * -16.415503202513243)) * ((#1 * -23.199723760515724) + 6.297629172066734))) * (#1 + -0.2364119362351886)) * 38.57101510129755) * #1

Question 1: What is #1? Is it the input variable? I used X = (u=data_oer[:,2],). In other words, is it the same as u?

Next, how do I predict the output using the model?

julia> predict(mach,X)
AssertionError: typeof(r) === typeof(l)

Stacktrace:
  [1] apply_operator(op::typeof(*), l::Co...

Question 2: This used to work for regular SymbolicRegression, prior to v1.0. Should it work here?

Finally…
Question 3: Is there a way to convert the equation_string into a proper function, say f(u), so that I can predict single values of u, and not the entire training vector X?

It’s the first argument to the function. The reason it doesn’t show the variable names is because you might choose to reuse the function in the template, like:

f(x) - f(y)

or even

f(f(x))

So the #1 simply indicates the first arg to f, whatever that might be.

Oops, this looks like a bug. I’ll try to take a look later today.

So you can use the expression as a callable, like

eq = report(mach).equations[end]
eq(X)

But indeed this is set up for batched input by default. I think the best option is to wrap it with something that packs/unpacks from a 1x1 matrix. The reason for this is that general vector input is ambiguous (is it a single input with n features, or 1D input with n examples?)
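A minimal sketch of such a wrapper (f_scalar is just an illustrative name):

eq = report(mach).equations[end]
f_scalar(u) = only(eq(reshape([u], 1, 1)))  # pack into a 1×1 matrix, unpack the single output
f_scalar(0.2)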

@BLI The aforementioned bug should be fixed on 1.0.3 once everything gets merged and registered: fix: `predict` for TemplateExpressions by MilesCranmer · Pull Request #374 · MilesCranmer/SymbolicRegression.jl · GitHub (was just a missing method)

Now, things work!

I did some tweaking – see comment at the bottom of this follow-up.

i₀ = 1
R = 8.314
F = 9.648e4
T = 273.15 + 80
function form((; α₁, α₂), (u,))
    k₁ = α₁(u) * F/R/T * u
    k₂ = α₂(u) * F/R/T * u
    i₀ * (exp(k₁) - exp(-k₂))
end
structure = TemplateStructure{(:α₁,:α₂)}(form)
#
model_oer = SRRegressor(
    batching=true,
    niterations=400,
    binary_operators=[+,-,*],
    maxsize=30,
    timeout_in_seconds=60.0 *5, 
    save_to_file=false,    # causes no report file to be written
    expression_type = TemplateExpression,
    expression_options=(;structure),
)

The result of fitting is as below:

If I use

y_hat = predict(mach,X)

I get the following fit:


The “unfortunate” bump for low values of i’’ is simply because the predict function computes outputs based on the training data.

However, I’d like to convert the result to functions. To achieve this, I created two “utility functions”:

function string_equation_2_function(str, var)
    # str: string with semi-colon separated equations 
    # var: vector of strings with variable names
    #
    # Replace dummy variable names of type "#1", etc. 
    # with variable names
    _str = str
    for i in 1:length(var)
        _str = replace(_str,"#"*string(i)=>var[i])
    end
    _str = replace(_str, " "=>"")
    # Split string of equations into vector of equations
    str_vec = split(_str,";")
    #
    # Convert vector of variable strings into comma separated variable string
    var_str = ""
    for i in 1:length(var)
        var_str = var_str*var[i]*","
    end
    var_str = var_str[1:end-1]
    #
    # Replace LHS variable names with functions with comma separated variables 
    for i in 1:length(str_vec)
        fnm,_ = split(str_vec[i],"=")
        str_vec[i] = replace(str_vec[i],fnm => fnm*"("*var_str*")")
    end
    return str_vec
end
#
function string_function_2_function(str)
    return eval(Meta.parse(str))
end

The fitted models are:

With the utility function, I can do:

julia> var = ["u"]
julia> svec = string_equation_2_function(r.equation_strings[2],var)
2-element Vector{SubString{String}}:
 "α₁(u)=u+u"
 "α₂(u)=u*-1.9997011305462111"
julia> svec |> latexify

and then:

julia> string_function_2_function.(svec)
2-element Vector{Function}:
 α₁ (generic function with 1 method)
 α₂ (generic function with 1 method)
julia> α₁(0.2)
0.4

and then find the function value for more input data than used in the model fitting, leading to:

Wish… it would be nice if SymbolicRegression offered built-in functions that do the same as my utility functions string_equation_2_function(str, var) and string_function_2_function(str). I don’t really like the names I “invented” – and the implementation can surely be done in a more professional way. But I think the functionality of these two utility functions is very useful.

I would declare these const so you don’t get a performance hit.
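For the constants defined above, that would look like:

const i₀ = 1
const R = 8.314
const F = 9.648e4
const T = 273.15 + 80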

It looks like the data has that bump too? Or maybe I’m missing something. You could try with different equations too, like predict(mach, (data=X, idx=idx)) for idx anywhere from 1 to length(report(mach).equations). If you have some test data you could just see which equation has the best validation loss.

Just to check, why can’t you use the returned equation as the function? Like report(mach).equations[10] if it’s the 10th equation. Since it’s callable you can just use that.

Or maybe you mean: you want some code representing the function that you can paste into a script, and use that?

In the future it might be nice if DynamicExpressions.jl had a convert(Expr, ::AbstractExpression) for this purpose. I feel like it actually wouldn’t be too hard to get this working. (PRs appreciated!)

Oh I see. So for this, actually the function

get_tree(ex::AbstractExpression)

is supposed to generate a single closed-form binary tree (which you could pass to string_tree to get the string you are looking for). But it isn’t yet set up for TemplateExpression to handle all cases.

Technically it won’t be possible to handle ALL cases (without limitations on the structure functions), but for scalar expressions like this it should be possible to get it working… Will think more when I get a chance. But if you are interested, the code is here:

[Warning: boring technical/internal details] – The reason this version is incorrect is that it assumes the number of features in the dataset == max number of arguments to any function. But what it should do is pass an infinite iterator for variable_expressions that constructs the variables. That way, if you have ((; f, g), (x1, x2, x3)), get_tree will actually be able to connect x3 to the 3rd input feature.

Thanks for very useful tips. I’ll take a look at it tomorrow.

P.S., it turns out you can just use nullary functions for including scalar parameters in the fixed structure :metal:

structure = TemplateStructure{(:a, :f, :g)}(
    ((; a, f, g), (x, y)) -> a() * cos(f(x) + g(y))
)

which is equivalent to the functional form:

\begin{align*} \text{output} &= a \cos(f(x) + g(y)), \ \ \text{where} \\ & a \in \mathbb{R} \\ & f: \mathbb{R} \rightarrow \mathbb{R}\\ & g: \mathbb{R} \rightarrow \mathbb{R} \end{align*}

OK – it is not obvious to me what this function does. Taking, say, one of the fitted equations from the report shown earlier, I would have guessed that using the function returns two values, one for \alpha_1 and one for \alpha_2. But it seems to only return one value…

julia> r = report(mach)
julia> r.equations[3](0.1)   # doesn't work
julia> r.equations[3]([0.1]) # doesn't work
julia> r.equations[3]([0.1]')
1-element Vector{Float64}:
 -0.16067933839153825

Does this mean that it returns – not the equation listed – but the composed expression, with input argument organized as a row matrix (in the single input case) or a matrix with one row per input argument and one column per value I want it to evaluate?

In other words, that r.equations[i] simply does the same as predict(mach, (data=X,idx=i)), with the main difference that predict takes data in a named tuple form, while r.equations takes data organized in a matrix?

You have to always give it 2D input of the form (num_features, num_rows) - which is due to the ambiguity issues mentioned above.

If you want to get the individual expressions out, you can do:

X = randn(1, 1)
ex = r.equations[end]
sub_expressions = get_contents(ex)

sub_expressions.α₁(X)
sub_expressions.α₂(X)