Performance and allocation with recursive tree-equations

Hi Julia community,

I am trying to build a tree-equation structure that should evaluate as efficiently as possible. I’ve tried many different approaches so far, but I can’t get rid of some allocations, and some of the behavior confuses me. I am open to any suggestions to speed this up.

To build equations, I have an abstract supertype and three kinds of mutable structs. The opera_node is for *, +, - (arity == 2) and cos, sin, etc. (arity == 1).
The const_leav is just a constant, and the varia_leav holds the index of its variable, which lives in a tuple of tuples.

abstract type equation_nodes end

mutable struct opera_node <: equation_nodes
    arity::Int64; opera::Int64
    lef::equation_nodes; rig::equation_nodes

    opera_node(a::Int64, o::Int64) = (x = new(); x.arity = a; x.opera = o; x)
    opera_node(a::Int64, o::Int64, l::equation_nodes) = (x = new(); x.arity = a; x.opera = o; x.lef = l; x)
    opera_node(a::Int64, o::Int64, l::equation_nodes, r::equation_nodes) = (x = new(); x.arity = a; x.opera = o; x.lef = l; x.rig = r; x)
end

mutable struct const_leav <: equation_nodes
    value::Float64
end

mutable struct varia_leav <: equation_nodes
    value::Int64
end

vars = ((1.0, 2.0, 3.4, 4.1, 5.3, 6.3), (7.0, 2.0, 2.4, 1.1, 5.0, 6.4))

The function that evaluates the equation descends recursively and evaluates from the bottom up:

function eval_equation(eq::equation_nodes, vars::Tuple, ops::Tuple)::typeof(vars[1])
    if typeof(eq) <: const_leav
        zeros = vars[1] .- vars[1]  # to get a 0.0-tuple of appropriate length
        cur_val = zeros .+ eq.value # and ensure to return always same type

    elseif typeof(eq) <: varia_leav
        cur_val = vars[eq.value]

    elseif eq.arity == 1 
        cur_val = ops[eq.opera].(eval_equation(eq.lef, vars, ops))
        # cur_val = ops[1][eq.opera].(eval_equation(eq.lef, vars, ops))

    elseif eq.arity == 2
        cur_val = ops[eq.opera].(eval_equation(eq.lef, vars, ops), eval_equation(eq.rig, vars, ops))
        # cur_val = ops[2][eq.opera].(eval_equation(eq.lef, vars, ops), eval_equation(eq.rig, vars, ops))
    end

    return cur_val
end

Now, if I test the same simple equation, which essentially just adds two tuples, I get different execution times and allocation counts depending on how many entries there are in the ops tuple containing the operations, even though the extra entries are unused.

eq1 = opera_node(2, 1, varia_leav(1), varia_leav(2))

# comparison
f(vars) = vars[1] .+ vars[2]
@btime f($vars)
# --> 0.875 ns (0 allocations: 0 bytes)

# with three entries in ops
ops = (+, *, cos) # -> 50 ns with 3 entries and 200 ns with 4+ entries
@btime eval_equation($eq1, $vars, $ops)
# --> 48.155 ns (4 allocations: 352 bytes)

# with unused 4th entry -> huh?
ops = (+, *, cos, -) # -> 50 ns with 3 entries and 200 ns with 4+ entries
@btime eval_equation($eq1, $vars, $ops)
# --> 268.736 ns (7 allocations: 640 bytes)

In the end, I’d like to have a tuple of tuples like the following, to distinguish unary from binary operations more cleanly, but this yields even longer calculation times and more allocations:

ops = ((sin, cos), (+, -, *, /)) # -> the form I want to end up with
eq2 = opera_node(2, 1, const_leav(2), opera_node(2, 3, const_leav(5), opera_node(1, 2, opera_node(2, 3, const_leav(5), varia_leav(1)))))

# the commented-out lines in the elseifs of eval_equation need to be swapped in, because of the nesting of the ops tuple

# --> 891.667 ns (24 allocations: 2.11 KiB) 

Am I doing something wrong with the typing, and is there something I am missing? Maybe I need to add something to the abstract type? How can the results be this different? I love Julia and I want to get the best out of it :slight_smile:

Thank you very much in advance

The lef and rig fields have an abstract type, so the compiler cannot figure out how to represent the structure in memory in an efficient way. You’ll also incur dynamic dispatch each time you call a function on those fields. Try using a type parameter instead, or store information as values rather than in the type domain.

Thank you for your input!

With “store information as values”, did you have the following in mind?

vars = ((1.0, 2.0, 3.4, 4.1, 5.3, 6.3), (7.0, 2.0, 2.4, 1.1, 5.0, 6.4))
ops = ((sin, cos), (+, -, *, /)) # -> the form I want to end up with

eq1 = (2, 1, (0, 1), (0, 2))

function eval_nested_tuple(eq::Tuple, vars::Tuple, ops::Tuple)::typeof(vars[1])
    if eq[1] == -1
        return eq[2] .+ (vars[1] .- vars[1])  # .+ (not .*), so the constant survives
    elseif eq[1] == 0
        return vars[eq[2]]
    elseif eq[1] == 1
        return ops[1][eq[2]].(eval_nested_tuple(eq[3], vars, ops))
    elseif eq[1] == 2
        return ops[2][eq[2]].(eval_nested_tuple(eq[3], vars, ops), eval_nested_tuple(eq[4], vars, ops))
    end
end

eval_nested_tuple(eq1, vars, ops)
@btime eval_nested_tuple($eq1, $vars, $ops)
# --> 227.079 ns (5 allocations: 416 bytes)

This still has more or less the same timings.
And as for the type parameter, I am not sure whether I did it right; I can only get it to work without dynamically adding nodes. It still yields the same performance for the simple case of adding the two tuples, however.

abstract type equation_nodes end

mutable struct opera_node{a, o, L, R} <: equation_nodes
    arity::a
    opera::o
    lef::L
    rig::R
    # opera_node(a::Int64, o::Int64) = (x = new(); x.arity = a; x.opera = o; x)
    # opera_node(a::Int64, o::Int64, l::equation_nodes) = (x = new(); x.arity = a; x.opera = o; x.lef = l; x)
    # opera_node(a::Int64, o::Int64, l::equation_nodes, r::equation_nodes) = (x = new(); x.arity = a; x.opera = o; x.lef = l; x.rig = r; x)
end

mutable struct const_leav
    value::Float64
end

mutable struct varia_leav
    value::Int64
end

function eval_equation(eq, vars::Tuple, ops::Tuple)::typeof(vars[1])
    if typeof(eq) <: const_leav
        zeros = vars[1] .- vars[1]  # to get a 0.0-tuple of appropriate length
        cur_val = zeros .+ eq.value # and ensure to return always same type

    elseif typeof(eq) <: varia_leav
        cur_val = vars[eq.value]

    elseif eq.arity == 1
        cur_val = ops[1][eq.opera].(eval_equation(eq.lef, vars, ops))

    elseif eq.arity == 2
        cur_val = ops[2][eq.opera].(eval_equation(eq.lef, vars, ops), eval_equation(eq.rig, vars, ops))
    end

    return cur_val
end

vars = ((1.0, 2.0, 3.4, 4.1, 5.3, 6.3), (7.0, 2.0, 2.4, 1.1, 5.0, 6.4))
ops = ((sin, cos), (+, -, *, /)) # -> the form I want to end up with
eq1 = opera_node(2, 1, varia_leav(1), varia_leav(2))

eval_equation(eq1, vars, ops)

@btime eval_equation($eq1, $vars, $ops)
# --> 225.687 ns (5 allocations: 416 bytes)

I am sorry if I am doing something obviously wrong. I started learning Julia during Advent of Code this December and otherwise only know Python, where I didn’t learn much about types.

Unfortunately, the following doesn’t speed it up either

mutable struct opera_node_1{o,L}
    opera::o
    lef::L
end

mutable struct opera_node_2{o,L,R}
    opera::o
    lef::L
    rig::R
end

mutable struct const_leav
    value::Float64
end

mutable struct varia_leav
    value::Int64
end

eval_equation(eq::opera_node_1, vars, ops) = ops[1][eq.opera].(eval_equation(eq.lef, vars, ops))
eval_equation(eq::opera_node_2, vars, ops) = ops[2][eq.opera].(eval_equation(eq.lef, vars, ops), eval_equation(eq.rig, vars, ops))
eval_equation(eq::const_leav, vars, ops) = eq.value .+ (vars[1] .- vars[1])
eval_equation(eq::varia_leav, vars, ops) = vars[eq.value]

vars = ((1.0, 2.0, 3.4, 4.1, 5.3, 6.3), (7.0, 2.0, 2.4, 1.1, 5.0, 6.4))
ops = ((sin, cos), (+, -, *, /))
eq1 = opera_node_2(1, varia_leav(1), varia_leav(2))

eval_equation(eq1, vars, ops)
@btime eval_equation($eq1, $vars, $ops)
# --> 224.294 ns (5 allocations: 416 bytes)

I checked with a profiler and ended up with the following:

using BenchmarkTools

abstract type equation_nodes end

struct opera_node{L, R} <: equation_nodes
    arity::Int64; opera::Int64
    lef::L; rig::R

    opera_node(a::Int64, o::Int64) = new{Nothing, Nothing}(a, o, nothing, nothing)
    opera_node(a::Int64, o::Int64, l::L) where {L} = new{L, Nothing}(a, o, l, nothing)
    opera_node(a::Int64, o::Int64, l::L, r::R) where {L, R} = new{L, R}(a, o, l, r)
end

struct const_leav <: equation_nodes
    value::Float64
end

struct varia_leav <: equation_nodes
    value::Int64
end

function eval_equation(eq, vars, ops)
    if typeof(eq) <: const_leav
        zeros = vars[1] .- vars[1]  # to get a 0.0-tuple of appropriate length
        cur_val = zeros .+ eq.value # and ensure to return always same type

    elseif typeof(eq) <: varia_leav
        cur_val = vars[eq.value]

    elseif eq.arity == 1
        lef = eval_equation(eq.lef, vars, ops)
        cur_val = ops[eq.opera].(lef)

    elseif eq.arity == 2
        lef = eval_equation(eq.lef, vars, ops)
        rig = eval_equation(eq.rig, vars, ops)
        cur_val = ops[eq.opera].(lef, rig)
    else

        cur_val = nothing
    end

    return cur_val
end

@btime eval_equation(eq1, vars, ops) setup = (
    eq1 = opera_node(2, 1, varia_leav(1), varia_leav(2));
    vars = ((1.0, 2.0, 3.4, 4.1, 5.3, 6.3), (7.0, 2.0, 2.4, 1.1, 5.0, 6.4));
    ops = (+, -, *, /, cos) 
)

yielding

  1.200 ns (0 allocations: 0 bytes)

Edit: eliminated the last allocation via the benchmark setup; now constant propagation does its job :slight_smile:

Thank you for your input. That really clarified for me how the type syntax should look in this case. But somehow I get weird behavior with this solution. When I execute the exact same code, I get

0.001 ns (0 allocations: 0 bytes)

which seems a little unrealistic, and if I do the following

function scaling(eq1, vars, ops)
    for i = 1:1000
        eval_equation(eq1, vars, ops)
    end
end

@btime scaling(eq1, vars, ops) setup = (
    eq1 = opera_node(2, 1, varia_leav(1), varia_leav(2));
    vars = ((1.0, 2.0, 3.4, 4.1, 5.3, 6.3), (7.0, 2.0, 2.4, 1.1, 5.0, 6.4));
    ops = (+, -, *, /, cos) 
)
# -> 222.166 μs (5000 allocations: 406.25 KiB)

it doesn’t seem to scale the way it’s supposed to.
Your input helped me already very much :slight_smile:

The profiler shows something similar to before (screenshot omitted).

The only improvement I see in the current design is hard-coding the dispatch to ops, like

function eval_equation(eq, vars, ops)
    if typeof(eq) <: const_leav
        zeros = vars[1] .- vars[1]  # to get a 0.0-tuple of appropriate length
        cur_val = zeros .+ eq.value # and ensure to return always same type

    elseif typeof(eq) <: varia_leav
        cur_val = vars[eq.value]

    elseif eq.arity == 1
        lef = eval_equation(eq.lef, vars, ops)
        if ops[eq.opera] === -
            cur_val = .- lef
        else
            cur_val = ops[eq.opera].(lef)
        end

    elseif eq.arity == 2
        lef = eval_equation(eq.lef, vars, ops)
        rig = eval_equation(eq.rig, vars, ops)
        if ops[eq.opera] === +
            cur_val = lef .+ rig
        elseif ops[eq.opera] === -
            cur_val = lef .- rig
        elseif ops[eq.opera] === *
            cur_val = lef .* rig
        elseif ops[eq.opera] === /
            cur_val = lef ./ rig
        else
            cur_val = ops[eq.opera].(lef, rig)
        end
    else

        cur_val = nothing
    end

    return cur_val
end

With this I get

  25.700 μs (1000 allocations: 62.50 KiB)

One question: is there a reason why the operation is coded as an Int64, so that we have to look it up in the passed tuple? Otherwise we might have a bit more wiggle room…

This looks a bit better

using BenchmarkTools

abstract type equation_nodes end

struct opera_node{O, L, R} <: equation_nodes
    arity::Int64; opera::O
    lef::L; rig::R

    opera_node(a::Int64, o::O) where {O} = new{O, Nothing, Nothing}(a, o, nothing, nothing)
    opera_node(a::Int64, o::O, l::L) where {O, L} = new{O, L, Nothing}(a, o, l, nothing)
    opera_node(a::Int64, o::O, l::L, r::R) where {O, L, R} = new{O, L, R}(a, o, l, r)
end

struct const_leav <: equation_nodes
    value::Float64
end

struct varia_leav <: equation_nodes
    value::Int64
end

function eval_equation(eq, vars)
    if typeof(eq) <: const_leav
        zeros = vars[1] .- vars[1]  # to get a 0.0-tuple of appropriate length
        cur_val = zeros .+ eq.value # and ensure to return always same type

    elseif typeof(eq) <: varia_leav
        cur_val = vars[eq.value]

    elseif eq.arity == 1
        lef = eval_equation(eq.lef, vars)
        cur_val = eq.opera.(lef)

    elseif eq.arity == 2
        lef = eval_equation(eq.lef, vars)
        rig = eval_equation(eq.rig, vars)
        cur_val = eq.opera.(lef, rig)

    else
        cur_val = nothing

    end

    return cur_val
end

function scaling(eq1, vars)
    for i = 1:1000
        eval_equation(eq1, vars)
    end
end

@btime scaling(eq1, vars) setup = (
    eq1 = opera_node(2, +, varia_leav(1), varia_leav(2));
    vars = ((1.0, 2.0, 3.4, 4.1, 5.3, 6.3), (7.0, 2.0, 2.4, 1.1, 5.0, 6.4))
)

yielding

  257.880 ns (0 allocations: 0 bytes)

Wow, thank you very much :slight_smile: That works perfectly. I have tried so many iterations, and I had the thought of providing the operation directly, but I am not that “smooth” with Julia yet.
As for “does it have to be an integer”: no. I already have functions that grow equations recursively, and where the lookup happens doesn’t really matter to them. I just thought it might be beneficial to store it as an integer.
Thanks again for the help and have a nice evening


Yeah, performance prediction isn’t that easy (hence my extensive use of a profiler to find hot spots), and I mostly need to experiment to fix issues. My rule of thumb: easier type inference for the compiler => better optimization possibilities.
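A quick, self-contained way to check that rule of thumb on any snippet is @code_warntype (the function here is just a toy for illustration, not from the code above):

```julia
using InteractiveUtils  # provides @code_warntype (loaded by default in the REPL)

# a deliberately type-unstable toy function:
# the return type depends on a runtime value
unstable(x) = x > 0 ? 1 : 1.0

@code_warntype unstable(2)
# the printout ends in Body::Union{Float64, Int64} -- a (red) Union or Any
# annotation like this marks exactly the inference failures that lead to
# dynamic dispatch and allocations
```

Running it on eval_equation with the various node designs should show where inference gives up.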

Nice evening, too, and see you around!

Hello again,

unfortunately, the previous structure for the equations had a serious drawback: it took multiple orders of magnitude longer to build an equation. My goal is to build a basis for multiple symbolic regression algorithms, to try different algorithms, and maybe to combine them. I am sorry that I failed to specify these constraints in the last post. There is a wonderful package from Miles Cranmer, which is a bit too code-heavy for trying different things out, but I adopted parts of its structure to get it to work for me. Thus, I changed the structure to consist of only one struct and type, and this works in some cases. But try as I might to change things around and look at the PProf profiler, I can’t understand why Julia acts this way.

I define the struct, the function that evaluates, the variables, and an example equation.

using BenchmarkTools

mutable struct Node
    ari::Int32
    val::Float32 
    ind::Int32
    lef::Node
    rig::Node

    Node(val::Float64) = new(-1, convert(Float32, val))
    Node(i::Int) = new(0, 0f0, i)
    Node(a::Int, op::Int) = new(a, 0f0, op)
    Node(op::Int, l::Node) = new(1, 0f0, op, l)
    Node(op::Int, l::Node, r::Node) = new(2, 0f0, op, l, r)
end

function eval_equation(node::Node, vars::Tuple, ops::Tuple)::typeof(vars[1])
    if node.ari == -1
        return node.val .+ (vars[1] .- vars[1])
    elseif node.ari == 0
        return vars[node.ind]
    elseif node.ari == 1
        lef = eval_equation(node.lef, vars, ops)
        return ops[1][node.ind].(lef)
    elseif node.ari == 2
        lef = eval_equation(node.lef, vars, ops)
        rig = eval_equation(node.rig, vars, ops)
        return ops[2][node.ind].(lef, rig)
    else 
        return vars[1]
    end
end

# variables
vars = (Float32.((1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0)), 
        Float32.((11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0, 14.5, 15.0)))

# equation
eq = Node(1, Node(2.0), Node(2, Node(5.0), Node(2, Node(2, Node(5.0), Node(1)))))

# available operations
ops = ((sin, cos, exp, abs, log, sqrt), (+, -, *, /, ^)) # allocates
# ops = ((sin, cos, exp), (+, -, *))                     # does not allocate 

@btime eval_equation($eq, $vars, $ops);

In every configuration I have tried so far, if I provide more than three functions in each of the two tuples named ops, the calculation allocates.
If there is no way to do this in this manner, I will have to spell everything out and make it more explicit, without any broadcasting but with for loops, which should work. But seeing as it works perfectly well with three functions per tuple, I don’t understand why it would not for more than that. I also tried (among many other similar configurations) to outsource the calculation itself and dispatch based on the operator, but that didn’t work either.
Help would be very much appreciated, but if push comes to shove, I’ll just spell it all out in for loops and allocate manually.
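The “spell it out” fallback I have in mind would look roughly like this (a sketch with a single hard-coded operation, writing into a pre-allocated buffer instead of broadcasting over tuples):

```julia
# hypothetical sketch: one manually allocated output buffer per evaluation,
# filled element by element instead of broadcasting
function add_into!(out::Vector{Float32}, a, b)
    @inbounds for i in eachindex(out, a, b)
        out[i] = a[i] + b[i]
    end
    return out
end

a = Float32[1.0, 2.0, 3.0]
b = Float32[10.0, 20.0, 30.0]
out = similar(a)
add_into!(out, a, b)  # out is now Float32[11.0, 22.0, 33.0]
```

One such branch per operator, reusing the buffers across nodes, should sidestep the tuple broadcasting entirely.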

Thank you very much community :slight_smile:

Hi, one of the problems is that each function has its own type, and thus operations such as ops[1][i] are generally type unstable.
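To illustrate with a minimal example (independent of the code above): every function is its own singleton type, so a tuple of different functions is heterogeneous, and indexing it with a runtime integer cannot be inferred to a single concrete type.

```julia
ops = (sin, cos, exp, abs)
typeof(sin) === typeof(cos)   # false -- each function has its own type

pick(t, i) = t[i]             # the return type depends on the runtime value of i
pick(ops, 2)                  # works (returns cos), but is inferred as a Union of
                              # all four function types, forcing dynamic dispatch
                              # at the broadcast call site
```

The compiler can still statically split small unions into branches, which is presumably why your three-entry tuples were fast while four or more entries fell off that cliff and started allocating.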

I’m not sure if this solves your problem, but just for fun I tried to make type-stable version of your code, keeping kind of the same logic in terms of “arities”:

struct FunNode{Arity, F, LeftNode, RightNode}
    f::F
    left::LeftNode
    right::RightNode
    FunNode{A}(f, left, right) where {A} =
        new{A, typeof(f), typeof(left), typeof(right)}(f, left, right)
end

FunNode(f, left, right) = FunNode{2}(f, left, right)
FunNode(f, left) = FunNode{1}(f, left, nothing)
FunNode(x::AbstractFloat) = FunNode{-1}(x, nothing, nothing)
FunNode(x::Integer) = FunNode{0}(x, nothing, nothing)

eq = FunNode(
    +,
    FunNode(2.0),
    FunNode(
        -,
        FunNode(5.0),
        FunNode(
            cos,
            FunNode(
                -,
                FunNode(5.0),
                FunNode(1),
            )
        ),
    ),
)

@inline function eval_equation(node::FunNode{2}, vars)
    lef = eval_equation(node.left, vars)
    rig = eval_equation(node.right, vars)
    map(node.f, lef, rig)
end

@inline function eval_equation(node::FunNode{1}, vars)
    lef = eval_equation(node.left, vars)
    map(node.f, lef)
end

@inline function eval_equation(node::FunNode{0}, vars)
    i = node.f :: Integer
    vars[i]
end

@inline function eval_equation(node::FunNode{-1}, vars)
    x = node.f :: Number
    # x .+ (vars[1] .- vars[1])
    map(_ -> x, vars[1])  # this is equivalent to the line above (but a bit faster)
end

@btime eval_equation($eq, $vars)   # 79.278 ns (0 allocations: 0 bytes)

For comparison, in your example I get: 332.151 ns (11 allocations: 528 bytes), and the same results. Note that here I’m using Float64, but switching to Float32 as in your example is trivial.


This is, again like the previous input from goerch, very fast in execution, but it needs to compile anew for each equation and takes ages to generate a new one. Thank you very much for the approach, though :slight_smile: To clarify: I need to generate thousands of equations, evaluate them, keep the slightly better ones, modify them, and generate new ones (genetic programming in a nutshell). The generation looks like this (a bit simplified):

"""Builds the equations from the bottom up."""
function grow_equation(rem_depth, vars, ops)
    next_arity = rem_depth > 1 ? 2 : 0
        
    if next_arity == 0
        next_node = FunNode(ceil(Int64, rand() * length(vars)))
                    
    elseif next_arity == 2
        next_node = FunNode(ops[2][ceil(Int64, rand() * length(ops[2]))], 
                            grow_equation(rem_depth - 1, vars, ops), 
                            grow_equation(rem_depth - 1, vars, ops))
    end
    return next_node
end

function generate_many_equations(number, depth, vars, ops)
    eqs = FunNode[]
    for i = 1:number
        push!(eqs, grow_equation(depth, vars, ops))
    end
    return eqs
end

ops = ((sin, cos), (+, -, *))

@time eqs = generate_many_equations(10, 10, vars, ops); # 5000 equations would be the goal
@time for i in 1:10 eval_equation(eqs[i], vars) end # -> compilation for each equation
@time for i in 1:10 eval_equation(eqs[i], vars) end # -> blazing fast evaluation

In the proposed structure, the type of a parent FunNode also depends on the types of its subnodes, so the types explode exponentially for larger equations. I adapted it to give each node a stable type that does not depend on the subnodes, but landed where I started :smiley:
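My adaptation looked roughly like this (a reconstructed sketch, not my exact code): a self-referential node whose field types never mention the children, which necessarily makes the payload field abstract again and brings back the dynamic dispatch.

```julia
# reconstructed sketch of a "stable type" node: the child fields are always
# StableNode, but the payload must then be abstract (Any), so every access
# to f goes through dynamic dispatch again -- right back where I started
mutable struct StableNode
    f::Any                # function, constant, or variable index
    left::StableNode      # self-referential; left undefined for leaves
    right::StableNode
    StableNode(f) = (n = new(); n.f = f; n)
    StableNode(f, l::StableNode, r::StableNode) = new(f, l, r)
end

leafx = StableNode(1)                 # variable index 1
leafy = StableNode(2)                 # variable index 2
root  = StableNode(+, leafx, leafy)   # represents x + y
```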

I will take this day to try to get a good solution to work, but will probably settle on something by tomorrow for the time being. I guess this is quite a special case. One can’t expect full flexibility and full performance at the same time, since type rigidity is exactly how the performance is gained.

Thanks again :slight_smile: I really learn a lot from you guys
