What is the most efficient way to copy a tree with shared children? While this is technically a graph rather than a tree, here there is a clear root node and directionality, so it’s more intuitive to think of it as a tree with shared children nodes.
I am attempting to go from expression trees to expression graphs in SymbolicRegression.jl (pull request here).
Here is a simplified version of my data structure for a tree (full version here):
mutable struct Node
degree::Int # 0 for constant/variable, 1 for cos/sin, 2 for +/* etc.
val::Float32 # If is a leaf, this stores the value
op::Int # enum over operators
l::Node # Left child node. Only defined for degree=1 or degree=2.
r::Node # Right child node. Only defined for degree=2.
Node(val::Float32) = new(0, val)
Node(op::Int, l::Node) = new(1, 0f0, op, l)
Node(op::Int, l::Node, r::Node) = new(2, 0f0, op, l, r)
end
One can create a tree that has multiple nodes with the same child node. For example, the expression
\cos(x - 3.2 y) + \cos(2 (x - 3.2 y)) could be stored as the following tree:
Here, both the cos
on the left branch and the *
on the right branch would link to this shared -
node. Now, since this is implemented as a tree, how should this be copied? The normal way of copying this tree would be:
function copy_node(tree::Node)::Node
if tree.degree == 0
Node(copy(tree.val))
elseif tree.degree == 1
Node(copy(tree.op), copy_node(tree.l))
else
Node(
copy(tree.op),
copy_node(tree.l),
copy_node(tree.r),
)
end
end
The problem with this is that it breaks the topology: shared child nodes are duplicated! For example:
base_tree = Node(1, Node(1f0)) # e.g., if 1 is cos, cos(1.0)
tree = Node(1, base_tree, base_tree) # e.g., if 1 is +, cos(1.0) + cos(1.0)
objectid(tree.l) == objectid(tree.r) # true!
ctree = copy_node(tree)
objectid(ctree.l) == objectid(ctree.r) # false!
So, I use the following trick with IdDict
:
function copy_node(tree::Node, id_map::IdDict{Node,Node})::Node
get!(id_map, tree) do
if tree.degree == 0
Node(copy(tree.val))
elseif tree.degree == 1
Node(copy(tree.op), copy_node(tree.l, id_map))
else
Node(
copy(tree.op),
copy_node(tree.l, id_map),
copy_node(tree.r, id_map),
)
end
end
end
what this basically does is store a dictionary mapping objectid(node) => copied_node
. Thus, if a particular node (referenced by its ID) already has been copied, it simply returns that copy instead. This works!
ctree = copy_node(tree, IdDict{Node,Node}())
objectid(ctree.l) == objectid(ctree.r) # true
However, this is about significantly slower than a normal copy. I don’t understand the reason for this, because it’s only storing the reference information in an IdDict
. If anything I almost think it should be faster, since it results in fewer allocations. Is there a faster way I can copy a tree, preserving its topology?
Here’s a benchmark:
using BenchmarkTools
base_tree = Node(8, Node(1f0), Node(2, Node(3f0)))
tree = Node(2, base_tree, Node(2, Node(1, base_tree, base_tree), Node(2f0)))
@btime [copy_node($tree) for i=1:1_000];
# 101.292 μs (16001 allocations: 757.94 KiB)
@btime [copy_node($tree, IdDict{Node,Node}()) for i=1:1_000];
# 300.042 μs (10001 allocations: 711.06 KiB)
Even with fewer total allocations, for some reason this topology-preserving copy is 3x slower. Some part of this may just be constructing IdDict{Node,Node}()
- it seems like that part accounts for 30us. But, is there a more efficient way to do this?
Thanks!
Miles