# Best way to store a sparse, unbalanced, and not complete binary tree

I am implementing a binary tree where given the key of the root and a function the left and right children are defined up to a given depth that is an other parameter of the constructing function: `create_tree!(tree, key, func, depth)` . The most straightforward way in my opinion is to define tree as a `Vector{Union{Int,Nothing}` . This solution is clear, in my opinion. However in my case `func` is such that if `depth=15` around 97% of the nodes in `tree` happen to be `nothing` . Since I would like to avoid wasting memory and I need even deeper trees I am asking: is there a good way to implement such a binary tree in Julia? To give more context I can also add that `func` is such that is more probable that the left child of a given node is `nothing` with respect to the right child. In a language like C or C++ I would use pointer based construction of a tree. I suppose I could define a Julia type that acts like a pointer but I am wondering if there is a easier and more straightforward solution.

Iād probably use

``````mutable struct Node{T}
value::T
left::Union{Nothing, Node{T}}
right::Union{Nothing, Node{T}}
end
const Tree{T} = Union{Nothing, Node{T}}
``````
2 Likes

Agree with @goerchā answer, and we can do away with `Union{Nothing, Node{T}}` by incomplete initialization, which makes it acts more like a pointer thing.

``````mutable struct TreeNode{T}
value::T
parent::TreeNode{T}
left::TreeNode{T}
right::TreeNode{T}
function TreeNode{T}() where T
n = new{T}()
n.parent = n
n.left = n
n.right = n
n
end
TreeNode{T}(v, p, l, r) where T = new{T}(v, p, l, r)
end
``````

With this definition, some basic functions of the linked-binary-tree need to be redefined, such as `isroot(n::TreeNode) = n === n.parent`. I implemented a linked-quadtree type in my project. Things are similar. I hope it will help you.

3 Likes

Interesting. On one hand I can imagine that using `undef` that way could simplify type inference. OTOH you need to use `isassigned(p)` instead of `p !== nothing` and worse to reset `p` to `undef` youād need something like `unsafe_store!` as @jakobnissen pointed out recently on slack.

Depends on the application, as always.

If you donāt need to modify the tree structure after construction, Iām a big fan of the following layout. It is optimized for the case that you need to traverse left more often than right (following left typically hits the same cache-line that you already have loaded), so you may need to internally rename left-right. Only existing nodes are stored (no `nothing` stuff). There is no depth-bound, you simply have a bunch of nodes.

The structure is

``````struct Tree{T}
keys::Vector{T}
child_right::Vector{Int} #or Int32 or UInt32...
end
``````

The order is:

``````[root, L, LL, ...., LR, LRL, LRLL, ..., R, RL, ...]
``````

(in other words, attach to each node the āLā/āRā string corresponding to the path in the tree to reach the node; then sort lexicographically; for each node, store as metadata the offset to (or position of, depending on what you find more convenient) `string_label(node) * "R"` if that node exists in the tree and 0 otherwise)

Subtrees are always sequential in memory and their size is known without further searching. Searching for keys is always accessing memory in sequential ascending order, unless youāre e.g. searching for a best-match. When the remaining subtree is small enough, you can seamlessly switch to linear search.

It is quite useful in practice to also include parent pointers for convenience.

Structure is also good for easily parallelized in-place construction or computing and storing tree-stuff for every node (spawn a task for the right subtree if size above threshold; due to the memory layout you very rarely have slowdowns due to different cores writing to the same cacheline).

PS. You must identify subtrees by the pair `(subtree_root_index, subtree_upper_bound)`. The upper bound is computed during traversal from root. Otherwise you cannot distinguish the situations where a node has no children at all, or no right child but a nonempty left subtree.

If you include extra metadata (e.g. parent pointers, subtree_upper_bound, domain-specific info) then it is advisable to use something like

``````struct NodeMetadata
child_right::Int
#e.g. parent::Int
...
end

struct Tree{T}
data::Vector{T}
end
``````

for the sake of better memory access.

I advise against attempting to mix data and metadata ā that opens a big can of structure padding worms, makes SIMD harder and makes it harder to use domain-specific implementation for linear search on small subtrees.

4 Likes

Is this the vine data structure representing a tree traversal?

Itās original (in the sense that I once needed something like this and came up with it), but almost certainly not novel. Itās the kind of stuff that is easier to reinvent or ask discourse / stackoverflow / āoral communicationsā about than doing a literature search for.

If you happen to know a famous name for this layout, please share! A wiki link could save me some source code comments and/or pages of explanations in papers that use this.

The vine stuff you linked to looks different? The wiki talks about ārebalancingā, while I only gave a memory layout and metadata style for a given tree. How do build it depends on your domain ā there are very good domain-specific reasons for unbalanced trees. E.g. if your datapoints have probability mass then you want your tree to be balanced with respect to mass, not with respect to number-of-points. If your datapoints live in a metric space, you may want to balance your tree by diameter, not number-of-points (that was my domain). Either way you end up with a tree that is horribly unbalanced for good reasons and that must not be rebalanced according to naive ānumber of pointsā.

Hence the above layout takes some pains to not be worse than a flat array (no linked-list heap-vomit with large overheads in case of completely degenerate tree, modest metadata overhead, etc).

1 Like

Can you provide a link to `@jakobnissen`'s proposal?
In fact, I initialized the nodeās field with itself instead of leaving them `undef`ined. I wonder if there is any performance difference in accessing uninitialized fields. Or is there always a protective check, whether it is initialized or not?

Here is @jakobnissen 's example

``````julia> A = ["hello"]
1-element Vector{String}:
"hello"

julia> unsafe_store!(Ptr{UInt}(pointer(A, 1)), zero(UInt));

julia> A
1-element Vector{String}:
#undef
``````

You should make sure `Union{String, Nothing}` isnāt good enough first.

If you store it in a struct:

``````julia> struct Foo
x::Union{String, Nothing}
end

julia> sizeof(Foo)
8
``````

Then the union type takes no extra memory. Itās simply stored as a pointer either to the `Nothing` type (I think), or to the actual string.
If you use an array of the kind `Vector{Union{String, Nothing}}`, then the memory layout will be two dense arrays, one with pointers, and another byte array with `0x00` for `Nothing` and `0x01` for `String`, for a total of only 9 bytes per object, which is only 13% more than the 8 bytes a pure `String` array would take.

No need to! The code looked terrible enough for me to try to find another solutionā¦

Here we discuss if using uninitialized or partially initialized `struct`s could provide any benefits. Does your technique apply to `struct`s also? Iām nearly tempted to write a micro benchmark.

Yes, you can use `pointer_from_objref` to get a pointer to a mutable type, and for immutable types, you can put (probably?) them in a `Ref`, then take a pointer to the `Ref`. Again,

1 Like

Edit: I obviously have to use mutablesā¦

I tried this

``````mutable struct Example
el::String
Example() = new()
Example(el) = new(el)
end

B = Example("hello")
unsafe_store!(Ptr{UInt}(pointer_from_objref(B)), zero(UInt), 1);
println(B)
``````

yielding

``````Example(#undef)
``````

Thanks!

Interesting. IMHO the micro benchmark exactly shows the tradeoffs:

``````using BenchmarkTools

mutable struct Node1{T}
value::T
next::Union{Nothing, Node1{T}}
end
const List1{T} = Union{Nothing, Node1{T}}

function test1()
list = nothing
for i in 1:1000
list = Node1(i, list)
end
nodes = []
s = 0
while list !== nothing
s += list.value
push!(nodes, list)
list = list.next
end
for node in nodes
node.next = nothing
@assert node.next === nothing
end
s
end

mutable struct Node2{T}
value::T
next::Node2{T}
Node2{T}(value) where T = new{T}(value)
Node2{T}(value, next) where T = new{T}(value, next)
end
const List2{T} = Union{Nothing, Node2{T}}

function test2()
list = Node2{Int}(1)
for i in 2:1000
list = Node2{Int}(i, list)
end
nodes = []
s = 0
while list !== nothing
s += list.value
push!(nodes, list)
list = isdefined(list, :next) ? list.next : nothing
end
for node in nodes
ptr = convert(Ptr{UInt}, Base.pointer_from_objref(node))
unsafe_store!(ptr, zero(UInt), 2)
@assert !isdefined(node, :next)
end
s
end

res1 = test1()
res2 = test2()
@assert res1 == res2

@btime test1()
@btime test2()
``````

yielding

``````  104.300 Ī¼s (1006 allocations: 53.09 KiB)
24.600 Ī¼s (1006 allocations: 53.09 KiB)
``````

Which type annotations are needed to eliminate dynamic dispatch in the first test case?

Maybe it would be better to have a new function like `setasundef!` in julia `Base`.

I donāt get the benchmarking - what two, different, functions are you benchmarking?

Inserting into a list, sum the list entries and set all next pointers to nothing or undefined respectively. First data structure uses `Union{Nothing, Node1{T}}` as pointer to next element, second one uses `Node2{T}` with `undef` as null indicator.

Whoops, I didnāt see the code box could scroll down
Anyway, if you solve the type instability by specifying the type of the array, the union type is faster for me:

``````julia> include("foo.jl")
WARNING: replacing module Foo.
11.524 Ī¼s (1006 allocations: 53.09 KiB)
18.020 Ī¼s (1006 allocations: 53.09 KiB)
``````
1 Like

Yes, but this is partly the point: with `Union{Nothing,T}` you sometimes have to care about type instabilities, see Fix type instabilities in AVL, RB and splay tree by goerch Ā· Pull Request #766 Ā· JuliaCollections/DataStructures.jl Ā· GitHub for more examples.

1 Like