Efficiently traverse binary tree represented as tuple of tuples

I’m trying to see if I can squeeze out a bit more performance from the
Julia implementation of the Computer Language Benchmarks Game
binary-trees test. One thing I noticed was that I could create binary
trees very fast if I represent them as nested tuples:

function make(::Val{n}) where n
    if n === 1
        ((), ())
    else
        (make(Val(n-1)), make(Val(n-1)))
    end#if
end#function

But I can’t seem to come up with a function that traverses this
structure efficiently once compile time is included (the Computer
Language Benchmarks Game includes JIT warmup in its timings, which for
Julia means compilation counts). If I define the traversal
function the naive way:

check(node) = @inbounds 1 + check(node[1]) + check(node[2])
check(::Tuple{}) = 1

The function takes a very long time on first use (longer than the
entire test currently takes to complete) but is efficient thereafter. If
I toss a @nospecialize in, it takes about a full second on each
invocation on larger trees, which is unacceptable.

Any ideas for efficiently traversing this tree without having to take
ages to compile every Tuple{Tuple{Tuple{Tuple type signature? I
don’t think there’s a way, but I thought I’d see if anyone in the
community had ideas.
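One direction I’ve considered but haven’t benchmarked against the real workload (so treat this as a sketch): traverse with an explicit `Any`-typed stack and `@nospecialize`, so only one method instance ever gets compiled, at the cost of dynamic dispatch per node:

```julia
# Sketch (untested on the actual benchmark): avoid recursive specialization
# by walking the tree with an explicit stack of `Any`-typed nodes.
# `@nospecialize` keeps Julia from compiling a fresh method instance per
# nested tuple type; the price is dynamic dispatch on every pop and index.
function make(::Val{n}) where n    # same definition as above
    n === 1 ? ((), ()) : (make(Val(n-1)), make(Val(n-1)))
end

function check_iter(@nospecialize(root))
    stack = Any[root]
    count = 0
    while !isempty(stack)
        node = pop!(stack)
        count += 1
        if node !== ()             # empty tuples are the leaves
            push!(stack, node[1])
            push!(stack, node[2])
        end
    end
    count
end
```

`check_iter(make(Val(5)))` agrees with `check`’s 63 for a depth-5 tree; whether the per-node dispatch overhead beats the compile-time savings is exactly the trade-off in question.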

Probably what I really need to do is get the maintainer to include
PackageCompiler in his default Julia build-chain.

Also does anyone know why sizeof(make(Val(20))) returns 0?

The output of make contains no runtime information; all ‘data’ is statically-known type information.
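A minimal way to confirm this in the REPL:

```julia
t = ((), ())            # same value as make(Val(1))
@assert sizeof(t) == 0  # zero bytes of runtime data; the tree lives in the type
@assert isbits(t)       # immutable and pointer-free
```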

This is almost definitely not allowed in light of the work description:

The work is to fully create perfect binary trees - before any tree nodes are GC’d - using at-minimum the number of allocations of Jeremy Zerfas’s C program. Don’t optimize away the work.

(emphasis mine). You’d basically be having the compiler compute the answer and store it in the compiled program.

For this case, I’m thinking that it wouldn’t be an issue. The compiler
still has to store the types of the tuples somewhere, which for
this purpose is equivalent to storing the tuple
itself. Tuple{Tuple{Tuple... still has to be stored in memory
somewhere I think. And the invocation of check on each binary tree
means the compiler can’t simply elide creating anything.

julia> @code_native make(Val(5))
        .text
; ┌ @ REPL[1]:2 within `make'
        movq    %rsi, -8(%rsp)
; │ @ REPL[1]:5 within `make'
        movabsq $140550904238320, %rax  # imm = 0x7FD48EB850F0
        retq
; └

julia> @code_native check(make(Val(5)))
        .text
; ┌ @ REPL[4]:1 within `check'
        movl    $63, %eax
        retq
        nopw    %cs:(%rax,%rax)
; └

Ah. Hrm.


Not sure if this is relevant, but I used to store a tree as a nested tuple for some internal routine in TensorOperations.jl (in that case the nodes had a value, a simple Int, so the tree was something like (1, ((2,3), 4))). However, this created a huge slowdown for larger trees:

The new version uses a completely untyped node type

struct BinaryTreeNode
    left
    right
end

which seems to perform significantly better for large trees.


I’m getting a 10% speedup versus Union{Node,Nothing} fields. Very odd. Thanks!

Interestingly a simplified benchmark does not bear this out (was going
to file an issue on the Julia repo). Although my total runtime is
shorter with the untyped fields in the benchmarks game benchmark,
traversing the tree recursively seems to be slower when replicated
with untyped fields:

julia> struct NodeA
           l::Union{NodeA,Nothing}
           r::Union{NodeA,Nothing}
       end

julia> struct NodeB
           l
           r
       end

julia> make(n, T) = n == 1 ? T(nothing, nothing) : T(make(n-1, T), make(n-1, T))
make (generic function with 1 method)

julia> countnodes(node) = node === nothing ? 1 : 1 + countnodes(node.l) + countnodes(node.r)
countnodes (generic function with 1 method)

julia> using BenchmarkTools

julia> @benchmark make(21, NodeA)
BenchmarkTools.Trial: 
  memory estimate:  64.00 MiB
  allocs estimate:  2097151
  --------------
  minimum time:     349.933 ms (0.00% GC)
  median time:      358.420 ms (0.00% GC)
  mean time:        437.984 ms (19.46% GC)
  maximum time:     688.123 ms (51.11% GC)
  --------------
  samples:          12
  evals/sample:     1

julia> @benchmark make(21, NodeB)
BenchmarkTools.Trial: 
  memory estimate:  64.00 MiB
  allocs estimate:  2097151
  --------------
  minimum time:     389.716 ms (0.00% GC)
  median time:      394.468 ms (0.00% GC)
  mean time:        475.543 ms (17.67% GC)
  maximum time:     725.103 ms (48.57% GC)
  --------------
  samples:          12
  evals/sample:     1

julia> tree_a = make(21, NodeA);

julia> tree_b = make(21, NodeB);

julia> @benchmark countnodes($tree_a)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     24.632 ms (0.00% GC)
  median time:      25.061 ms (0.00% GC)
  mean time:        25.345 ms (0.00% GC)
  maximum time:     33.658 ms (0.00% GC)
  --------------
  samples:          198
  evals/sample:     1

julia> @benchmark countnodes($tree_b)
BenchmarkTools.Trial: 
  memory estimate:  127.97 KiB
  allocs estimate:  8190
  --------------
  minimum time:     95.624 ms (0.00% GC)
  median time:      97.027 ms (0.00% GC)
  mean time:        98.077 ms (0.00% GC)
  maximum time:     109.683 ms (0.00% GC)
  --------------
  samples:          52
  evals/sample:     1
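My guess at the mechanism (a sketch using the same struct definitions as above): the difference comes from what the compiler knows about the fields. With untyped fields every `node.l` access infers as `Any` and each recursive call is a dynamic dispatch, while the `Union` annotation lets inference split the `nothing`/node cases:

```julia
struct NodeA
    l::Union{NodeA, Nothing}
    r::Union{NodeA, Nothing}
end

struct NodeB
    l
    r
end

# Inference sees only the declared field types: a small Union it can
# branch on at compile time, versus Any, which forces dynamic dispatch
# (and boxing) on every recursive call during traversal.
@assert fieldtype(NodeA, :l) === Union{NodeA, Nothing}
@assert fieldtype(NodeB, :l) === Any
```

Running `@code_warntype countnodes(tree_b)` shows the `Any`-typed field access, which would account for the extra allocations in the NodeB traversal above.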

No real detectable difference in time to construct the tree
though. This result is consistent with even larger trees:

julia> tree_a = make(25, NodeA);

julia> tree_b = make(25, NodeB);

julia> @benchmark countnodes($tree_a)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     416.255 ms (0.00% GC)
  median time:      428.381 ms (0.00% GC)
  mean time:        436.403 ms (0.00% GC)
  maximum time:     498.951 ms (0.00% GC)
  --------------
  samples:          12
  evals/sample:     1

julia> @benchmark countnodes($tree_b)
BenchmarkTools.Trial: 
  memory estimate:  2.00 MiB
  allocs estimate:  131070
  --------------
  minimum time:     1.585 s (0.00% GC)
  median time:      1.592 s (0.00% GC)
  mean time:        1.662 s (0.00% GC)
  maximum time:     1.881 s (0.00% GC)
  --------------
  samples:          4
  evals/sample:     1

And just for the heck of it, smaller trees:

julia> tree_b = make(12, NodeB);

julia> tree_a = make(12, NodeA);

julia> @benchmark countnodes($tree_a)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.210 μs (0.00% GC)
  median time:      14.282 μs (0.00% GC)
  mean time:        14.486 μs (0.00% GC)
  maximum time:     54.675 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark countnodes($tree_b)
BenchmarkTools.Trial: 
  memory estimate:  224 bytes
  allocs estimate:  14
  --------------
  minimum time:     138.993 μs (0.00% GC)
  median time:      139.330 μs (0.00% GC)
  mean time:        155.135 μs (0.00% GC)
  maximum time:     553.816 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

I’m getting quite confused about where the observed speedup is coming
from. Maybe there’s some other difference I’m not seeing in the code I
submitted above compared to the earlier submissions.

EDIT: Aaand it turns out the type annotations were a red herring. With
nothing else changed but the type annotations added back in, I
saw the following when benchmarking the whole program. The type
annotations improve rather than degrade performance, as expected.

❮ hyperfine --warmup 1 'julia -p 3 -O3 binary-trees-ab-untyped-struct.jl 21 > myoutput21.txt'
Benchmark #1: julia -p 3 -O3 binary-trees-ab-untyped-struct.jl 21 > myoutput21.txt
  Time (mean ± σ):     19.054 s ±  3.582 s    [User: 37.494 s, System: 1.082 s]
  Range (min … max):   17.351 s … 29.153 s    10 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 

~/repos/benchmark-game/binary-trees master* ⇡ 3m 29s
❯ hyperfine --warmup 1 'julia -p 3 -O3 binary-trees-ab-retyped-struct.jl 21 > myoutput21.txt'
Benchmark #1: julia -p 3 -O3 binary-trees-ab-retyped-struct.jl 21 > myoutput21.txt
  Time (mean ± σ):     13.234 s ±  0.758 s    [User: 28.711 s, System: 1.082 s]
  Range (min … max):   12.736 s … 15.224 s    10 runs

Something else I changed caused the speedup; not the untyped fields.

Interesting. I find results that are similar to Jutho’s. I have an implementation of Sedgewick’s left-leaning red-black trees (which is recursive). If I type the left and right node pointers I see a 4X slowdown in creating a tree with 1M integer keys. The cost of querying all keys is about the same for the two versions. To be clear, the only change between the two versions is the type of the left and right fields of the node struct. I did not investigate further because the strategy of allocating each node separately is a loser.

# using untyped left, right

julia> ivec = rand(-30000000:30000000, 1000000);

julia> @benchmark make_tree($ivec)
BenchmarkTools.Trial: 
  memory estimate:  53.03 MiB
  allocs estimate:  991744
  --------------
  minimum time:     4.187 s (1.87% GC)
  median time:      4.333 s (2.89% GC)
  mean time:        4.333 s (2.89% GC)
  maximum time:     4.479 s (3.84% GC)
  --------------
  samples:          2
  evals/sample:     1

julia> tree = make_tree(ivec);


julia> @benchmark find_all_keys($tree, $ivec)
BenchmarkTools.Trial: 
  memory estimate:  1.26 GiB
  allocs estimate:  83484932
  --------------
  minimum time:     3.798 s (0.49% GC)
  median time:      3.804 s (0.50% GC)
  mean time:        3.804 s (0.50% GC)
  maximum time:     3.809 s (0.50% GC)
  --------------
  samples:          2
  evals/sample:     1



# using Union{LLRBTreeNode, Nothing} for the types of left, right

julia> @benchmark make_tree($ivec)
BenchmarkTools.Trial: 
  memory estimate:  330.67 MiB
  allocs estimate:  19187151
  --------------
  minimum time:     18.779 s (2.50% GC)
  median time:      18.779 s (2.50% GC)
  mean time:        18.779 s (2.50% GC)
  maximum time:     18.779 s (2.50% GC)
  --------------
  samples:          1
  evals/sample:     1

julia> tree = make_tree(ivec);

julia> @benchmark find_all_keys($tree, $ivec)
BenchmarkTools.Trial: 
  memory estimate:  1.28 GiB
  allocs estimate:  84970712
  --------------
  minimum time:     3.871 s (0.47% GC)
  median time:      3.884 s (0.46% GC)
  mean time:        3.884 s (0.46% GC)
  maximum time:     3.896 s (0.45% GC)
  --------------
  samples:          2
  evals/sample:     1