Best approach for runtime dispatching inside a hot loop (heterogeneous tree structure)

I am in the process of writing a ray tracing algorithm in Julia (not related to 3D rendering, but for physical simulations). The code is too complex and I have not been able to make a good MWE, but I will try to describe the problem as best as I can, and I would love it if people could comment on alternatives or their own experience solving similar problems.

  • I have a binary tree where the leaf nodes contain data of different types (i.e. triangles, circles, cylinders, etc). For each ray, the algorithm traverses the tree until it finds a leaf node where the ray hits and then I need to check whether the ray intersects the geometry or not.

  • Since each node may have a different type of geometry, the intersection method is selected at runtime. On my machine, runtime dispatch (on Julia 0.6) takes about 100 ns, which can be 10x slower than the actual time to calculate the intersection test. So this makes my algorithm at least an order of magnitude slower.

As alternatives to runtime dispatch I am considering:

  • Encode the type as a field and do the dispatch manually (if-else). This would force me to make one data structure fit all the geometry, which would waste memory. And the code would get really C-ish ugly…

  • Using a union type. On 0.6 this did not help, but I think it will get faster on 0.7; will it be as fast as the approach above, though? I know there was a comparison some time ago benchmarking on an array (can't find the post, sorry), but that seems to me to be a different problem from tree traversal in ray tracing. (A rough sketch of what I mean is below.)
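
To be concrete, this is roughly the union version I have in mind (the geometry types and field names are placeholders, not my real code):

struct Sphere
  r::Float64
end

struct Triangle
  a::Float64
end

intersects(s::Sphere, t::Float64) = t < s.r
intersects(tri::Triangle, t::Float64) = t < tri.a

struct UnionLeaf
  geometry::Union{Sphere, Triangle}  # small Union instead of an abstract type
end

hit(leaf::UnionLeaf, t::Float64) = intersects(leaf.geometry, t)

My understanding is that on 0.7 the compiler can turn the call on the small Union field into a branch over the concrete types (union splitting) instead of a full dynamic dispatch.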

I have tried to make my algorithm type stable despite having nodes of different types, but did not manage to. Has anyone found a good way to do this? (I know about FunctionWrappers.jl and TypeSortedCollections.jl, but I don't see how they would fit this particular data structure and algorithm.)

I would love to hear about similar problems others may have solved, or whether there is a general recommendation for dealing with heterogeneous tree structures (not arrays!) in Julia :slight_smile:

Thanks in advance!

4 Likes

This discussion may be of interest.

I haven’t played around with union dispatching in 0.7 yet. On an 8-day old master:

julia> using BenchmarkTools, Random

julia> x = randn(20);

julia> u = Vector{Union{Float64, Float32}}(x);

julia> @benchmark exp($x[1])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.207 ns (0.00% GC)
  median time:      4.228 ns (0.00% GC)
  mean time:        4.286 ns (0.00% GC)
  maximum time:     19.357 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark exp($u[1])
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     24.061 ns (0.00% GC)
  median time:      30.399 ns (0.00% GC)
  mean time:        36.888 ns (15.74% GC)
  maximum time:     38.634 μs (99.92% GC)
  --------------
  samples:          10000
  evals/sample:     996


(f here is defined the same way as in the 0.6.2 session below.)

julia> @benchmark f($x[1])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.212 ns (0.00% GC)
  median time:      1.213 ns (0.00% GC)
  mean time:        1.221 ns (0.00% GC)
  maximum time:     14.737 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark f($u[1])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.077 ns (0.00% GC)
  median time:      4.107 ns (0.00% GC)
  mean time:        4.107 ns (0.00% GC)
  maximum time:     19.216 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Versus 0.6.2:

julia> using BenchmarkTools, Random

julia> x = randn(20);

julia> u = Vector{Union{Float64, Float32}}(x);

julia> @benchmark exp($x[1])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.582 ns (0.00% GC)
  median time:      7.753 ns (0.00% GC)
  mean time:        7.932 ns (0.00% GC)
  maximum time:     20.750 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark exp($u[1])
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     25.851 ns (0.00% GC)
  median time:      26.194 ns (0.00% GC)
  mean time:        27.828 ns (2.21% GC)
  maximum time:     965.944 ns (93.62% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> f(x::Float64) = 2x
f (generic function with 1 method)

julia> f(x::Float32) = 2+x
f (generic function with 2 methods)

julia> @benchmark f($x[1])
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.442 ns (0.00% GC)
  median time:      1.463 ns (0.00% GC)
  mean time:        1.466 ns (0.00% GC)
  maximum time:     16.591 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark f($u[1])
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     11.675 ns (0.00% GC)
  median time:      12.719 ns (0.00% GC)
  mean time:        13.708 ns (4.71% GC)
  maximum time:     999.245 ns (95.45% GC)
  --------------
  samples:          10000
  evals/sample:     998

For comparison, the cost of an if statement is close to 1 ns. With if statements you also don't have to worry about squashing the type instability that can result from dynamic dispatches.

If there’s some sort of pattern that lets you use Base.Cartesian.@nif, you could still be relatively concise with the control flow.
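
For example, something roughly like this (the tag and kernel names are made up for illustration; @nif expands at parse time into an if/elseif chain, so each branch ends up as a statically dispatched call):

hit_sphere(x) = x + 1.0
hit_box(x) = x + 2.0
hit_cylinder(x) = x + 3.0

const KERNELS = (hit_sphere, hit_box, hit_cylinder)

# kind is an integer tag in 1:3; because @nif splices a literal index into each
# branch, KERNELS[d] resolves to a concrete function at compile time.
intersect_prim(kind::Int, x::Float64) =
    Base.Cartesian.@nif 3 d->(kind == d) d->(KERNELS[d](x))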

Rather than relying on run-time dispatch, this could be an excellent use case for FunctionWrappers.jl (https://github.com/yuyichao/FunctionWrappers.jl). Instead of dispatching on the object itself, you can just store a wrapper to the already-dispatched raycast method. For example:

struct Sphere
  r::Float64
end

struct Box
  s::Float64
end

function raycast(sphere::Sphere, point::Float64)
  sphere.r - point
end

function raycast(box::Box, point::Float64)
  box.s + point
end

geometries = [Sphere(1.0), Box(2.0)]

We can make a closure p -> raycast(g, p) for each geometry g and then store that closure as a function wrapper:

using FunctionWrappers: FunctionWrapper

raycast_handles = [
    FunctionWrapper{Float64, Tuple{Float64}}(p -> raycast(g, p)) 
    for g in geometries
]

Now each element in raycast_handles is of the same type (FunctionWrapper{Float64, Tuple{Float64}}) but is still bound to its particular geometry. Calling the wrapped method should not incur any dynamic dispatch:

(tested in v0.6.2)

julia> function raycast_scene(handles, point)
         sum(h -> h(point), handles)
       end
raycast_scene (generic function with 1 method)

julia> raycast_scene(raycast_handles, 2.0)
3.0

julia> using BenchmarkTools

julia> @btime raycast_scene($raycast_handles, 2.0)
  20.159 ns (0 allocations: 0 bytes)
3.0

In your case, you can just store a function wrapper at each leaf of the tree. You can also store the geometry in an abstract field, since just having that field won’t create any dynamic dispatch unless your code actually uses it (and a little dynamic dispatch outside of the main loop is fine):

struct Leaf
  raycast_handle::FunctionWrapper{Float64, Tuple{Float64}}  # presumably not the actual input/output types you want
  geometry::AbstractGeometry
end

I also have the (unregistered) Interfaces.jl package (demo: https://github.com/rdeits/Interfaces.jl/blob/master/demo.ipynb), which might make this pattern easier.

3 Likes

OK, thanks. I will explore this option. I see that FunctionWrappers stores a Ptr{Void}. Is it doing something analogous to casting a void* to the right type? I ask because that is an approach I have seen in other ray tracers.

I am a little confused about your Leaf example: the function wrapper will store the data captured in the closure, right? So there is no need to store the geometry separately… or did I misunderstand something there?

You’re correct that storing the geometry is not necessary for the function wrapper to work. I only suggest doing it because you might want to access the geometry for some other purpose at some point.

I created MultiScaleArrays.jl to be able to put a representation of biological organisms directly into the ODE and SDE solvers.

It's a heterogeneous tree structure and it works quite well, though I don't put functions inside of it. The one thing to note is that linear algebra is so much faster when you avoid the type that you will want a contiguous cache array to write into first (DiffEq has this internally if the type is sufficiently weird, don't even ask how that's done…). But with this type I've noticed that broadcast (which works recursively) can be (and usually is) faster than broadcast on the contiguous arrays, something I chalk up to cache efficiency, but I could be wrong. So it comes out as quite flexible yet performant if the algorithm makes use of broadcast and pre-caching whenever possible.

Maybe that repo’s ideas could help somewhere.

2 Likes

I think function wrappers are essentially storing a C function pointer (which you could cast to and from void*), which is in turn essentially how a virtual method in C++ is implemented under the hood.

1 Like

Let me just mention the following: if you decide to use if statements, the compiler has an optimization to recognize the following pattern

   if key == 1 
     ...
   elseif key == 2
     ...
   elseif key == 3
     ...
   else
     ...
   end

and transforms it into a jump table in the generated machine code, so that the running time is independent of the number of cases. Unfortunately, this optimization is broken on some platforms in 0.6 (i.e., erroneous machine code is emitted) because of an LLVM bug, but in an earlier Discourse discussion Yichao Yu posted instructions on how to patch your 0.6 for this particular bug.

Regarding the heterogeneous tree structure: it may be helpful to store your nodes in an array and the links as offsets into this array. I am assuming from your question that the handling of the shapes dominates your total time (as compared to walking the tree)?

I am asking because the tree may not end up very cache-friendly, depending on allocation order.

And if you choose flat storage then the entire thing changes: you would likely store a manual node-type selector in the array and have several homogeneous extra arrays for your various shapes and non-bitstype data. In other words, you would need to do manual dispatch already just to extract the shape data, and hence would never encounter Julia's runtime dispatch.

1 Like

Let me give some more info, as the algorithm may not be that clear, sorry. At the moment all my nodes are just:

struct Node
  box::AABB
  leaf::Bool
  prims0::Int 
  prims1::Int 
end

Traversal into children is based on the arity of the tree (assuming a flattened, dense n-ary tree, binary by default) and on whether the ray intersects the AABB of the current node or not.

A leaf node may contain several primitives (of different types), partly to make it more cache friendly but also because of some cost heuristics. At the moment they are stored contiguously (prims0 and prims1 index into an array of geometry objects). Considering this, if I store the different types of geometry in different arrays, wouldn't that make the algorithm less cache-friendly and the traversal much more complex?

Wow cool, thanks. For reference, this is the post: Match LLVM with Julia source? - #3 by Stephen_Vavasis.

I may be reading that too literally, but assuming there are quite a few rays, doesn't that make for a lot of tree traversals? Kernel tricks and wave-tracing etc. ideas aside, there must be a way to trace a small bunch of rays at once?


Added: I guess it depends on the ray origins, but for a single point of origin with many rays you might do something like:

  • map all geometry coords to unit vectors from the origin, pick all the closest (unmapped) z distances and non-overlapping points on the unit sphere (though this may get a bit wonky for some shapes)
  • try to dispatch a bunch of rays at once to that geometry type?

Yes, sorry I did not want to go into too much detail. If the rays are coherent (similar origin and direction) I will use a packet, though there seem to be different ways to implement that and I am going through the details.

In my simulations I start by shooting rays from a source of irradiance and I can package these primary rays into coherent packets. But as soon as they hit a geometry, they get reflected/transmitted and the system becomes quite incoherent, so I am not sure packing will help there (instead I am looking into multi-BVH and trying to vectorize (SIMD) the intersection tests during traversal).

Most of the time the ray sources I need are directional (different origin, same direction) so probably transforming all geometry for each ray would be too expensive.

1 Like

Yeah, it would be something like O(n^2) for n rays and n points?
You could do something a bit similar though: a 3D space rotation that could similarly be flattened to 2D, if all rays are moving in the exact same direction?
I'm probably a bit out of my depth though; I'll have to read about multi-BVH.

First of all, the change would not affect traversal at all, only processing of the primitives.

So your Node is a bits type, the nodes are stored in an array, and the tree structure is encoded in the array positions? (I have not seen the definition of AABB.)

What tree-layout did you choose? How good is your locality at the leaves, anyway?

If your array of primitives stores them boxed (i.e. it is a pointer array), then splitting them into homogeneous arrays will unambiguously improve your spatial locality, because boxed objects get splattered all over the heap. In this case you should also run a test once where you deepcopy your tree and primitives and see whether it speeds up your search (deepcopy has the side effect of giving a nice memory layout to objects splattered over the heap). If they are a bits-union then it might be a wash.

So storing the different primitives in separate homogeneous arrays looks like it would make sense. If you don't want to increase your node size with the selector, you can probably save some space by making your offsets prims0, prims1 Int32 and putting an Int16 selector after the leaf field (you should have structure padding there). You could also consider appropriating the first bit of one of the prims fields for storing whether you have a leaf (I am assuming that prims0 is not supposed to be negative).

The only problem is that your code might become ugly, especially if you have many primitive types. It may be possible to do something with macros, but I’m too bad at meta-programming to give you tips.

1 Like

Yes to all, AABB is two SVector{3, Float32}.

Not sure what you mean by layout though. I am trying binary trees, quadtrees and octrees because I have read the latter can improve locality if they are stored contiguously (which I do), and it may even be possible to use SIMD for the ray-AABB intersection. I just assume the tree is dense, so the layout is a flattened tree (breadth-first, so that child nodes are stored sequentially) and a simple formula suffices for traversal (e.g. for a binary tree the children of node i are 2i + 1 and 2i + 2, assuming the root node is at index 0).
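
In code the traversal is roughly this (Node is the struct I posted above; hits_aabb and intersect_leaf just stand in for my actual intersection routines):

children(i::Int) = (2i + 1, 2i + 2)  # 0-based node indices

function traverse(nodes::Vector{Node}, ray, i::Int = 0)
  node = nodes[i + 1]                # the array itself is 1-based
  hits_aabb(ray, node.box) || return nothing
  if node.leaf
    intersect_leaf(node, ray)        # test the primitives in prims0:prims1
  else
    for c in children(i)
      traverse(nodes, ray, c)
    end
  end
  return nothing
end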

Sorry, my knowledge is too limited on this approach, so I am having problems understanding the concept of a selector. Could you give a reference for the concept so that I can learn more about it? Thanks!

Yes, I know that a whole Bool is a waste, but I still need to learn how to do bit twiddling… The main reason I chose Julia rather than C++ is to stay abstract and beautiful while also being fast, but maybe I want too much :frowning:

1 Like

Re: TypeSortedCollections.jl (I'm the author): it might indeed not be the best fit for this application. TypeSortedCollections are really meant for storing a number of elements of different types and then processing all of them in one go (using foreach, mapreduce, map!, or broadcast!).
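
For reference, the basic usage pattern looks something like this (reusing the Sphere/Box raycast example from earlier in the thread; see the README for the exact API):

using TypeSortedCollections

geometries = [Sphere(1.0), Box(2.0), Sphere(3.0)]
sorted_geometries = TypeSortedCollection(geometries)  # groups the elements by concrete type

# foreach runs one statically dispatched inner loop per concrete type:
foreach(g -> println(raycast(g, 2.0)), sorted_geometries)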

In RigidBodyDynamics.jl, my solution before TypeSortedCollections was basically the one that Stephen_Vavasis proposed, but with a macro to automate the process and with isa checks as the conditions. Since this is very similar to run-time type information dispatch in C++, I called the macro @rtti_dispatch:

https://github.com/tkoolen/RigidBodyDynamics.jl/blob/dbeece208cc21b4cde8b66bac179e8e13137f8e5/src/util.jl#L37-L53

The way you would use it was:
https://github.com/tkoolen/RigidBodyDynamics.jl/blob/dbeece208cc21b4cde8b66bac179e8e13137f8e5/src/joint.jl#L37

so you’d give it a tuple of types to check for in advance (a major drawback of this approach). Using @rtti_dispatch resulted in type-stable code and decent performance, but switching to TypeSortedCollections was needed to not have a limited number of blessed joint types and to further improve performance. But yeah, it required a bit of thinking to cast every operation as a foreach/map!/mapreduce/broadcast!.
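
In spirit, the generated code is just the hand-written version below (a sketch using the Sphere/Box types from earlier in the thread; the real macro builds the branches from the tuple of types for you):

function raycast_rtti(geometry, point::Float64)
  if geometry isa Sphere
    raycast(geometry, point)  # statically dispatched: geometry is a Sphere here
  elseif geometry isa Box
    raycast(geometry, point)  # statically dispatched: geometry is a Box here
  else
    error("unhandled geometry type: $(typeof(geometry))")
  end
end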

For isbits types @rtti_dispatch is probably completely redundant in 0.7.

3 Likes

I just meant that you would do something like:

const SPHERE_T = Int16(0)
const CYLINDER_T = Int16(1)
...
struct Node
  box::AABB
  leaf::Bool
  prim_type::Int16
  prims0::Int32 
  prims1::Int32 
end

function raycast(node::Node, meta_struct, args...)
  ...
  if node.prim_type == SPHERE_T
    for i = node.prims0:node.prims1
      do_something(meta_struct.spheres[i])
    end
  elseif node.prim_type == CYLINDER_T
    ...
  else
    error("wtf, unknown primitive type")
  end
  ...
end

Depending on what other info you want, you might also use flags (instead of a by-hand enum), or maybe an Int8 is enough (but this should not matter due to alignment requirements).

Or, if you prefer Julia enums:

@enum Prim_type::Int16 SPHERE_T=0 CYLINDER_T=1 ...
...
prim_type::Prim_type
...

Bit twiddling is possible in Julia, a bit more difficult than in C/C++ at times, but there are so many advantages to using Julia for the rest that I find it worth the effort. :grinning:
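
For example, the leaf-bit trick mentioned above could look something like this (a sketch with made-up helper names, assuming prims0 is never negative):

# store the leaf flag in the sign bit of prims0, so the Bool field can be dropped
pack_prims0(prims0::Int32, isleaf::Bool) = ifelse(isleaf, prims0 | (Int32(1) << 31), prims0)

is_leaf(packed::Int32)    = packed < 0                # sign bit set means leaf
get_prims0(packed::Int32) = packed & typemax(Int32)   # clear the sign bit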