# How to write type-stable dataflow graphs

Iβm trying to implement a dataflow graph for basic arithmetic and encountering type instability for all but trivially simple graphs. How to rewrite this in a type-stable way?

``````julia
module Inst # for "type-instability"

import Base.:*

abstract type Op end # trait; indicates the arithmetic operation performed
struct Sum <: Op end
struct Prod <: Op end

struct Node{S<:Op, C} # a node in the graph; C is the type of the children
    children::Vector{C}
end
Node(T::Type{S}, children::Vector{C}) where {S<:Op, C} = Node{T, C}(children)
*(g::Node{S, C}, xx) where {S, C} = Node{S, C}([xx * tt for tt in g.children])
*(xx, g::Node) = *(g, xx)

# apply executes the operation
apply(g::Node{S, C}) where {S<:Op, C} = apply(S(), g)
apply(::Sum, g) = sum(apply.(g.children))
apply(::Prod, g) = prod(apply.(g.children)) # recurse apply into children
apply(x::T) where {T<:Number} = x # evaluate leaf nodes of the graph
end

using .Inst
# verify correctness of implementation...
s = Inst.Node(Inst.Sum, [1, 2])
@show Inst.apply(s)
p = Inst.Node(Inst.Prod, [3, 4])
@show Inst.apply(p)
g = Inst.Node(Inst.Prod, [s, p])
@show Inst.apply(g)

@code_warntype Inst.apply(s) # type stable
@code_warntype Inst.apply(p) # type stable
julia> @code_warntype Inst.apply(g) #<-- why is there type instability here?
Variables
  #self#::Core.Compiler.Const(Main.Inst.apply, false)
  g::Main.Inst.Node{Main.Inst.Prod,Main.Inst.Node{S,Int64} where S<:Main.Inst.Op}

Body::Any
1 ─ %1 = ($(Expr(:static_parameter, 1)))()::Core.Compiler.Const(Main.Inst.Prod(), false)
│   %2 = Main.Inst.apply(%1, g)::Any
└──      return %2

``````

Profiling shows that `apply(g)` is much slower, presumably due to type instability.

``````julia
julia> using BenchmarkTools

julia> @btime Inst.apply(s) ;
  66.834 ns (1 allocation: 96 bytes)

julia> @btime Inst.apply(g) ;
  251.530 ns (5 allocations: 320 bytes)

``````

Can you say more? It seems like your design is to use separate types for things like `Sum` and `Prod` that presumably vary a lot at runtime, so it seems like your intention is to incur lots of runtime-dispatch costs. What's the type instability that you're focused on getting rid of?


Thanks for having a look. I've updated the post above with `@btime` results for each call, and the output of `@code_warntype`. I'm simply trying to speed up the call to `apply(g)`.

You seem to be suggesting that this problem is inherently type-unstable? But doesn't Julia solve this problem when it compiles regular nested arithmetic expressions, such as `x = 3*(4+2) + 7`?

I think you want to remove the type-based dispatch if you want this to be faster: instead of something like `Op`, `Sum`, and `Prod` as distinct types, you might make a single type that uses an enum to capture the operation and branch on that.

I donβt quite understand what you mean here? Are you implicitly assuming that Juliaβs internal representation of arithmetic looks like what youβre doing?


Fascinating. I'll give it a try. Somehow I had adopted the view that type-based dispatch always means faster runtime performance (because it enables the interpreter to avoid method lookup). A misunderstanding on my part? Thanks for your help.

That seems like a misunderstanding. I believe the right rule is something closer to: if you can do X at compile time, you pay the cost at compile time and not at run time; if you can't do X at compile time, you pay the cost at run time. The cost of runtime type dispatch is likely higher than the cost of a simple three-way branch on an integer code, although the details of the dispatch matter a lot: [ANN] ManualDispatch.jl
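To make that rule concrete, here is a minimal sketch of the tradeoff (the names are illustrative, not taken from ManualDispatch.jl): the same three-way choice written once as runtime type dispatch over an abstractly typed container, and once as a branch on an integer code.

```julia
# Illustrative sketch: runtime type dispatch vs. a value branch.
abstract type AbstractOp end
struct SumOp <: AbstractOp end
struct ProdOp <: AbstractOp end

run_op(::SumOp, xs) = sum(xs)
run_op(::ProdOp, xs) = prod(xs)

# Abstract element type: each `run_op` call below needs a runtime method lookup,
# because the compiler cannot know which concrete op each element is.
ops = AbstractOp[SumOp(), ProdOp()]
dispatched = [run_op(op, [1, 2, 4]) for op in ops]       # [7, 8]

# Value-coded version: the method lookup becomes one integer comparison.
run_op(code::Int, xs) = code == 1 ? sum(xs) : prod(xs)
branched = [run_op(code, [1, 2, 4]) for code in (1, 2)]  # [7, 8]
```

Both compute the same results; the difference is only where the "which operation?" question gets answered.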

In addition, I think method lookup and type-based dispatch are the same thing under most definitions of those terms.


This can work. A nice exposition of the problem and this approach (or others) would be useful.


Yes, commenting here made me realize I'm not on top of the landscape of options anymore, or their relative performance.

I tried this, but it didn't help as much as I hoped. Although it removed the `Any`s in `@code_warntype`, run time and memory allocation are similar to the original implementation. As @johnmyleswhite suggested, the new implementation dispatches based on the value of an enum:

``````julia
module Inst # for "type-instability"

@enum Op Sum Prod

struct Node{C} # a node in the graph; C is the type of the children
    op::Op
    children::Vector{C}
end

# apply executes the operation
apply(x::T) where {T<:Number} = x # evaluate leaf nodes of the graph
op(g::Node) = g.op
apply(g::Node{C}) where {C} = apply(op(g), g)
# branch on the enum value, recursing apply into the children
apply(o::Op, g) = o === Sum ? sum(apply.(g.children)) : prod(apply.(g.children))
end

using .Inst
# verify correctness of implementation...
s = Inst.Node(Inst.Sum, [1, 2])
@show Inst.apply(s)
p = Inst.Node(Inst.Prod, [3, 4])
@show Inst.apply(p)
g = Inst.Node(Inst.Prod, [s, p])
@show Inst.apply(g)

using BenchmarkTools
@btime Inst.apply($s);
@btime Inst.apply($g);
``````

gives

``````julia
  44.344 ns (1 allocation: 96 bytes)
  135.021 ns (3 allocations: 288 bytes)
``````

There is a modest improvement for `apply($g)`, but not as much as I hoped; substantial memory allocation also remains. Compare with the original implementation for these two operations, which I've transcribed from above.

``````julia
julia> @btime Inst.apply(s) ;
  66.834 ns (1 allocation: 96 bytes)

julia> @btime Inst.apply(g) ;
  251.530 ns (5 allocations: 320 bytes)
``````

What am I doing wrong? Thanks!

Iβm a bit pressed for time to fully debug your issues, but hereβs an example of an approach without the concerns youβre raising:

``````julia
@enum Op identity_op sum_op prod_op

struct Node
    op::Op
    direct_value::Int
    children::Vector{Node}
end

function apply(n::Node)
    if n.op === identity_op
        n.direct_value
    elseif n.op === sum_op
        sum((apply(c) for c in n.children))
    else
        prod((apply(c) for c in n.children))
    end
end

s = Node(
    sum_op,
    0,
    Node[
        Node(identity_op, 1, Node[]),
        Node(identity_op, 2, Node[]),
    ],
)

@show apply(s)

p = Node(
    prod_op,
    0,
    Node[
        Node(identity_op, 3, Node[]),
        Node(identity_op, 4, Node[]),
    ],
)

@show apply(p)

g = Node(
    prod_op,
    0,
    Node[
        s,
        p,
    ],
)

@show apply(g)

using BenchmarkTools
@btime apply($s);
@btime apply($g);
``````

This has the following results:

``````julia
julia> @btime apply($s);
  140.432 ns (4 allocations: 80 bytes)

julia> @btime apply($g);
  221.732 ns (4 allocations: 80 bytes)
``````

Focus your debugging effort on the two main differences from your implementation:

1. Every Node has children of type Node, so traversing the children can't introduce variation in types.
2. `apply` is explicitly written in terms of a normal branch, not type-based dispatch.

My code isnβt particularly nice and could be improve a lot, but it should show you that some of the issues youβre seeing are intrinsic to specific design choices youβve made that make runtime type dispatch inevitable.


Thank you.

It appears there is still type instability here?

``````julia
julia> @code_warntype apply(s)
Variables
  #self#::Core.Compiler.Const(apply, false)
  n::Node

Body::Any
1 ─ %1  = Base.getproperty(n, :op)::Op
│   %2  = (%1 === Main.identity_op)::Bool
└──       goto #3 if not %2
2 ─ %4  = Base.getproperty(n, :direct_value)::Int64
└──       return %4
3 ─ %6  = Base.getproperty(n, :op)::Op
│   %7  = (%6 === Main.sum_op)::Bool
└──       goto #5 if not %7
4 ─ %9  = Base.getproperty(n, :children)::Array{Node,1}
│   %10 = Base.Generator(Main.apply, %9)::Base.Generator{Array{Node,1},typeof(apply)}
│   %11 = Main.sum(%10)::Any
└──       return %11
5 ─ %13 = Base.getproperty(n, :children)::Array{Node,1}
│   %14 = Base.Generator(Main.apply, %13)::Base.Generator{Array{Node,1},typeof(apply)}
│   %15 = Main.prod(%14)::Any
└──       return %15
``````

That is surprising, but I suppose that might be what happens when one of the children arrays is empty. In practice you'd get an exception in that case, but ruling it out by using an explicit loop does fix it:

``````julia
julia> @enum Op identity_op sum_op prod_op

julia> struct Node
           op::Op
           direct_value::Int
           children::Vector{Node}
       end

julia> function apply(n::Node)
           if n.op === identity_op
               n.direct_value
           elseif n.op === sum_op
               s = 0
               for c in n.children
                   s += apply(c)
               end
               s
           else
               p = 1
               for c in n.children
                   p *= apply(c)
               end
               p
           end
       end
apply (generic function with 1 method)

julia> s = Node(
           sum_op,
           0,
           Node[
               Node(identity_op, 1, Node[]),
               Node(identity_op, 2, Node[]),
           ],
       )
Node(sum_op, 0, Node[Node(identity_op, 1, Node[]), Node(identity_op, 2, Node[])])

julia> @show apply(s)
apply(s) = 3
3

julia> p = Node(
           prod_op,
           0,
           Node[
               Node(identity_op, 3, Node[]),
               Node(identity_op, 4, Node[]),
           ],
       )
Node(prod_op, 0, Node[Node(identity_op, 3, Node[]), Node(identity_op, 4, Node[])])

julia> @show apply(p)
apply(p) = 12
12

julia> g = Node(
           prod_op,
           0,
           Node[
               s,
               p,
           ],
       )
Node(prod_op, 0, Node[Node(sum_op, 0, Node[Node(identity_op, 1, Node[]), Node(identity_op, 2, Node[])]), Node(prod_op, 0, Node[Node(identity_op, 3, Node[]), Node(identity_op, 4, Node[])])])

julia> @show apply(g)
apply(g) = 36
36

julia> using BenchmarkTools

julia> @btime apply($s);
14.620 ns (0 allocations: 0 bytes)

julia> @btime apply($g);
34.704 ns (0 allocations: 0 bytes)

julia> @code_warntype apply(s)
Variables
  #self#::Core.Const(apply)
  n::Node
  @_3::Union{Nothing, Tuple{Node, Int64}}
  @_4::Union{Nothing, Tuple{Node, Int64}}
  p::Int64
  s::Int64
  c@_7::Node
  c@_8::Node

Body::Int64
1 ──       Core.NewvarNode(:(@_3))
│          Core.NewvarNode(:(@_4))
│          Core.NewvarNode(:(p))
│          Core.NewvarNode(:(s))
│    %5  = Base.getproperty(n, :op)::Op
│    %6  = (%5 === Main.identity_op)::Bool
└───       goto #3 if not %6
2 ── %8  = Base.getproperty(n, :direct_value)::Int64
└───       return %8
3 ── %10 = Base.getproperty(n, :op)::Op
│    %11 = (%10 === Main.sum_op)::Bool
└───       goto #8 if not %11
4 ──       (s = 0)
│    %14 = Base.getproperty(n, :children)::Vector{Node}
│          (@_4 = Base.iterate(%14))
│    %16 = (@_4 === nothing)::Bool
│    %17 = Base.not_int(%16)::Bool
└───       goto #7 if not %17
5 ── %19 = @_4::Tuple{Node, Int64}
│          (c@_7 = Core.getfield(%19, 1))
│    %21 = Core.getfield(%19, 2)::Int64
│    %22 = s::Int64
│    %23 = Main.apply(c@_7)::Int64
│          (s = %22 + %23)
│          (@_4 = Base.iterate(%14, %21))
│    %26 = (@_4 === nothing)::Bool
│    %27 = Base.not_int(%26)::Bool
└───       goto #7 if not %27
6 ──       goto #5
7 ──       return s
8 ──       (p = 1)
│    %32 = Base.getproperty(n, :children)::Vector{Node}
│          (@_3 = Base.iterate(%32))
│    %34 = (@_3 === nothing)::Bool
│    %35 = Base.not_int(%34)::Bool
└───       goto #11 if not %35
9 ── %37 = @_3::Tuple{Node, Int64}
│          (c@_8 = Core.getfield(%37, 1))
│    %39 = Core.getfield(%37, 2)::Int64
│    %40 = p::Int64
│    %41 = Main.apply(c@_8)::Int64
│          (p = %40 * %41)
│          (@_3 = Base.iterate(%32, %39))
│    %44 = (@_3 === nothing)::Bool
│    %45 = Base.not_int(%44)::Bool
└───       goto #11 if not %45
10 ─       goto #9
11 ─       return p
``````

That works! Thanks!


While you already have an acceptable answer here (a branch on the operation value instead of type-based dispatch), you can achieve what you actually want, which is full compile-time optimisation of the graph. The secret is to not use a Vector to hold children that are also execution nodes, but rather a Tuple that you can specialise on for dispatch. A Vector always resolves to a single element type for everything in the container, whereas a Tuple is concretely typed for each element.
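You can see this difference directly in the types of the containers themselves:

```julia
# A Vector erases per-element types down to one element type (here the
# elements are promoted), while a Tuple records each element's concrete
# type as part of its own type:
typeof([1, 2.0])  # Vector{Float64}
typeof((1, 2.0))  # Tuple{Int64, Float64}
```

This is why the compiler can specialise `apply` per child when the children live in a Tuple, but not when they live in a Vector.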

I suggest looking at Chain from Flux for some ideas (Flux.jl/basic.jl at master · FluxML/Flux.jl · GitHub), plus some of the answers I got here: Executing array of functions much slower than executing functions in series

The intuition to develop for Julia is that containers of heterogeneous types used for method dispatch will always come at a cost, and there are three options:

1. Use FunctionWrappers or similar to stabilise the function interface, at the cost of losing any possibility of inlining
2. Manually dispatch based on types (i.e. write your own basic interpreter)
3. Lift the entire execution graph into the type system, usually by using Tuples and dispatching on each peeled-off element

There are other strategies, I'm sure, but that's what I've run into. Strategy 3 has the highest compile-time overhead and the lowest runtime overhead, but it is also the most difficult to grok.
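Here is a minimal sketch of strategy 3 under my own hypothetical names (`TNode` and the `:sum`/`:prod` symbols are illustrative, not from any package): the operation becomes a type parameter and the children live in a Tuple, so the whole graph shape is encoded in the type and `apply` can be fully inferred and unrolled.

```julia
# Strategy 3 sketch: lift the graph into the type system via Tuple children.
struct TNode{S, C<:Tuple}
    children::C
end
# Outer constructor: capture the children Tuple's concrete type in C.
TNode{S}(children...) where {S} = TNode{S, typeof(children)}(children)

apply(x::Number) = x                                   # leaf node
apply(n::TNode{:sum})  = sum(map(apply, n.children))   # map over a Tuple is unrolled
apply(n::TNode{:prod}) = prod(map(apply, n.children))

g = TNode{:prod}(TNode{:sum}(1, 2), TNode{:prod}(3, 4))
apply(g)  # == 36, with no runtime dispatch
```

Each distinct graph shape compiles its own specialised `apply`, which is exactly the compile-time-overhead/runtime-speed trade described above.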
