# How to write type-stable dataflow graphs

Iβm trying to implement a dataflow graph for basic arithmetic and encountering type instability for all but trivially simple graphs. How to rewrite this in a type-stable way?

``````julia
module Inst # for "type-instability"

import Base.:*

abstract type Op end # trait; indicates the arithmetic operation performed
struct Sum <: Op end
struct Prod <: Op end

struct Node{S<:Op, C} # a node in the graph; C is the type of the children
    children::Vector{C}
end
Node(T::Type{S}, children::Vector{C}) where {S<:Op, C} = Node{T, C}(children)
*(g::Node{S, C}, xx) where {S, C} = Node{S, C}([xx * tt for tt in g.children])
*(xx, g::Node) = *(g, xx)

# apply executes the operation
apply(g::Node{S, C}) where {S<:Op, C} = apply(S(), g)
apply(::Sum, g) = sum(apply.(g.children))
apply(::Prod, g) = prod(apply.(g.children)) # recurse apply into children
apply(x::T) where {T<:Number} = x # evaluate leaf nodes of the graph
end

using .Inst
# verify correctness of implementation...
s = Inst.Node(Inst.Sum, [1, 2])
@show Inst.apply(s)
p = Inst.Node(Inst.Prod, [3, 4])
@show Inst.apply(p)
g = Inst.Node(Inst.Prod, [s, p])
@show Inst.apply(g)

@code_warntype Inst.apply(s) # type stable
@code_warntype Inst.apply(p) # type stable
julia> @code_warntype Inst.apply(g) #<-- why is there type instability here?
Variables
  #self#::Core.Compiler.Const(Main.Inst.apply, false)
  g::Main.Inst.Node{Main.Inst.Prod,Main.Inst.Node{S,Int64} where S<:Main.Inst.Op}

Body::Any
1 ─ %1 = ($(Expr(:static_parameter, 1)))()::Core.Compiler.Const(Main.Inst.Prod(), false)
│   %2 = Main.Inst.apply(%1, g)::Any
└──      return %2

``````

Profiling shows that `apply(g)` is much slower, presumably due to type instability.

``````julia
julia> using BenchmarkTools

julia> @btime Inst.apply(s) ;
  66.834 ns (1 allocation: 96 bytes)

julia> @btime Inst.apply(g) ;
  251.530 ns (5 allocations: 320 bytes)

``````

Can you say more? It seems like your design is to use separate types for things like `Sum` and `Prod` that presumably vary a lot at runtime, so it seems like your intention is to incur lots of runtime-dispatch costs. What's the type instability that you're focused on getting rid of?


Thanks for having a look. I've updated the post above with `@btime` results for each call, and the output of `@code_warntype`. I'm simply trying to speed up the call to `apply(g)`.

You seem to be suggesting that this problem is inherently type-unstable? But doesn't Julia solve this problem when it compiles regular nested arithmetic expressions, such as `x = 3*(4+2) + 7`?

I think you want to remove the type-based dispatch if you want this to be faster: instead of something like `Op`, `Sum`, and `Prod` as distinct types, you might make a single type that uses an enum to capture the operation and branch on that.

I donβt quite understand what you mean here? Are you implicitly assuming that Juliaβs internal representation of arithmetic looks like what youβre doing?


Fascinating. I'll give it a try. Somehow I had adopted the view that type-based dispatch always means faster runtime performance (because it enables the interpreter to avoid method lookup). A misunderstanding on my part? Thanks for your help.

That seems like a misunderstanding. I believe the right rule is something closer to: if you can do X at compile time, you pay the cost at compile time and not at run time; if you can't do X at compile time, you pay the cost at run time. The cost of runtime type dispatch is likely higher than the cost of a simple three-way branch on an integer code, although the details of the dispatch matter a lot: [ANN] ManualDispatch.jl
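To make that rule concrete, here is a minimal sketch of the tradeoff (the names are illustrative, not taken from ManualDispatch.jl): the same three-way choice written once as runtime type dispatch over an abstractly typed container, and once as a branch on an integer code.

```julia
# Illustrative sketch: runtime type dispatch vs. a value branch.
abstract type AbstractOp end
struct SumOp <: AbstractOp end
struct ProdOp <: AbstractOp end

run_op(::SumOp, xs) = sum(xs)
run_op(::ProdOp, xs) = prod(xs)

# Abstract element type: each `run_op` call below needs a runtime method lookup,
# because the compiler cannot know which concrete op each element is.
ops = AbstractOp[SumOp(), ProdOp()]
dispatched = [run_op(op, [1, 2, 4]) for op in ops]       # [7, 8]

# Value-coded version: the method lookup becomes one integer comparison.
run_op(code::Int, xs) = code == 1 ? sum(xs) : prod(xs)
branched = [run_op(code, [1, 2, 4]) for code in (1, 2)]  # [7, 8]
```

Both compute the same results; the difference is only where the "which operation?" question gets answered.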

In addition, I think method lookup and type-based dispatch are the same thing under most definitions of those terms.


This can work. A nice exposition of the problem and this approach (or others) would be useful.


Yes, commenting here made me realize I'm not on top of the landscape of options anymore, or their relative performance.

I tried this, but it didn't help as much as I hoped. Although it removed the `Any`s in `@code_warntype`, run time and memory allocation are similar to the original implementation. As @johnmyleswhite suggested, the new implementation dispatches based on the value of an enum:

``````julia
module Inst # for "type-instability"

@enum Op Sum Prod

struct Node{C} # a node in the graph; C is the type of the children
    op::Op
    children::Vector{C}
end

# apply executes the operation
apply(x::T) where {T<:Number} = x # evaluate leaf nodes of the graph
op(g::Node) = g.op
apply(g::Node{C}) where {C} = apply(op(g), g)
# branch on the enum value, recursing apply into the children
apply(o::Op, g) = o === Sum ? sum(apply.(g.children)) : prod(apply.(g.children))
end

using .Inst
# verify correctness of implementation...
s = Inst.Node(Inst.Sum, [1, 2])
@show Inst.apply(s)
p = Inst.Node(Inst.Prod, [3, 4])
@show Inst.apply(p)
g = Inst.Node(Inst.Prod, [s, p])
@show Inst.apply(g)

using BenchmarkTools
@btime Inst.apply($s);
@btime Inst.apply($g);
``````

gives

``````julia
  44.344 ns (1 allocation: 96 bytes)
  135.021 ns (3 allocations: 288 bytes)
``````

There is a modest improvement for `apply($g)`, but not as much as I hoped; substantial memory allocation also remains. Compare with the original implementation for these two operations, which I've transcribed from above.

``````julia
julia> @btime Inst.apply(s) ;
  66.834 ns (1 allocation: 96 bytes)

julia> @btime Inst.apply(g) ;
  251.530 ns (5 allocations: 320 bytes)
``````

What am I doing wrong? Thanks!

Iβm a bit pressed for time to fully debug your issues, but hereβs an example of an approach without the concerns youβre raising:

``````julia
@enum Op identity_op sum_op prod_op

struct Node
    op::Op
    direct_value::Int
    children::Vector{Node}
end

function apply(n::Node)
    if n.op === identity_op
        n.direct_value
    elseif n.op === sum_op
        sum((apply(c) for c in n.children))
    else
        prod((apply(c) for c in n.children))
    end
end

s = Node(
    sum_op,
    0,
    Node[
        Node(identity_op, 1, Node[]),
        Node(identity_op, 2, Node[]),
    ],
)

@show apply(s)

p = Node(
    prod_op,
    0,
    Node[
        Node(identity_op, 3, Node[]),
        Node(identity_op, 4, Node[]),
    ],
)

@show apply(p)

g = Node(
    prod_op,
    0,
    Node[
        s,
        p,
    ],
)

@show apply(g)

using BenchmarkTools
@btime apply($s);
@btime apply($g);
``````

This has the following results:

``````julia
julia> @btime apply($s);
  140.432 ns (4 allocations: 80 bytes)

julia> @btime apply($g);
  221.732 ns (4 allocations: 80 bytes)
``````

Focus your debugging effort on the two main differences from your implementation:

1. Every Node has children of type Node, so traversing the children can't introduce variation in types.
2. `apply` is explicitly written in terms of a normal branch, not type-based dispatch.

My code isnβt particularly nice and could be improve a lot, but it should show you that some of the issues youβre seeing are intrinsic to specific design choices youβve made that make runtime type dispatch inevitable.


Thank you.

It appears there is still type instability here?

``````julia
julia> @code_warntype apply(s)
Variables
  #self#::Core.Compiler.Const(apply, false)
  n::Node

Body::Any
1 ─ %1  = Base.getproperty(n, :op)::Op
│   %2  = (%1 === Main.identity_op)::Bool
└──       goto #3 if not %2
2 ─ %4  = Base.getproperty(n, :direct_value)::Int64
└──       return %4
3 ─ %6  = Base.getproperty(n, :op)::Op
│   %7  = (%6 === Main.sum_op)::Bool
└──       goto #5 if not %7
4 ─ %9  = Base.getproperty(n, :children)::Array{Node,1}
│   %10 = Base.Generator(Main.apply, %9)::Base.Generator{Array{Node,1},typeof(apply)}
│   %11 = Main.sum(%10)::Any
└──       return %11
5 ─ %13 = Base.getproperty(n, :children)::Array{Node,1}
│   %14 = Base.Generator(Main.apply, %13)::Base.Generator{Array{Node,1},typeof(apply)}
│   %15 = Main.prod(%14)::Any
└──       return %15
``````

That is surprising, but I suppose that might be what happens when one of the children arrays is empty. In practice you'd get an exception in that case, but ruling it out by using an explicit loop does fix it:

``````julia
julia> @enum Op identity_op sum_op prod_op

julia> struct Node
           op::Op
           direct_value::Int
           children::Vector{Node}
       end

julia> function apply(n::Node)
           if n.op === identity_op
               n.direct_value
           elseif n.op === sum_op
               s = 0
               for c in n.children
                   s += apply(c)
               end
               s
           else
               p = 1
               for c in n.children
                   p *= apply(c)
               end
               p
           end
       end
apply (generic function with 1 method)

julia> s = Node(
           sum_op,
           0,
           Node[
               Node(identity_op, 1, Node[]),
               Node(identity_op, 2, Node[]),
           ],
       )
Node(sum_op, 0, Node[Node(identity_op, 1, Node[]), Node(identity_op, 2, Node[])])

julia> @show apply(s)
apply(s) = 3
3

julia> p = Node(
           prod_op,
           0,
           Node[
               Node(identity_op, 3, Node[]),
               Node(identity_op, 4, Node[]),
           ],
       )
Node(prod_op, 0, Node[Node(identity_op, 3, Node[]), Node(identity_op, 4, Node[])])

julia> @show apply(p)
apply(p) = 12
12

julia> g = Node(
           prod_op,
           0,
           Node[
               s,
               p,
           ],
       )
Node(prod_op, 0, Node[Node(sum_op, 0, Node[Node(identity_op, 1, Node[]), Node(identity_op, 2, Node[])]), Node(prod_op, 0, Node[Node(identity_op, 3, Node[]), Node(identity_op, 4, Node[])])])

julia> @show apply(g)
apply(g) = 36
36

julia> using BenchmarkTools

julia> @btime apply($s);
14.620 ns (0 allocations: 0 bytes)

julia> @btime apply($g);
34.704 ns (0 allocations: 0 bytes)

julia> @code_warntype apply(s)
Variables
  #self#::Core.Const(apply)
  n::Node
  @_3::Union{Nothing, Tuple{Node, Int64}}
  @_4::Union{Nothing, Tuple{Node, Int64}}
  p::Int64
  s::Int64
  c@_7::Node
  c@_8::Node

Body::Int64
1 ──       Core.NewvarNode(:(@_3))
│          Core.NewvarNode(:(@_4))
│          Core.NewvarNode(:(p))
│          Core.NewvarNode(:(s))
│    %5  = Base.getproperty(n, :op)::Op
│    %6  = (%5 === Main.identity_op)::Bool
└───       goto #3 if not %6
2 ── %8  = Base.getproperty(n, :direct_value)::Int64
└───       return %8
3 ── %10 = Base.getproperty(n, :op)::Op
│    %11 = (%10 === Main.sum_op)::Bool
└───       goto #8 if not %11
4 ──       (s = 0)
│    %14 = Base.getproperty(n, :children)::Vector{Node}
│          (@_4 = Base.iterate(%14))
│    %16 = (@_4 === nothing)::Bool
│    %17 = Base.not_int(%16)::Bool
└───       goto #7 if not %17
5 ── %19 = @_4::Tuple{Node, Int64}
│          (c@_7 = Core.getfield(%19, 1))
│    %21 = Core.getfield(%19, 2)::Int64
│    %22 = s::Int64
│    %23 = Main.apply(c@_7)::Int64
│          (s = %22 + %23)
│          (@_4 = Base.iterate(%14, %21))
│    %26 = (@_4 === nothing)::Bool
│    %27 = Base.not_int(%26)::Bool
└───       goto #7 if not %27
6 ──       goto #5
7 ──       return s
8 ──       (p = 1)
│    %32 = Base.getproperty(n, :children)::Vector{Node}
│          (@_3 = Base.iterate(%32))
│    %34 = (@_3 === nothing)::Bool
│    %35 = Base.not_int(%34)::Bool
└───       goto #11 if not %35
9 ── %37 = @_3::Tuple{Node, Int64}
│          (c@_8 = Core.getfield(%37, 1))
│    %39 = Core.getfield(%37, 2)::Int64
│    %40 = p::Int64
│    %41 = Main.apply(c@_8)::Int64
│          (p = %40 * %41)
│          (@_3 = Base.iterate(%32, %39))
│    %44 = (@_3 === nothing)::Bool
│    %45 = Base.not_int(%44)::Bool
└───       goto #11 if not %45
10 ─       goto #9
11 ─       return p
``````

That works! Thanks!


While you already have an acceptable answer here (a branch on the operation value instead of type-based dispatch), you can achieve what you actually want, which is full compile-time optimisation of the graph. The secret is to not use a Vector to hold children that are also execution nodes, but rather a Tuple that you can specialise on for dispatch. A Vector always resolves to a single element type for everything in the container, whereas a Tuple is concretely typed for each element.
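You can see this difference directly in the types of the containers themselves:

```julia
# A Vector erases per-element types down to one element type (here the
# elements are promoted), while a Tuple records each element's concrete
# type as part of its own type:
typeof([1, 2.0])  # Vector{Float64}
typeof((1, 2.0))  # Tuple{Int64, Float64}
```

This is why the compiler can specialise `apply` per child when the children live in a Tuple, but not when they live in a Vector.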

I suggest looking at Chain from Flux for some ideas (Flux.jl/basic.jl at master · FluxML/Flux.jl · GitHub), plus some of the answers I got here: Executing array of functions much slower than executing functions in series

The intuition to develop for Julia is that containers of heterogeneous types used for method dispatch will always come at a cost, and there are three options:

1. Use FunctionWrappers or similar to stabilise the function interface, at the cost of losing any possibility of inlining
2. Manually dispatch based on types (i.e. write your own basic interpreter)
3. Lift the entire execution graph into the type system, usually by using Tuples and dispatching on each peeled-off element

There are other strategies, I'm sure, but that's what I've run into. Strategy 3 has the highest compile-time overhead and the lowest runtime overhead, but it is also the most difficult to grok.
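Here is a minimal sketch of strategy 3 under my own hypothetical names (`TNode` and the `:sum`/`:prod` symbols are illustrative, not from any package): the operation becomes a type parameter and the children live in a Tuple, so the whole graph shape is encoded in the type and `apply` can be fully inferred and unrolled.

```julia
# Strategy 3 sketch: lift the graph into the type system via Tuple children.
struct TNode{S, C<:Tuple}
    children::C
end
# Outer constructor: capture the children Tuple's concrete type in C.
TNode{S}(children...) where {S} = TNode{S, typeof(children)}(children)

apply(x::Number) = x                                   # leaf node
apply(n::TNode{:sum})  = sum(map(apply, n.children))   # map over a Tuple is unrolled
apply(n::TNode{:prod}) = prod(map(apply, n.children))

g = TNode{:prod}(TNode{:sum}(1, 2), TNode{:prod}(3, 4))
apply(g)  # == 36, with no runtime dispatch
```

Each distinct graph shape compiles its own specialised `apply`, which is exactly the compile-time-overhead/runtime-speed trade described above.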
