Strange performance issue on 1.9.0-rc3

I’m seeing a weird performance issue related to the mere presence of an if-statement in my function, regardless of whether the if-statement is ever actually triggered. I can only reproduce this on 1.9.0, not on earlier Julia versions.

I have a function called tree_mapreduce which has two different modes of behavior configured by the preserve_sharing flag:

function tree_mapreduce(
    args...;
    preserve_sharing::Bool=false,
    result_type::Type{RT}=Nothing,
) where {RT}
    if preserve_sharing && result_type != Nothing
        return _tree_mapreduce(args..., IdDict{Node,RT}())
    else
        return _tree_mapreduce(args...)
    end
end

(simplified version; here is the full implementation)

This function is used to copy my Node{T} type in the copy_node function. The inner _tree_mapreduce behaves slightly differently when given the extra IdDict argument. With this implementation, I get the following benchmark times:

using DynamicExpressions
using BenchmarkTools

# Create expression tree
operators = OperatorEnum(; binary_operators=[+, -, *, /], unary_operators=[cos, sin])
x1, x2, x3 = (i -> Node(Float64; feature=i)).(1:3)
tree = cos(x1 * 3.2 - 5.8) * 0.2 - 0.5 * x2 * x3 * x3 + 0.9 / (x1 * x1 + 1)

# Benchmark:
@btime copy_node(tree; preserve_sharing=false)

This gives me 682.752 ns (26 allocations: 2.02 KiB).

Note that preserve_sharing is set to false.

However, when I simply change the function to be

function tree_mapreduce(
    f_leaf::F1,
    f_branch::F2,
    op::G,
    tree::N;
    preserve_sharing::Bool=false,
    result_type::Type{RT}=Nothing,
) where {T,N<:Node{T},F1<:Function,F2<:Function,G<:Function,RT}
    #if preserve_sharing
    #    return _tree_mapreduce(f_leaf, f_branch, op, tree, IdDict{N,RT}())
    #else
    return _tree_mapreduce(f_leaf, f_branch, op, tree)
    #end
end

and re-run the benchmark, I obtain 285.015 ns (24 allocations: 1.88 KiB).

Why is the simple presence of that if-statement harming performance so dramatically? Note that with preserve_sharing=true the benchmark time is significantly higher still, so I don’t think that branch is actually being taken here.


Additional notes:

  • I ran the above benchmarks with -O3. Running without it seems to make the second benchmark slightly slower, which brings the two timings slightly closer together.
  • I ran the benchmarks with the return type annotated on tree_mapreduce, which didn’t seem to change the performance.
  • On Julia 1.8.5, the timings are nearly identical.
  • Another weird thing I noticed: if I am using Revise.jl, make this edit, and then revert it, the time of ~700 ns drops to ~460 ns, even though the function is identical, and even if I call the function many times in between. It’s almost as if compilation improves by Revising the function multiple times.

Edit: added the full function signature as it appears relevant


Probably not related, but shouldn’t tree be interpolated here for the benchmark?


You are correct. Sadly it doesn’t fix the weird results though.


More weirdness (/clues). If I manually expand the contents of copy_node, the performance gap goes away:

@btime(
  tree_mapreduce(
    t -> t.constant ? Node(; val=t.val::Float64) : Node(Float64; feature=t.feature),
    identity,
    (p, c...) -> Node(p.op, c...),
    ctree;
    preserve_sharing=false
  ),
  setup=(ctree=copy_node($tree))
)
# ~ 282 ns

BUT if I also manually pass the result_type, the performance gap comes back:

@btime(
  tree_mapreduce(
    t -> t.constant ? Node(; val=t.val::Float64) : Node(Float64; feature=t.feature),
    identity,
    (p, c...) -> Node(p.op, c...),
    ctree;
    preserve_sharing=false,
    result_type=Nothing
  ),
  setup=(ctree=copy_node($tree))
)
# ~474 ns

which is very strange, because result_type=Nothing is the default value!

I only glanced, but the implementation is clearly recursive. If types can change, inference hates recursion. Anything that hoists the final type into a caller will suddenly make inference much happier.

Try running Cthulhu over this and fix any inferrability problems you see. I think @jishnub has been doing exactly this across the ApproxFun ecosystem with excellent results (and he has been a faithful bug reporter in the Cthulhu repository; he’s a real hero for it).
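(Not part of the original reply; a minimal sketch of the kind of inference check being suggested, using a toy recursive function rather than the real tree_mapreduce.) Before reaching for Cthulhu, you can interrogate inference directly with Base.return_types:

```julia
# Toy recursive "tree" reduce over nested tuples; inference has to prove
# the result type stays stable through the recursion.
treesum(x::Number) = x
treesum(t::Tuple) = sum(treesum, t)

treesum((1, (2, 3), ((4,),)))            # 10
only(Base.return_types(treesum, (Int,))) # a concrete type here means inference succeeded
```

If the inferred type comes back as Any, that is the signal to descend into the call with Cthulhu and find where precision is lost.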


Thanks @tim.holy, I will try it out. The return type is (should be) constant throughout the recursion, but perhaps there is a part I need to annotate to help the compiler.

Okay I think I might have figured out the issue thanks to Cthulhu:

I did not realize this, but apparently Julia does not perform multiple dispatch on keyword arguments? Is that (still) true?

If so then I suppose I need to specify this result_type as a normal argument, rather than a kwarg. Otherwise it thinks the return type is Any, because result_type could be anything.

Yes, this is true, and I do not believe it will ever change, even in Julia 2.0 (which may never come). However, you can annotate the type of a keyword argument, which tells Julia that no other type can be passed in that method body.
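(My own minimal example, not from the reply above.) An annotated kwarg still can’t be dispatched on, but it does restrict what callers may pass, which helps inference inside the body:

```julia
# `preserve_sharing` may only ever be a Bool inside this body:
g(x; preserve_sharing::Bool=false) = preserve_sharing ? 2x : x

g(3)                         # 3
g(3; preserve_sharing=true)  # 6
# g(3; preserve_sharing=1)   # throws TypeError: 1 is not a Bool
```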


It does specialize on kwargs, in the sense that supplying kwarg=1 and kwarg=1.0 will generate different compiled code. Likewise, as Henrique says, you can constrain the types of kwargs. But you can’t dispatch on them (dispatch = controlling which method gets called), because you can’t have two methods with identical positional arguments but different sets of kwargs; and when there’s only one method, there’s nothing to dispatch on.
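(A hypothetical two-line example, not from the thread.) The "one method per positional signature" point is easy to see directly: a second definition differing only in its kwargs replaces the first rather than adding a new method:

```julia
f(x; a=1) = x + a
f(x; b=2) = x - b   # same positional signature, so this *replaces* the method above

length(methods(f))  # 1
f(10; b=3)          # 7
# f(10; a=1)        # MethodError: the kwarg-`a` variant no longer exists
```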

In this case you’re probably encountering weird issues with specialization heuristics. Julia specializes on x by compiling a version for typeof(x), but when x is a Type then typeof(x) is DataType, which seems not to be what you want to happen here. The Type{RT} annotation is the escape hatch, but things are confusing here because result_type and RT are nominally identical yet may be subject to different specialization heuristics and behavior with respect to this escape hatch. Indeed, it seems the requirement to pass kwargs by name is causing the issue.
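(An illustrative sketch with made-up function names, following the description above.) You can watch both the DataType erasure and the Type{T} escape hatch at work in inference:

```julia
typeof(Float64) === DataType        # true: the *value* Float64 is just a DataType

erased(T) = T                       # plain argument: inference only sees DataType
precise(::Type{T}) where {T} = T    # Type{T} escape hatch: T is a static parameter

only(Base.return_types(erased, (DataType,)))        # DataType
only(Base.return_types(precise, (Type{Float64},)))  # Type{Float64}
```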


Thanks @Henrique_Becker and @tim.holy, that is helpful to know. That is indeed more complex of an issue than I was expecting. I’m glad we figured it out though.

For posterity, I see the speeds go back to normal with the following modification:

 function tree_mapreduce(
+    result_type::Type{RT}=Nothing,
     args...;
     preserve_sharing::Bool=false,
-    result_type::Type{RT}=Nothing,
-)
+) where {RT}
     if preserve_sharing && result_type != Nothing
         return _tree_mapreduce(args..., IdDict{Node,RT}())
     else
         return _tree_mapreduce(args...)
     end
 end

which gives me the same speeds as if the branching code was commented out.
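(A self-contained toy version of the fix above, using stand-in helpers rather than the real DynamicExpressions internals, just to show the positional result_type signature is valid and runs both ways:)

```julia
_tm(args...) = sum(args)                     # stand-in for _tree_mapreduce
_tm_shared(d::IdDict, args...) = sum(args)   # stand-in for the sharing-aware variant

function tree_mr(result_type::Type{RT}=Nothing, args...; preserve_sharing::Bool=false) where {RT}
    if preserve_sharing && result_type != Nothing
        return _tm_shared(IdDict{Any,RT}(), args...)
    else
        return _tm(args...)
    end
end

tree_mr(1, 2, 3)                                  # 6 (result_type defaults to Nothing)
tree_mr(Float64, 1, 2, 3; preserve_sharing=true)  # 6, via the sharing branch
```

Because result_type is annotated ::Type, a call whose first argument is not a Type falls through to the generated default method, so existing call sites keep working.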
