Splatting arguments causes ~30x slowdown

I found the performance characteristics unintuitive when splatting arguments into function calls. On Julia v1.0.1 on Ubuntu 16.04 I get the following benchmark and numbers:

using BenchmarkTools

function k1(x1, x2, x3) # a kernel
    x1 + x2 * x3
end

function apply0!(f, y, args...) # wish this could be fast
    for i in eachindex(args[1])
        # y[i*2 - i%2] = f(getindex.(args, i)...) # slower than map
        y[i*2 - i%2] = f(map(a->a[i], args)...)  # strange y index to make it not trivially @. broadcastable
    end
end

function apply1!(f, y, args...) # this is not generic but fast
    for i in eachindex(args[1])
        y[i*2 - i%2] = f(args[1][i], args[2][i], args[3][i])
    end
end

function apply2!(f, y, x1, x2, x3) # same performance as apply1! but even less generic
    for i in eachindex(x1)
        y[i*2 - i%2] = f(x1[i], x2[i], x3[i])
    end
end

function apply3!(f, y, args...) # performance is even worse, although ith_all is inferable
    for i in eachindex(args[1])
        y[i*2 - i%2] = f(Base.ith_all(i, args)...)
    end
end

function apply4!(f, y, args...) # surprisingly this doesn't help either, since _map_i should dispatch to the 3-argument variant
    for i in eachindex(args[1])
        y[i*2 - i%2] = _map_i(f, i, args...)
    end
end

@inline _map_i(f, i, x1) = f(x1[i])
@inline _map_i(f, i, x1, x2) = f(x1[i], x2[i])
@inline _map_i(f, i, x1, x2, x3) = f(x1[i], x2[i], x3[i])
@inline _map_i(f, i, x1, x2, x3, x4, args...) = f(x1[i], x2[i], x3[i], x4[i], getindex.(args, i)...)

n1, n2 = 1000,1000
x1 = randn(n1,n2)
x2 = randn(n1,n2)
x3 = randn(n1,n2)
y  = similar(x1, n1*2, n2)

@btime apply0!($k1, $y, $x1, $x2, $x3) # 110.086 ms (6999746 allocations: 122.07 MiB)
@btime apply1!($k1, $y, $x1, $x2, $x3) # 3.254 ms (0 allocations: 0 bytes)
@btime apply2!($k1, $y, $x1, $x2, $x3) # 3.209 ms (0 allocations: 0 bytes)
@btime apply3!($k1, $y, $x1, $x2, $x3) # 731.212 ms (12999234 allocations: 244.13 MiB)
@btime apply4!($k1, $y, $x1, $x2, $x3) # 230.310 ms (4998724 allocations: 76.27 MiB)

I can use a generated function for this specific case for now, but I think this performance penalty is quite surprising to newcomers. Generated functions also introduce other drawbacks (not Revise.jl-able, may fail with PackageCompiler, longer load times, etc.). So here are my questions:

  1. Is there a way to avoid the performance penalty without using generated functions?
  2. Is there a mental model that can help one predict the performance of the generated Julia code? In the above example all apply#! versions are type stable, and they look almost identical under @code_warntype, yet they have vastly different performance. In particular, apply0!, apply3! and apply4! are all slow, but what makes one even slower than another?
  3. As a newbie I often find it hard to write code that is BOTH generic and performant without resorting to metaprogramming. What tricks may I be missing? To name a few that I more or less know about: the Holy trait pattern and the recursive-tail pattern.
3 Likes

Great question! You are running up against the specialization heuristics of the compiler: basically, the question is “should Julia compile a specialized version of every single function for every single argument combination?” In some cases that’s counterproductive, so at a certain point Julia’s compiler decides to punt and use runtime dispatch.

Here you have two challenges to specialization:

  • splatting
  • the f argument (which is an arbitrary function)

To be clear: to get ideal performance, the compiler has to generate a specialized version for every combination of args types and for every different f, and in this case either issue alone is enough to prevent specialization (especially because f is called with varargs; that might not happen in other cases). Consequently, you have to “solve” both problems to see great performance.

The following modifications suffice for me:

function apply4!(f::F, y, args::Vararg{T,N}) where {F,T,N}
    for i in eachindex(args[1])
        y[i*2 - i%2] = _map_i(f, i, args...)
    end
    return y
end

@inline _map_i(f::F, i, x1) where F = f(x1[i])
@inline _map_i(f::F, i, x1, x2) where F = f(x1[i], x2[i])
@inline _map_i(f::F, i, x1, x2, x3) where F = f(x1[i], x2[i], x3[i])
@inline _map_i(f::F, i, x1, x2, x3, x4, args...) where F = f(x1[i], x2[i], x3[i], x4[i], getindex.(args, i)...)

I did two things here:

  • the f::F syntax seems useless, but it turns out to force specialization on the function argument
  • the ::Vararg{T,N} forces it to specialize the function for all different numbers of input arguments.

With these two changes I get the same performance from apply4! that I get from apply1! and apply2!.
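
As a quick hedged sanity check (assuming the arrays and k1 from the original post are defined, and calling the function once first so compilation is excluded), @allocated should then report zero for the specialized version:

apply4!(k1, y, x1, x2, x3)              # warm-up call; compilation happens here
@allocated apply4!(k1, y, x1, x2, x3)   # expected to report 0 once both specializations kick in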

This might be more intuitive if we had a @specialize macro, so that one could write

function apply4!(@specialize(f), y, @specialize(args...))
    ...
end
26 Likes

I agree. I think there’s pretty consistent messaging that specifying types of function arguments is only for dispatch and documentation, not performance, but this seems to be a counterexample. A @specialize macro would also be nicely symmetric with @nospecialize.

3 Likes

Thank you for the excellent insight! I did a few more tests based on your comment and have the following findings; code samples are appended below.

  1. All apply#! variants benefit from the type annotation. Changing their signatures to the following brings them all to the same level of performance as apply2!.
    function apply!(f::F, y, args::Vararg{Any,N}) where{F,N}
  2. Even the @generated version of apply! (apply5!) has poor performance without the type annotation. This version only uses @generated to unroll the splatting in the loop.
  3. Another @generated version (apply6!) performs well even without the type annotation. This version unrolls all splatting and slurping.

I have looked at the @code_warntype output of all versions, with and without the type annotation. They are all similar, sometimes identical. However, their @code_llvm looks vastly different (though I am not knowledgeable enough to make sense of it). I cannot figure out by myself what is causing the performance hit. Is there a way I could have spotted the type-annotation fix like you did, but without your profound understanding of Julia internals? Links/pointers to code/readings are immensely appreciated.

Updated apply! variants all yielding the same (good) performance:

using BenchmarkTools

function k1(x1, x2, x3)
    x1 + x2 * x3
end

function apply00!(f::F, y, args::Vararg{Any,N}) where{F,N}
    for i in eachindex(args[1])
        y[i*2] = f(getindex.(args, i)...)
    end
end

function apply0!(f::F, y, args::Vararg{Any,N}) where{F,N}
    for i in eachindex(args[1])
        y[i*2 - i%2] = f(map(a->a[i], args)...)
    end
end

function apply1!(f, y, args...)
    for i in eachindex(args[1])
        y[i*2 - i%2] = f(args[1][i], args[2][i], args[3][i])
    end
end

function apply2!(f, y, x1, x2, x3)
    for i in eachindex(x1)
        y[i*2 - i%2] = f(x1[i], x2[i], x3[i])
    end
end

function apply3!(f::F, y, args::Vararg{Any,N}) where{F,N}
    for i in eachindex(args[1])
        y[i*2 - i%2] = f(Base.ith_all(i, args)...)
    end
end

function apply4!(f::F, y, args::Vararg{Any,N}) where{F,N}
    for i in eachindex(args[1])
        y[i*2 - i%2] = _map_i(f, i, args...)
    end
end

@inline _map_i(f, i, x1) = f(x1[i])
@inline _map_i(f, i, x1, x2) = f(x1[i], x2[i])
@inline _map_i(f, i, x1, x2, x3) = f(x1[i], x2[i], x3[i])
@inline _map_i(f, i, x1, x2, x3, x4, args...) = f(x1[i], x2[i], x3[i], x4[i], getindex.(args, i)...)

function _gmap_i_impl(targs::NTuple{N}) where N
    # builds the expression f(args[1][i], args[2][i], ..., args[N][i])
    Expr(:call, :f, (:(args[$i][i]) for i in 1:N)...)
end

@generated function _gmap_i(f, i, args...) # unrolls only the inner splatted call
    _gmap_i_impl(args)
end

function apply5!(f::F, y, args::Vararg{Any,N}) where{F,N}
    for i in eachindex(args[1])
        y[i*2 - i%2] = _gmap_i(f, i, args...)
    end
end

function _gapply_impl(targs::NTuple{N}) where N
    # builds the whole loop body with the splatted call already unrolled
    ex = _gmap_i_impl(targs)
    :(for i in eachindex(args[1]); y[i*2-i%2]=$ex; end)
end

@generated function apply6!(f, y, args...) # unrolls all splatting and slurping
    _gapply_impl(args)
end

n1, n2 = 1000,1000
x1 = randn(n1,n2)
x2 = randn(n1,n2)
x3 = randn(n1,n2)
y  = similar(x1, n1*2, n2)

@btime apply00!($k1, $y, $x1, $x2, $x3) # 3.203 ms (0 allocations: 0 bytes)
@btime apply0!($k1, $y, $x1, $x2, $x3)  # 3.253 ms (0 allocations: 0 bytes)
@btime apply1!($k1, $y, $x1, $x2, $x3)  # 3.251 ms (0 allocations: 0 bytes)
@btime apply2!($k1, $y, $x1, $x2, $x3)  # 3.219 ms (0 allocations: 0 bytes)
@btime apply3!($k1, $y, $x1, $x2, $x3)  # 3.260 ms (0 allocations: 0 bytes)
@btime apply4!($k1, $y, $x1, $x2, $x3)  # 3.242 ms (0 allocations: 0 bytes)
@btime apply5!($k1, $y, $x1, $x2, $x3)  # 3.248 ms (0 allocations: 0 bytes)
@btime apply6!($k1, $y, $x1, $x2, $x3)  # 3.156 ms (0 allocations: 0 bytes)

Unfortunately, @code_typed (and its cousins) are in my experience quite bad at detecting this type of failure; they typically print out as if everything were fully specialized. Wrapping things in one extra layer of function and calling @code_warntype on the wrapper can sometimes help.

4 Likes

Thanks for the suggestion! Here are a few things I tried, but @code_warntype still doesn’t give a clue:

function apply00_!(f, y, args...)
    for i in eachindex(args[1])
        y[i*2] = f(getindex.(args, i)...)
    end
end

function apply00!(f::F, y, args::Vararg{Any,N}) where{F,N}
    for i in eachindex(args[1])
        y[i*2] = f(getindex.(args, i)...)
    end
end

function passon(f, args...)
    f(args...)
end

function p1(f, y, x1, x2, x3)
    apply00!(f, y, x1, x2, x3)
end

function p1_(f, y, x1, x2, x3)
    apply00_!(f, y, x1, x2, x3)
end

# code_warntype shows identical code
@code_warntype p1(k1, y, x1, x2, x3)
@code_warntype p1_(k1, y, x1, x2, x3)
@code_warntype passon(apply00!, k1, y, x1, x2, x3)
@code_warntype passon(apply00_!, k1, y, x1, x2, x3)
@btime p1(k1, y, x1, x2, x3) # 3.104 ms (0 allocations: 0 bytes)
@btime p1_(k1, y, x1, x2, x3) # 396.373 ms (9998724 allocations: 183.09 MiB)

I agree this is more difficult to diagnose than it ideally should be. I generally get the most value out of the profiler: in ProfileView you’ll see the time dominated by red bars, or with Profile.print(C=true) you’ll see calls to jl_apply_generic or jl_invoke. In my experience, when you get runtime dispatch (which is what these are revealing) yet @code_warntype is clean, you can pretty much count on it being the specialization heuristics.
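
A hedged sketch of that workflow, assuming the definitions from the earlier posts (ProfileView is a separate package and optional here):

using Profile
# using ProfileView

Profile.clear()
@profile p1_(k1, y, x1, x2, x3)
Profile.print(C=true)   # look for jl_apply_generic / jl_invoke frames: they indicate runtime dispatch
# ProfileView.view()    # in the graphical view, runtime dispatch dominates as red bars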

As here, the answer is to start adding type annotations. In addition to the ones above, the other main example would be changing foo(t) to foo(t::TT) where TT<:Tuple when you know t is a tuple, since there are also heuristics that prevent runaway specialization on tuple arguments. (EDIT: that used to be true, but it doesn’t seem to be anymore, if I’m reading the C source correctly.)
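
For illustration, the signature change looks like this (foo is a hypothetical function, not from the thread):

# plain version; may fall under the tuple-specialization heuristic
foo(t) = first(t) + last(t)
# annotated version; forces specialization on the concrete tuple type
foo(t::TT) where {TT<:Tuple} = first(t) + last(t)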

But I still recommend against adding type annotations to everything; this doesn’t come up all that often, and there are good reasons for the limits on the amount of specialization. Since you can force it in performance-critical situations, the current state is actually quite good; the only real problem is that it would be nice to have an easier way to diagnose this issue.

6 Likes

Seconding Tim’s recommendation for the profiler. Also, if you use Atom/Juno you can use the @profiler macro to get a really nice display in Atom, which also draws background bars directly in the source code showing how much relative time each line takes. Very slick.
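
For example (illustrative; assumes the Juno package is available in the Atom setup):

using Juno                              # may be unnecessary if @profiler is already in scope in the Juno REPL
@profiler p1_(k1, y, x1, x2, x3)        # opens the profile pane and annotates source lines with time bars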

2 Likes

Is this behaviour that can be relied on, or is it just something that the current version of the compiler happens to do?

1 Like

No promises, but it’s been true for quite a few releases, as long as I can remember being aware of this issue.

1 Like

There is some rather technical information about this in the following manual chapter:

https://docs.julialang.org/en/v1.0/devdocs/functions/

under the heading “Compiler Efficiency Issues”.

2 Likes