I find the performance characteristics of splatting arguments into function calls unintuitive. On Julia v1.0.1 (Ubuntu 16.04) I get the following benchmark numbers:

```
using BenchmarkTools
function k1(x1, x2, x3) # a kernel
    x1 + x2 * x3
end
function apply0!(f, y, args...) # wish this could be fast
    for i in eachindex(args[1])
        # y[i*2 - i%2] = f(getindex.(args, i)...) # slower than map
        y[i*2 - i%2] = f(map(a->a[i], args)...) # strange y index so that it is not trivially @. broadcastable
    end
end
function apply1!(f, y, args...) # this is not generic but fast
    for i in eachindex(args[1])
        y[i*2 - i%2] = f(args[1][i], args[2][i], args[3][i])
    end
end
function apply2!(f, y, x1, x2, x3) # same performance as apply1! but even less generic
    for i in eachindex(x1)
        y[i*2 - i%2] = f(x1[i], x2[i], x3[i])
    end
end
function apply3!(f, y, args...) # performance is even worse, although ith_all is inferable
    for i in eachindex(args[1])
        y[i*2 - i%2] = f(Base.ith_all(i, args)...)
    end
end
function apply4!(f, y, args...) # surprisingly this doesn't help either, even though _map_i should dispatch to the 3-arg variant
    for i in eachindex(args[1])
        y[i*2 - i%2] = _map_i(f, i, args...)
    end
end
@inline _map_i(f, i, x1) = f(x1[i])
@inline _map_i(f, i, x1, x2) = f(x1[i], x2[i])
@inline _map_i(f, i, x1, x2, x3) = f(x1[i], x2[i], x3[i])
@inline _map_i(f, i, x1, x2, x3, x4, args...) = f(x1[i], x2[i], x3[i], x4[i], getindex.(args, i)...)
n1, n2 = 1000,1000
x1 = randn(n1,n2)
x2 = randn(n1,n2)
x3 = randn(n1,n2)
y = similar(x1, n1*2, n2)
@btime apply0!($k1, $y, $x1, $x2, $x3) # 110.086 ms (6999746 allocations: 122.07 MiB)
@btime apply1!($k1, $y, $x1, $x2, $x3) # 3.254 ms (0 allocations: 0 bytes)
@btime apply2!($k1, $y, $x1, $x2, $x3) # 3.209 ms (0 allocations: 0 bytes)
@btime apply3!($k1, $y, $x1, $x2, $x3) # 731.212 ms (12999234 allocations: 244.13 MiB)
@btime apply4!($k1, $y, $x1, $x2, $x3) # 230.310 ms (4998724 allocations: 76.27 MiB)
```
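
For comparison, a generated function can unroll the per-argument indexing at compile time, so the call to `f` sees a fixed number of scalar arguments. The following is only a minimal sketch: the name `apply_gen!` is hypothetical, it is not part of the benchmark above, and it has not been timed here.

```julia
# Hypothetical name; not part of the benchmark above.
@generated function apply_gen!(f, y, args...)
    # Inside the generator, `args` is a tuple of types, so its length is known.
    n = length(args)
    # Build the expressions args[1][i], args[2][i], ..., args[n][i].
    calls = [:(args[$j][i]) for j in 1:n]
    quote
        for i in eachindex(args[1])
            # Same output indexing as the apply*! functions above.
            y[i*2 - i%2] = f($(calls...))
        end
        return y
    end
end
```

It would be called in the same way as the other variants, e.g. `apply_gen!(k1, y, x1, x2, x3)`.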

I can use a generated function for this specific use case for now, but I think this performance penalty is quite surprising to newcomers. Generated functions also have other drawbacks (they do not work with Revise.jl, may fail with PackageCompiler, increase load time, etc.). So here are my questions:

- Is there a way to avoid the performance penalty without using generated functions?
- Is there a mental model that can help one predict the performance of the generated Julia code? In the above example all `apply#!` versions are type stable, and they look almost identical under `@code_warntype`. However, they have vastly different performance. In particular, `apply0!`, `apply3!`, and `apply4!` are all slow, but what makes one even slower than the others?
- As a newbie I often find it hard to write code that is BOTH generic and performant without resorting to metaprogramming. What tricks may I be missing? To name a few that I more or less know about: Holy traits and the recursive-tail pattern.
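
To make the last point concrete, here is a rough sketch of how the recursive-tail pattern could be applied to this problem. The names `ith` and `apply_tail!` are hypothetical. The `Vararg{Any,N}` signature is an assumption based on the Julia manual's performance tip about forcing specialization on varargs. Whether either change removes the allocations seen in the benchmark has not been measured here.

```julia
# Hypothetical helper (not in Base): index every array in the tuple at
# position i, peeling off one element of the tuple per recursive call.
@inline ith(i, ::Tuple{}) = ()
@inline ith(i, t::Tuple) = (t[1][i], ith(i, Base.tail(t))...)

# `Vararg{Any,N}` together with `where {N}` asks the compiler to specialize
# on the number of trailing arguments (assumption: this is relevant here).
function apply_tail!(f, y, args::Vararg{Any,N}) where {N}
    for i in eachindex(args[1])
        # Same output indexing as the apply*! functions above.
        y[i*2 - i%2] = f(ith(i, args)...)
    end
    return y
end
```

A call such as `apply_tail!(k1, y, x1, x2, x3)` could then be timed with `@btime` next to the other variants.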