Hi, I’m in a situation where I have to assemble an overall function from many other functions. I’m using code generation to create the overall function, so it does not really matter whether the implementation is “nice and short to code”, but it should be as performant as possible. On the other hand, the functions of which the overall function is composed may be written by different people with various skills/experience.

I found quite some interesting things while trying to optimise the execution speed… (everything below is run with `-O3`

). Here’s some preparation code required:

```
using BenchmarkTools
const u = collect(1.0:1.0:5)
const u_tpl = tuple(u...)
```

A simple example, as someone new to julia would probably implement it (using plain arrays):

```
inner1(u) = [ 2*u[1], 3*u[2], u[3]^2 ]
inner2(u) = [ sqrt(u[1]), sum(u) ]
function outer(u_out)
y_in1 = inner1(u_out[1:3])
y_in2 = inner2([y_in1[1:2]; u_out[4]])
return [y_in1; y_in2]
end
@show outer(u) # outer(u) = [2.0, 6.0, 9.0, 1.41421, 12.0]
@code_warntype outer(u) # OK, 119 instructions
@btime outer(u) # 2.318 μs (26 allocations: 1.45 KiB)
```

Now compare this to exclusively using tuples:

```
inner1(u) = ( 2*u[1], 3*u[2], u[3]^2 )
inner2(u) = ( sqrt(u[1]), sum(u) )
function outer(u_out)
y_in1 = inner1((u_out[1], u_out[2], u_out[3])) # note: u_out[1:3] destroys performance
y_in2 = inner2((y_in1[1], y_in1[2], u_out[4]))
return (y_in1..., y_in2...)
end
@show outer(u_tpl) # outer(u_tpl) = (2.0, 6.0, 9.0, 1.4142135623730951, 12.0)
@code_warntype outer(u_tpl) # OK, 17 instructions
@btime outer(u_tpl) # 1.340 ns (0 allocations: 0 bytes)
```

That’s a speedup of >1700! Now, the allocations of the first solution obviously are a killer. So let’s try with pre-allocation and in-place functions:

```
const yvec = zeros(5)
function inner1!(y, u)
y[1] = 2*u[1]
y[2] = 3*u[2]
y[3] = u[3]^2
return nothing
end
function inner2!(y, u)
y[1] = sqrt(u[1])
y[2] = sum(u)
return nothing
end
function outer!(y, u_out)
inner1!(view(y, 1:3), (u_out[1], u_out[2], u_out[3])) # note: using view(u_out, 1:3) --> 21.8 ns
inner2!(view(y, 4:5), (y[1], y[2], u_out[4])) # note: using yvec[1:2]... --> 386.9 ns (type instability)
return nothing
end
outer!(yvec, u)
@show yvec # yvec = [2.0, 6.0, 9.0, 1.41421, 12.0]
@code_warntype outer!(yvec, u) # OK, 155 instructions
@btime outer!(yvec, u) # 5.822 ns (0 allocations: 0 bytes)
# note: using u_tpl here reduces runtime to 4.240 ns
```

Close, but I still have to use the tuples created from scalars as inputs to the individual functions, otherwise the performance drops drastically. Now, that’s not that bad since those values should not be changed from within the functions anyways. But assuming someone wants to perform vector/matrix operations, or even has multidimensional data to pass around…?

I tried pre-allocating arrays for inputs to inner functions;

```
const inner1_uvec = zeros(3)
const inner2_uvec = zeros(3)
```

then using

```
function outer!(y, u_out)
@views inner1_uvec[1:3] = u_out[1:3]
inner1!(view(y, 1:3), inner1_uvec)
@views inner2_uvec[1:2] = y[1:2]
@views inner2_uvec[3] = u[4]
inner2!(view(y, 4:5), inner2_uvec)
return nothing
end
```

which leads to `38.193 ns (4 allocations: 192 bytes)`

.

Any thoughts on this in general, and advice on how to specifically approach the problem at hand? Maybe there’s some good practices I don’t know of yet. Could `StaticArrays`

help?