Hi, I’m in a situation where I have to assemble an overall function from many other functions. I’m using code generation to create the overall function, so it does not really matter whether the implementation is “nice and short to code”, but it should be as performant as possible. On the other hand, the functions of which the overall function is composed may be written by different people with various skills/experience.
I found quite a few interesting things while trying to optimise the execution speed… (everything below is run with `-O3`). Here’s the preparation code required:
```julia
using BenchmarkTools

const u = collect(1.0:1.0:5)
const u_tpl = tuple(u...)
```
A simple example, as someone new to Julia would probably implement it (using plain arrays):
```julia
inner1(u) = [ 2*u[1], 3*u[2], u[3]^2 ]
inner2(u) = [ sqrt(u[1]), sum(u) ]

function outer(u_out)
    y_in1 = inner1(u_out[1:3])
    y_in2 = inner2([y_in1[1:2]; u_out[4]])
    return [y_in1; y_in2]
end

@show outer(u)          # outer(u) = [2.0, 6.0, 9.0, 1.41421, 12.0]
@code_warntype outer(u) # OK, 119 instructions
@btime outer(u)         # 2.318 μs (26 allocations: 1.45 KiB)
```
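A big part of those 2.3 μs is visible even in isolation: every slice such as `u_out[1:3]` copies its elements into a freshly allocated `Vector`. A minimal check (my own illustration, not part of the benchmark above):

```julia
const v = collect(1.0:1.0:5)

g(x) = x[1:3]           # slicing copies into a new Vector
g(v)                    # run once so compilation is not measured
@show @allocated g(v)   # > 0: every call heap-allocates the copy

w = g(v)
w[1] = 99.0             # the copy is independent of v...
@show v[1]              # ...so v is unchanged
```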
Now compare this to exclusively using tuples:
```julia
inner1(u) = ( 2*u[1], 3*u[2], u[3]^2 )
inner2(u) = ( sqrt(u[1]), sum(u) )

function outer(u_out)
    y_in1 = inner1((u_out[1], u_out[2], u_out[3])) # note: u_out[1:3] destroys performance
    y_in2 = inner2((y_in1[1], y_in1[2], u_out[4]))
    return (y_in1..., y_in2...)
end

@show outer(u_tpl)          # outer(u_tpl) = (2.0, 6.0, 9.0, 1.4142135623730951, 12.0)
@code_warntype outer(u_tpl) # OK, 17 instructions
@btime outer(u_tpl)         # 1.340 ns (0 allocations: 0 bytes)
```
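Two sanity checks on the tuple version that I found useful (my additions, not part of the original benchmark): `@inferred` from the `Test` stdlib throws if the return type cannot be concretely inferred, and a sub-nanosecond `@btime` usually means the compiler constant-folded the whole call, so it no longer measures real work:

```julia
using Test # for @inferred

inner1(u) = ( 2*u[1], 3*u[2], u[3]^2 )
inner2(u) = ( sqrt(u[1]), sum(u) )

function outer(u_out)
    y_in1 = inner1((u_out[1], u_out[2], u_out[3]))
    y_in2 = inner2((y_in1[1], y_in1[2], u_out[4]))
    return (y_in1..., y_in2...)
end

u_tpl = (1.0, 2.0, 3.0, 4.0, 5.0)
y = @inferred outer(u_tpl)   # throws if type inference fails
@show typeof(y)              # NTuple{5, Float64}
```

With BenchmarkTools one would then time `@btime outer($(Ref(u_tpl))[])` — hiding the argument in a `Ref` defeats constant propagation and gives a more honest number.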
That’s a speedup of more than 1700×! The allocations in the first version are obviously the killer, so let’s try pre-allocation and in-place functions:
```julia
const yvec = zeros(5)

function inner1!(y, u)
    y[1] = 2*u[1]
    y[2] = 3*u[2]
    y[3] = u[3]^2
    return nothing
end

function inner2!(y, u)
    y[1] = sqrt(u[1])
    y[2] = sum(u)
    return nothing
end

function outer!(y, u_out)
    inner1!(view(y, 1:3), (u_out[1], u_out[2], u_out[3])) # note: using view(u_out, 1:3) --> 21.8 ns
    inner2!(view(y, 4:5), (y[1], y[2], u_out[4]))         # note: using yvec[1:2]... --> 386.9 ns (type instability)
    return nothing
end

outer!(yvec, u)
@show yvec                     # yvec = [2.0, 6.0, 9.0, 1.41421, 12.0]
@code_warntype outer!(yvec, u) # OK, 155 instructions
@btime outer!(yvec, u)         # 5.822 ns (0 allocations: 0 bytes)
# note: using u_tpl here reduces runtime to 4.240 ns
```
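The 386.9 ns note deserves a closer look. My understanding: splatting a `Vector` slice like `yvec[1:2]...` produces a tuple whose length is only known at run time, so the call is dynamically dispatched, whereas indexing element-by-element yields a concrete `NTuple`. A small illustration (hypothetical helper names, not from the benchmark above):

```julia
takes3(t) = t[1] + t[2] + t[3]

bad(y)  = takes3((y[1:2]..., 4.0))  # tuple length unknown at compile time
good(y) = takes3((y[1], y[2], 4.0)) # concrete NTuple{3, Float64}

const yv = [2.0, 6.0]
@show bad(yv) == good(yv)   # same result, very different compiled code
bad(yv); good(yv)           # warm up
@show @allocated bad(yv)    # > 0 (the slice, plus dynamic dispatch)
@show @allocated good(yv)   # 0
```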
Close, but I still have to pass tuples built from scalars into the individual functions, otherwise performance drops drastically. That’s not too bad, since those values shouldn’t be modified inside the functions anyway. But what if someone wants to perform vector/matrix operations, or even has multidimensional data to pass around…?
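For small fixed-size matrices, tuples can still work if one is willing to nest them, though it gets verbose quickly. A sketch (hypothetical names, nothing from the benchmark above) of a 2×2 matrix-vector product that stays isbits and allocation-free:

```julia
# A 2x2 matrix stored as a tuple of row tuples.
matvec2(A, x) = (A[1][1]*x[1] + A[1][2]*x[2],
                 A[2][1]*x[1] + A[2][2]*x[2])

A = ((1.0, 2.0),
     (3.0, 4.0))
x = (5.0, 6.0)
@show matvec2(A, x)   # (17.0, 39.0)
```

Beyond toy sizes this is essentially what StaticArrays.jl packages up, so that may be worth a look.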
I tried pre-allocating arrays for the inputs to the inner functions:
```julia
const inner1_uvec = zeros(3)
const inner2_uvec = zeros(3)

function outer!(y, u_out)
    @views inner1_uvec[1:3] = u_out[1:3]
    inner1!(view(y, 1:3), inner1_uvec)
    @views inner2_uvec[1:2] = y[1:2]
    @views inner2_uvec[3] = u_out[4]
    inner2!(view(y, 4:5), inner2_uvec)
    return nothing
end
```
which leads to

```
38.193 ns (4 allocations: 192 bytes)
```
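For the copies into the preallocated buffers themselves, broadcasting with `.=` from a view should be allocation-free on recent Julia versions (a sketch of what I mean; I have not traced where the remaining 4 allocations actually come from):

```julia
const buf = zeros(3)
const uu  = collect(1.0:1.0:5)

fill_buf!(b, u) = (b .= view(u, 1:3); nothing)  # broadcast copy, no slice

fill_buf!(buf, uu)                    # warm up / fill
@show buf                             # [1.0, 2.0, 3.0]
@show @allocated fill_buf!(buf, uu)   # expected 0 on recent Julia
```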
Any thoughts on this in general, and advice on how to specifically approach the problem at hand? Maybe there are some good practices I don’t know of yet. Could