Improve performance of function that produces and hcats SMatrices

question
performance

#1

I have two functions, which for our MWE could be the following:

using BenchmarkTools, StaticArrays
@inline function f(x, p, n)
    @inbounds x1, x2, x3 = x[1], x[2], x[3]
    SVector( 3.8*x1*(1-x1) - 0.05*(x2+0.35)*(1-2*x3),
    0.1*( (x2+0.35)*(1-2*x3) - 1 )*(1 - 1.9*x1),
    3.78*x3*(1-x3)+0.2*x2 )
end
@inline function jac(x, p, n)
    @SMatrix [3.8*(1 - 2x[1]) -0.05*(1-2x[3]) 0.1*(x[2] + 0.35);
    -0.19((x[2] + 0.35)*(1-2x[3]) - 1)  0.1*(1-2x[3])*(1-1.9x[1])  -0.2*(x[2] + 0.35)*(1-1.9x[1]);
    0.0  0.2  3.78(1-2x[3]) ]
end

I don’t think the actual functions matter, as long as the first returns SVector and the latter returns SMatrix.
Let’s do timings:

p = nothing
s = rand(SVector{3}); Q = rand(SMatrix{3,3})
@btime $f($s, $p, 0)
@btime $jac($s, $p, 0)
  6.158 ns (0 allocations: 0 bytes)
  5.337 ns (0 allocations: 0 bytes)

Now, I create a new function, based on the above:

        ws_index = SVector{3, Int}(2:4...)
        tangentf = (u, p, t) -> begin
            du = f(u[:, 1], p, t)
            J = jac(u[:, 1], p, t)
            dW = J*u[:, ws_index]
            return hcat(du, dW)
        end

and benchmark to see:

S = hcat(s, Q)
@btime $(tangentf)($S, $p, 0)

  109.685 ns (6 allocations: 496 bytes)
3Ă—4 StaticArrays.SArray{Tuple{3,4},Float64,2,12}:
 0.567631  -1.8819    -1.73773   -1.12878 
 0.102564   0.353318   0.263594   0.288192
 0.656533  -1.68341   -0.506125  -1.94484 

There are allocations done, and the timing is much larger than the timing of applying the 2 individual functions and performing an hcat (which has miniscule timing).

So, how can I improve this and most importantly: what am I doing wrong?


#2

Don’t use global variables when benchmarking.


#3

Thanks!

Doing instead

tangentf = let ws_index = SVector{3, Int}(2:4...)
        tangentf = (u, p, t) -> begin
            du = f(u[:, 1], p, t)
            J = jac(u[:, 1], p, t)
            dW = J*u[:, ws_index]
            return hcat(du, dW)
        end
      end

S = hcat(s, Q)
@btime $(tangentf)($S, $p, 0)

  36.128 ns (0 allocations: 0 bytes)
3Ă—4 StaticArrays.SArray{Tuple{3,4},Float64,2,12}:
  0.872362   1.0066    0.474349   0.599955 
 -0.0237808  0.143415  0.0510413  0.0682469
  0.742914   0.813412  1.08132    1.91705  

Gives super massive improvements. Still it is not close to the individual function evaluations… Why is that?


#4

Using a dedicated struct gave even more performance boost:

struct TangentOOP{F, JAC, k}
    f::F
    jacobian::JAC
    ws::SVector{k, Int}
end
@inline function (tan::TangentOOP)(u, p, t)
    du = tan.f(u[:, 1], p, t)
    J = tan.jacobian(u[:, 1], p, t)
    dW = J*u[:, tan.ws]
    return hcat(du, dW)
end

benchmarking the later with

tan = TangentOOP(f, jac, SVector{3, Int}(2:4...))

@btime ($tan)($S, $p, 0);

  22.169 ns (0 allocations: 0 bytes)