How to speed up a generic function evaluating multiple array calculations

This is maybe more of a software design question than performance, or perhaps more adequate to Domains-Optimization category. I also don’t know if any available packages like JuMP or Flux could solve this for me already out-of-the-box, it would still be nice to understand how I might approach solving this problem from scratch.

I am working with large scale non-linear optimization problems (100s-10.000s of variables). Using a solver like should work just fine for me. The challenge is writing down the function to calculate the objective function residue terms and (sparse) Jacobian. My objective functions have multiple blocks where different residue functions are calculated with different inputs taken into account. In this example we have a fitting error term that depends on two arrays, and a regularization term that depends on successive differences of just one array.

I can write my objective function like concretefun, and it works just fine. It runs in 288ns in my machine. Now I would like to be able to write this function in a modular way, where each different function can be defined independently, separately from the final function that actually calculates the whole array. abstractfun here does that, although the performance is significantly lower (70us, 250x)

Using closures should pretty much solve the problem for me. It avoids having to write down this big function that takes as parameter every dependency of the other functions. Could I speed this up? Or should I just take a completely different approach?

using BenchmarkTools
using FastClosures

function concretefun(N::Int, out::Array{Float64}, a::Float64, ina::Array{Float64}, inb::Array{Float64})
    @inbounds begin

    for i in 1:N
        out[i] = (ina[i] - inb[i])

    for i in 1:N-1
        out[i + N] = a * (ina[i+1]-ina[i])


function abstractfun(out::Array{Float64}, funcs)
    @inbounds begin
        offset = 0
        for (range, fun) in funcs
            for i in 1:range
                out[i + offset] = fun(i)
            offset += range

out1 = zeros(N*2-1)
out2 = zeros(N*2-1)
ina = rand(N)
inb = rand(N)
a = 3.33

# f1(i) = a * ina[i] * inb[i]
# f2(i) = (a - 1.0)*(ina[i]-ina[i+1]) / (inb[i]-inb[i+1])

f1 = @closure (i) -> ina[i] - inb[i]
f2 = @closure (i) -> a * (ina[i+1] - ina[i])

funcs = [(N, f1), (N-1, f2)]

concretefun(N, out1, a, ina, inb)
abstractfun(out2, funcs)
@assert out1 == out2

@btime concretefun(N, out1, a, ina, inb)
@btime abstractfun(out2, funcs)

I’m not at a computer, so I can’t test it, but you should be able to get good performance by using a tuple of functions instead of an array (perhaps also with ). The tuple will allow the compiler to specialize on each function and even inline it if necessary.

1 Like

One of the performance loss is from missing inbounds. You can do:

@inline f1_impl(i, ina, inb) = @inbounds ina[i] - inb[i]
@inline f2_impl(i, a, ina, inb) = @inbounds a * (ina[i+1] - ina[i])

f1 = @closure (i) -> f1_impl(i, ina, inb)
f2 = @closure (i) -> f2_impl(i, a, ina, inb)

As @rdeits mentioned, you need to “unroll” type-heterogeneous loop. Actually, you don’t need any package for this because foldl is great:

function abstractfun2(out::Array{Float64}, funcs)
    foldl(funcs; init=0) do offset, (range, fun)
        @inbounds for i in 1:range
            out[i + offset] = fun(i)
        offset + range

But still, there is one allocation because Base.foldl is not @inlined. Fortunately, it’s super easy to fix:

@inline foldlargs(op, x) = x
@inline foldlargs(op, x1, x2, xs...) = foldlargs(op, op(x1, x2), xs...)

function abstractfun3(out::Array{Float64}, funcs)
    foldlargs(0, funcs...) do offset, (range, fun)
        @inbounds for i in 1:range
            out[i + offset] = fun(i)
        offset + range


@btime concretefun($N, $out1, $a, $ina, $inb)
  253.876 ns (0 allocations: 0 bytes)
@btime abstractfun($out2, $funcs)
  64.948 μs (4608 allocations: 72.00 KiB)
@btime abstractfun2($out2, $funcs)
  245.046 ns (1 allocation: 16 bytes)
@btime abstractfun3($out2, $funcs)
  216.239 ns (0 allocations: 0 bytes)

This looks great, thanks a lot @rdeits and @tkf!