How to speed up a generic function evaluating multiple array calculations

This is maybe more of a software design question than a performance one, or perhaps better suited to the Domains-Optimization category. I also don't know whether an available package like JuMP or Flux could already solve this for me out of the box; even so, it would be nice to understand how I might approach solving this problem from scratch.

I am working with large-scale nonlinear optimization problems (hundreds to tens of thousands of variables). Using a solver like https://github.com/matthieugomez/LeastSquaresOptim.jl should work just fine for me. The challenge is writing down the function that calculates the objective's residual terms and (sparse) Jacobian. My objective functions have multiple blocks in which different residual functions are calculated from different inputs. In this example there is a fitting-error term that depends on two arrays, and a regularization term that depends on successive differences of just one array.
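Concretely, the residual vector computed in the code below has 2N - 1 entries:

out[i]     = ina[i] - inb[i]             for i = 1, ..., N      (fitting error)
out[N + i] = a * (ina[i+1] - ina[i])     for i = 1, ..., N-1    (regularization)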

I can write my objective function like concretefun below, and it works just fine: it runs in 288 ns on my machine. Now I would like to write this function in a modular way, where each residual function can be defined independently, separate from the final function that actually fills the whole array. abstractfun below does that, but the performance is significantly worse (70 µs, about 250x slower).

Using closures pretty much solves the problem for me: it avoids having to write down one big function that takes every dependency of the inner functions as a parameter. Could I speed this up? Or should I just take a completely different approach?

using BenchmarkTools
using FastClosures

function concretefun(N::Int, out::Array{Float64}, a::Float64, ina::Array{Float64}, inb::Array{Float64})
    @inbounds begin

    # Fitting-error residuals
    for i in 1:N
        out[i] = (ina[i] - inb[i])
    end

    # Regularization residuals on successive differences of ina
    for i in 1:N-1
        out[i + N] = a * (ina[i+1] - ina[i])
    end

    end
end

function abstractfun(out::Array{Float64}, funcs)
    @inbounds begin
        offset = 0
        # Each entry of funcs is a (range, fun) pair; fun(i) fills out[offset + i].
        for (range, fun) in funcs
            for i in 1:range
                out[i + offset] = fun(i)
            end
            offset += range
        end
    end
end

N=1024
out1 = zeros(N*2-1)
out2 = zeros(N*2-1)
ina = rand(N)
inb = rand(N)
a = 3.33


f1 = @closure (i) -> ina[i] - inb[i]
f2 = @closure (i) -> a * (ina[i+1] - ina[i])

funcs = [(N, f1), (N-1, f2)]

concretefun(N, out1, a, ina, inb)
abstractfun(out2, funcs)
@assert out1 == out2

@btime concretefun(N, out1, a, ina, inb)
@btime abstractfun(out2, funcs)

I’m not at a computer, so I can’t test it, but you should be able to get good performance by using a tuple of functions instead of an array (perhaps also with https://github.com/cstjean/Unrolled.jl ). The tuple will allow the compiler to specialize on each function and even inline it if necessary.
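For concreteness, here is a minimal sketch of that suggestion (untested, like the reply above; funcs_tuple is just an illustrative name):

# A Tuple encodes each element's concrete type in its own type, so the
# compiler can specialize abstractfun's inner call on each closure and
# inline it; a Vector erases the closures to an abstract element type.
funcs_tuple = ((N, f1), (N - 1, f2))
abstractfun(out2, funcs_tuple)

Note that a plain for loop over a heterogeneous tuple may still not specialize fully, which is exactly what the unrolling discussed in the next reply addresses.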


One source of the performance loss is the missing @inbounds inside the closures. You can do:

@inline f1_impl(i, ina, inb) = @inbounds ina[i] - inb[i]
@inline f2_impl(i, a, ina, inb) = @inbounds a * (ina[i+1] - ina[i])

f1 = @closure (i) -> f1_impl(i, ina, inb)
f2 = @closure (i) -> f2_impl(i, a, ina, inb)

As @rdeits mentioned, you need to “unroll” the type-heterogeneous loop. Actually, you don’t need any package for this, because foldl is great:

function abstractfun2(out::Array{Float64}, funcs)
    foldl(funcs; init=0) do offset, (range, fun)
        @inbounds for i in 1:range
            out[i + offset] = fun(i)
        end
        offset + range
    end
end

But still, there is one allocation because Base.foldl is not @inlined. Fortunately, it’s super easy to fix:

# Dispatch on the argument count resolves this recursion at compile time,
# so the fold unrolls; @inline lets each step inline into the caller.
@inline foldlargs(op, x) = x
@inline foldlargs(op, x1, x2, xs...) = foldlargs(op, op(x1, x2), xs...)

function abstractfun3(out::Array{Float64}, funcs)
    foldlargs(0, funcs...) do offset, (range, fun)
        @inbounds for i in 1:range
            out[i + offset] = fun(i)
        end
        offset + range
    end
end
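To see why this works: foldlargs recurses on its argument list, so for a fixed set of functions the entire fold is unrolled and specialized during compilation. An illustrative example:

# foldlargs(+, 0, 1, 2, 3) expands step by step as
#   foldlargs(+, 0 + 1, 2, 3) -> foldlargs(+, 1 + 2, 3) -> foldlargs(+, 3 + 3) -> 6
# so no type-heterogeneous container is ever iterated at run time.
foldlargs(+, 0, 1, 2, 3)  # == 6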

Results:

@btime concretefun($N, $out1, $a, $ina, $inb)
  253.876 ns (0 allocations: 0 bytes)
@btime abstractfun($out2, $funcs)
  64.948 μs (4608 allocations: 72.00 KiB)
@btime abstractfun2($out2, $funcs)
  245.046 ns (1 allocation: 16 bytes)
@btime abstractfun3($out2, $funcs)
  216.239 ns (0 allocations: 0 bytes)
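Note: to reproduce the allocation-free abstractfun3 timing, funcs presumably has to be the tuple version from @rdeits’s suggestion rather than the original Vector, since splatting a Vector is not type-stable:

funcs = ((N, f1), (N - 1, f2))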

This looks great, thanks a lot @rdeits and @tkf!