Array of functions - is there a way to avoid allocations performance penalty?

donboyd5 · May 22, 2019, 11:29am

Edited version of my initial post based upon the response from @Raf.

After I wrote this post (1) I got advice on how to write a better post. I have taken that advice below, and (2) I actually found the answer to my post by looking at Understanding the allocation behavior on arrays vs scalars - #2 by stevengj . Thus, I am editing the post to include the answer, in case others have the same question.

I want to maximize performance when I call functions that are array elements. My underlying motivation is that I plan to write income tax simulation models (tax calculators) for tax systems in ~41 states. The models will run on microdata files that represent taxpayers in the states. I want the models to be as parameter-driven as possible, where functions that do the necessary tax calculations are stored in arrays. The actual functions, the order in which they are processed, and their arguments, will vary from state to state, and from policy scenario (potential tax law) to policy scenario, defined in json files. It is a variant of and inspired by work being done here.

I was concerned because when I tested the idea of putting functions into arrays, performance degraded substantially. I have since discovered from the post linked above that this problem can be solved by using StaticArrays.

The remainder of this post shows the problem, and the solution.

The problem
The example below first defines consts for two functions (sin, cos) separately, and then defines a const array of these two functions. Then I call each of the two functions 10^6 times, two different ways: (1) calling the consts, and (2) calling the array elements. (I have constructed the example this way to isolate the issue, so that we can see - I believe - that it results solely from using an array of functions.)

The first approach has 13.5k memory allocations on the first run and only 6 allocations on the second run (6 allocations for @time I believe).

The second approach (array elements) has 6 M allocations on each call (and takes far more time when scaled up) and so is much less efficient. Because this is an approach I would like to be able to use, my initial question was, is there a way to make it more efficient?

const fvar1 = sin
const fvar2 = cos
const fvec = [sin, cos]

function fun_vars(n::Int64)
    t = 0.0
    for i = 1:n
        t += fvar1(i)
        t += fvar2(i)
    end
    return t
end

function fun_vec(n::Int64)
    t = 0.0
    for i = 1:n
        t += fvec[1](i)
        t += fvec[2](i)
    end
    return t
end

@time fun_vars(10^6) # 13.5k allocations
@time fun_vars(10^6) # 6 allocations on 2nd time

@time fun_vec(10^6) # 6 M allocations
@time fun_vec(10^6) # 6 M allocations on 2nd time

The solution
It turns out there is a way to make this more efficient. A solution - perhaps the best solution, I do not know - is to use StaticArrays, as shown below.

# Pkg.add("StaticArrays") 
using StaticArrays

const fvec2 = SVector(sin, cos)

function fun_vec2(n::Int64)
    t = 0.0
    for i = 1:n
        t += fvec2[1](i)
        t += fvec2[2](i)
    end
    return t
end

@time fun_vec2(10^6) # 16.6k allocations
@time fun_vec2(10^6) # 6 allocations on 2nd time

I hope this is helpful to someone.

Raf · May 22, 2019, 11:36am

Welcome!

Please put your code in a code block so it’s well formatted - so we can read it.

With a question like this, you could also describe what you are actually trying to do: the wider purpose of this code. Helping you make a better array of functions is not really helping you in the long term - you probably should be doing something else entirely. A little background will help people work out what that something is.

Raf · May 22, 2019, 12:13pm

Static arrays could be better, because they are really tuples underneath. You could instead just use a tuple.

Personally I would use a tuple of structs that are all passed to the same function, but they have their own methods defined for that function. That means you can attach additional parameters for each function if you need them, making it a lot more extensible. But it could also be overkill.

donboyd5 · May 22, 2019, 12:16pm

Many thanks. I will learn about this.

chakravala · May 22, 2019, 1:29pm

Currently, I’m working on a similar issue as @donboyd5 where I need an array of functions available.

In my case, I will be having anywhere from 1 to 64 functions in the array. In many cases, I will only need a few specific values from the whole array and sometimes I will need to calculate from all the functions in the array simultaneously.

For 1-64 functions in an array, with the need for computing individual values in some cases, and all values in other cases, would it be better to make a single function call, or individual calls? Hmm

Raf · May 22, 2019, 1:37pm

I don’t mean a single function call, but calling some run_it function for each struct in the tuple, triggering the appropriate method that matches the struct. The advantage is having self contained units that can hold additional parameters or preallocated arrays where required (if you need to run them millions of times).

I do this to build add-hoc composite models in my Celluar.jl and Dispersal.jl , GrowthRates.jl packages. But they are short (max 10 in a chain), and many need parameters - so I use a chain of structs that are passed to the same function in sequence. YMMV

chakravala · May 22, 2019, 1:38pm

Could you provide a minimal example?

kristoffer.carlsson · May 22, 2019, 1:42pm

GitHub - yuyichao/FunctionWrappers.jl was created for this purpose (see [RFC] Type stable wrapper for arbitrary callables · Issue #13984 · JuliaLang/julia · GitHub). I am however not sure how well it works on the latest julia release.

baggepinnen · May 22, 2019, 1:54pm

It seems you are not using the fact that the vector is a vector. A tuple should work equally well for your example?

Raf · May 22, 2019, 1:54pm

This is only worth doing if your functions are suitably complicated, and need parameters or preallocated arrays occasionally, which is a lot of the time for me. Eg. one such function needs a precalculated dispersal kernel attached, and identical sized array for doing blas multiplications.

# Interface
function dosomething end

# Functions
struct Square end
dosomething(t::Square, x) = x^2

struct ParameterMult{P}
    param::P
end
dosomething(t::ParameterMult, x) = t.param * x

runsim(xs, otherdata) = 
       # apply recursivly (ie type stable for a tuple) with list of structs and data, 
       # but basically like:
       dosomthing.(xs, Ref(otherdata))
end

runsim((Square(), ParameterMult(2)))

ExpandingMan · May 22, 2019, 2:23pm

I just wanted to take another opportunity to plug FunctionWrappers being in the stdlib or even Base. There aren’t that many things that you just flat out can’t do in Julia, but efficient containerized functions is one of them without FunctionWrappers. Additionally, FunctionWrappers is a small amount of code, but very few people have the knowledge to implement it.

I wish I could offer help and just put in a PR for this, but it’s way over my head.

Raf · May 22, 2019, 2:31pm

Also I forgot to add, if you want this to be type stable and as fast as possible, you might need to use recursion instead of a for loop. You are looping over a tuple/SVector of different types, so it wont compile the loop to be type-stable. Unless its only a few different functions, in which case small Union optimisations might help.

Unless you are actually calling every function in the tuple manually by index as in the example, that could be type stable too.

chakravala · May 22, 2019, 3:03pm

Could someone explain how exactly FunctionWrappers works? Is FunctionWrapper{A,Tuple{B...}} where {A,B,...} supposed to specify the type of the inputs and outputs with A,B,...?

ChrisRackauckas · May 22, 2019, 3:12pm

Yes, it’s a type parameterized by the input output types, so that way it can directly do the call in a way that’s type stable, but directly with that pointer and making it use that specific method (I’m not sure it can do dispatch). We’ve demonstrated that it is essentially no overhead on DiffEq

github.com/SciML/DiffEqBase.jl

No recompile ODE setup

SciML:master ← SciML:norecompile

opened 02:30AM - 14 Apr 19 UTC

ChrisRackauckas

+98 -43

```julia using OrdinaryDiffEq # First time function lorenz(du,u,p,t) …du[1] = 10.0(u[2]-u[1]) du[2] = u[1]*(28.0-u[3]) - u[2] du[3] = u[1]*u[2] - (8/3)*u[3] end u0 = [1.0;0.0;0.0] tspan = (0.0,100.0) prob = ODEProblem{true,false}(lorenz,u0,tspan) @time sol = solve(prob,Tsit5()) #6.820159 seconds (16.06 M allocations: 800.508 MiB, 8.44% gc time) # Enable No-Recompile Mode function lorenz2(du,u,p,t) du[1] = 10.0(u[2]-u[1]) du[2] = u[1]*(28.0-u[3]) - u[2] du[3] = u[1]*u[2] - (8/3)*u[3] end prob = ODEProblem{true,false}(lorenz2,u0,tspan) @time sol = solve(prob,Tsit5()) #0.004248 seconds (12.96 k allocations: 1.375 MiB) using BenchmarkTools @btime sol = solve(prob,Tsit5()) # 1.146 ms (12961 allocations: 1.37 MiB) # Disable No-Recompile Mode function lorenz3(du,u,p,t) du[1] = 10.0(u[2]-u[1]) du[2] = u[1]*(28.0-u[3]) - u[2] du[3] = u[1]*u[2] - (8/3)*u[3] end prob = ODEProblem{true}(lorenz3,u0,tspan) @time sol = solve(prob,Tsit5()) #1.839436 seconds (3.97 M allocations: 190.402 MiB, 4.78% gc time) @btime sol = solve(prob,Tsit5()) # 1.078 ms (12963 allocations: 1.37 MiB) ``` @YingboMa . Note that this isn't safe to use without `remake` making sure types in the FunctionWrapper change whenever changing the types, which is important when doing things like AD. But this is what would be required for static compilation to be done completely.

but of course has some interesting constraints that you have to deal with.

chakravala · May 22, 2019, 3:37pm

In my case, it is needed for encoding the metric at a tangent space of a Riemannian manifold in differential geometry

The Hodge star requires evaluating a volume element of a submanifold using the metric at a particular point, which is currently considered constant and will be generalized to a FunctionWrapper next.

chakravala · May 22, 2019, 4:04pm

Is it possible to access the FunctionWrapper with getindex also? In my case, indexing is needed.

donboyd5 · May 22, 2019, 8:08pm

Thank you all for comments that were fast and valuable.

I realize, belatedly, that the “solution” I posted did not fully solve my problem, so here is an update that uses what I understand of what’s been written above and isolates my problem more precisely.

As two responders noted, all I need is a tuple of functions in my case, not an array.

However, when I call the functions by referencing each individual tuple member by its index, I allocate essentially no memory, but when I loop through the members, I allocate a lot of memory. The example below illustrates this.

I would like the benefits of looping through members (not needing to know the number of members, and not needing to write each function call out separately) but I would like the memory-allocation and speed benefits of not looping (i.e., essentially no memory allocation). I would much appreciate knowing whether it is possible to loop through the tuple of functions without the memory allocation penalty you see below. Thank you.


function test_elements(n::Int64)
    funs = (sin, cos)
    t = 0.0
    for i = 1:n
        t += funs[1](i)
        t += funs[2](i)
    end
    return t
end

function test_loop(n::Int64)
    funs = (sin, cos)
    t = 0.0
    for i = 1:n
        for f in funs
            t += f(i)
        end
    end
    return t
end

@time test_elements(10^6) # 14.8k allocations
@time test_elements(10^6) # 6 allocations

@time test_loop(10^6) # 7 M allocations
@time test_loop(10^6) # 7 M allocations

JeffreySarnoff · May 22, 2019, 8:26pm

Yes – getindex access works with “wrapped functions” in an array or a tuple.

foobar_lv2 · May 22, 2019, 9:29pm

As far as I understood, your arrays of functions are actually pretty small (~50) and you can probably hoist the dispatch: You could simply run a batch for each state. I.e. outer loop goes over states, inner loop over people.

If this is the case, then I recommend to just use a type for dispatch: Instead of storing a function, write function foo(::State{:AZ}, params... ) for special cases and function foo(::State{S}, params...) where S for shared code. Plug in a function boundary, i.e. call a @noinline batch_process(::State, data_batch) and then your performance penalty should go away.

Or is there a strong reason that your outer loop must go over people / entities and your inner loop over states?

donboyd5 · May 22, 2019, 9:54pm

Thank you.

I’m sorry if I implied the inner loop in the example runs over multiple states. In this example, I intend to deal with just a single state. (Much, much later on, there might be multiple states.) The inner loop in my example is intended to go over a set of perhaps 30-50 functions that might apply for a single state.

If there is a way to address the question in that context, it would be really great. That is, trying to get rid of the performance penalty from looping through a set (tuple) of functions.

Topic		Replies	Views
Function inside struct allocates when referenced New to Julia function	25	1141	November 7, 2023
Memory allocations when returning vectors General Usage array , memory-allocation	15	1546	June 6, 2018
Performance and allocation issue with arrays of functions (v1.5) Performance performance , memory-allocation	6	694	August 17, 2020
Array of functions and allocation General Usage	1	1632	April 13, 2019
"Composing" functions Performance	1	597	May 11, 2019

Array of functions - is there a way to avoid allocations performance penalty?

Related topics