# Array of functions - is there a way to avoid allocations performance penalty?

Edited version of my initial post based upon the response from @Raf.

After I wrote this post (1) I got advice on how to write a better post. I have taken that advice below, and (2) I actually found the answer to my post by looking at Understanding the allocation behavior on arrays vs scalars . Thus, I am editing the post to include the answer, in case others have the same question.

I want to maximize performance when I call functions that are array elements. My underlying motivation is that I plan to write income tax simulation models (tax calculators) for tax systems in ~41 states. The models will run on microdata files that represent taxpayers in the states. I want the models to be as parameter-driven as possible, where functions that do the necessary tax calculations are stored in arrays. The actual functions, the order in which they are processed, and their arguments, will vary from state to state, and from policy scenario (potential tax law) to policy scenario, defined in json files. It is a variant of and inspired by work being done here.

I was concerned because when I tested the idea of putting functions into arrays, performance degraded substantially. I have since discovered from the post linked above that this problem can be solved by using StaticArrays.

The remainder of this post shows the problem, and the solution.

The problem
The example below first defines consts for two functions (sin, cos) separately, and then defines a const array of these two functions. Then I call each of the two functions 10^6 times, two different ways: (1) calling the consts, and (2) calling the array elements. (I have constructed the example this way to isolate the issue, so that we can see - I believe - that it results solely from using an array of functions.)

The first approach has 13.5k memory allocations on the first run and only 6 allocations on the second run (6 allocations for @time I believe).

The second approach (array elements) has 6 M allocations on each call (and takes far more time when scaled up) and so is much less efficient. Because this is an approach I would like to be able to use, my initial question was, is there a way to make it more efficient?

const fvar1 = sin
const fvar2 = cos
const fvec = [sin, cos]

function fun_vars(n::Int64)
t = 0.0
for i = 1:n
t += fvar1(i)
t += fvar2(i)
end
return t
end

function fun_vec(n::Int64)
t = 0.0
for i = 1:n
t += fvec[1](i)
t += fvec[2](i)
end
return t
end

@time fun_vars(10^6) # 13.5k allocations
@time fun_vars(10^6) # 6 allocations on 2nd time

@time fun_vec(10^6) # 6 M allocations
@time fun_vec(10^6) # 6 M allocations on 2nd time



The solution
It turns out there is a way to make this more efficient. A solution - perhaps the best solution, I do not know - is to use StaticArrays, as shown below.

# Pkg.add("StaticArrays")
using StaticArrays

const fvec2 = SVector(sin, cos)

function fun_vec2(n::Int64)
t = 0.0
for i = 1:n
t += fvec2[1](i)
t += fvec2[2](i)
end
return t
end

@time fun_vec2(10^6) # 16.6k allocations
@time fun_vec2(10^6) # 6 allocations on 2nd time



I hope this is helpful to someone.

Welcome!

Please put your code in a code block so itâ€™s well formatted - so we can read it.

With a question like this, you could also describe what you are actually trying to do: the wider purpose of this code. Helping you make a better array of functions is not really helping you in the long term - you probably should be doing something else entirely. A little background will help people work out what that something is.

1 Like

Static arrays could be better, because they are really tuples underneath. You could instead just use a tuple.

Personally I would use a tuple of structs that are all passed to the same function, but they have their own methods defined for that function. That means you can attach additional parameters for each function if you need them, making it a lot more extensible. But it could also be overkill.

Currently, Iâ€™m working on a similar issue as @donboyd5 where I need an array of functions available.

In my case, I will be having anywhere from 1 to 64 functions in the array. In many cases, I will only need a few specific values from the whole array and sometimes I will need to calculate from all the functions in the array simultaneously.

For 1-64 functions in an array, with the need for computing individual values in some cases, and all values in other cases, would it be better to make a single function call, or individual calls? Hmm

I donâ€™t mean a single function call, but calling some run_it function for each struct in the tuple, triggering the appropriate method that matches the struct. The advantage is having self contained units that can hold additional parameters or preallocated arrays where required (if you need to run them millions of times).

I do this to build add-hoc composite models in my Celluar.jl and Dispersal.jl , GrowthRates.jl packages. But they are short (max 10 in a chain), and many need parameters - so I use a chain of structs that are passed to the same function in sequence. YMMV

Could you provide a minimal example?

https://github.com/yuyichao/FunctionWrappers.jl was created for this purpose (see https://github.com/JuliaLang/julia/issues/13984). I am however not sure how well it works on the latest julia release.

It seems you are not using the fact that the vector is a vector. A tuple should work equally well for your example?

3 Likes

This is only worth doing if your functions are suitably complicated, and need parameters or preallocated arrays occasionally, which is a lot of the time for me. Eg. one such function needs a precalculated dispersal kernel attached, and identical sized array for doing blas multiplications.

# Interface
function dosomething end

# Functions
struct Square end
dosomething(t::Square, x) = x^2

struct ParameterMult{P}
param::P
end
dosomething(t::ParameterMult, x) = t.param * x

runsim(xs, otherdata) =
# apply recursivly (ie type stable for a tuple) with list of structs and data,
# but basically like:
dosomthing.(xs, Ref(otherdata))
end

runsim((Square(), ParameterMult(2)))

1 Like

I just wanted to take another opportunity to plug FunctionWrappers being in the stdlib or even Base. There arenâ€™t that many things that you just flat out canâ€™t do in Julia, but efficient containerized functions is one of them without FunctionWrappers. Additionally, FunctionWrappers is a small amount of code, but very few people have the knowledge to implement it.

I wish I could offer help and just put in a PR for this, but itâ€™s way over my head.

5 Likes

Also I forgot to add, if you want this to be type stable and as fast as possible, you might need to use recursion instead of a for loop. You are looping over a tuple/SVector of different types, so it wont compile the loop to be type-stable. Unless its only a few different functions, in which case small Union optimisations might help.

Unless you are actually calling every function in the tuple manually by index as in the example, that could be type stable too.

Could someone explain how exactly FunctionWrappers works? Is FunctionWrapper{A,Tuple{B...}} where {A,B,...} supposed to specify the type of the inputs and outputs with A,B,...?

Yes, itâ€™s a type parameterized by the input output types, so that way it can directly do the call in a way thatâ€™s type stable, but directly with that pointer and making it use that specific method (Iâ€™m not sure it can do dispatch). Weâ€™ve demonstrated that it is essentially no overhead on DiffEq

but of course has some interesting constraints that you have to deal with.

3 Likes

In my case, it is needed for encoding the metric at a tangent space of a Riemannian manifold in differential geometry

The Hodge star requires evaluating a volume element of a submanifold using the metric at a particular point, which is currently considered constant and will be generalized to a FunctionWrapper next.

Is it possible to access the FunctionWrapper with getindex also? In my case, indexing is needed.

Thank you all for comments that were fast and valuable.

I realize, belatedly, that the â€śsolutionâ€ť I posted did not fully solve my problem, so here is an update that uses what I understand of whatâ€™s been written above and isolates my problem more precisely.

As two responders noted, all I need is a tuple of functions in my case, not an array.

However, when I call the functions by referencing each individual tuple member by its index, I allocate essentially no memory, but when I loop through the members, I allocate a lot of memory. The example below illustrates this.

I would like the benefits of looping through members (not needing to know the number of members, and not needing to write each function call out separately) but I would like the memory-allocation and speed benefits of not looping (i.e., essentially no memory allocation). I would much appreciate knowing whether it is possible to loop through the tuple of functions without the memory allocation penalty you see below. Thank you.


function test_elements(n::Int64)
funs = (sin, cos)
t = 0.0
for i = 1:n
t += funs[1](i)
t += funs[2](i)
end
return t
end

function test_loop(n::Int64)
funs = (sin, cos)
t = 0.0
for i = 1:n
for f in funs
t += f(i)
end
end
return t
end

@time test_elements(10^6) # 14.8k allocations
@time test_elements(10^6) # 6 allocations

@time test_loop(10^6) # 7 M allocations
@time test_loop(10^6) # 7 M allocations



Yes â€“ getindex access works with â€śwrapped functionsâ€ť in an array or a tuple.

As far as I understood, your arrays of functions are actually pretty small (~50) and you can probably hoist the dispatch: You could simply run a batch for each state. I.e. outer loop goes over states, inner loop over people.

If this is the case, then I recommend to just use a type for dispatch: Instead of storing a function, write function foo(::State{:AZ}, params... ) for special cases and function foo(::State{S}, params...) where S for shared code. Plug in a function boundary, i.e. call a @noinline batch_process(::State, data_batch) and then your performance penalty should go away.

Or is there a strong reason that your outer loop must go over people / entities and your inner loop over states?

Thank you.

Iâ€™m sorry if I implied the inner loop in the example runs over multiple states. In this example, I intend to deal with just a single state. (Much, much later on, there might be multiple states.) The inner loop in my example is intended to go over a set of perhaps 30-50 functions that might apply for a single state.

If there is a way to address the question in that context, it would be really great. That is, trying to get rid of the performance penalty from looping through a set (tuple) of functions.