On the performance of function calls that depend on a variable

Hello everyone,

Sorry, it’s going to be a bit long.

I’m trying to get good performance in Julia for function calls that depend on a variable. That is, given a line in my code, I can’t tell in advance which function will be run (I only know its type signature).

So I’ve created a struct type to store the function within it. However, I’m facing performance issues, so I’ve written some benchmarks to investigate.

import Statistics: mean
using BenchmarkTools

# Creation of types that store a Function or a function Symbol.
EdgeTransition = Union{Nothing,Vector{Symbol}}
struct EdgeStruct
    tr::EdgeTransition
    func1::Function
    func2::Function
end
struct EdgeStruct2
    tr::EdgeTransition
    func1::Symbol
    func2::Symbol
end
EdgeTuple = Tuple{EdgeTransition,Function,Function}
EdgeTuple2 = Tuple{EdgeTransition,Symbol,Symbol}

# Function that will be called through an edge object
function f(t::Float64, values::Vector{Float64}, x::Vector{Int}, p::Vector{Float64})
    return (t <= 0.025 && (values[1] < 50.0 || values[1] > 75.0))
end

t = 0.1
values = [100.0, Inf, 0.0]
x = [99, 99, 1, 0] 
p = [1.0, 1.0]

edge_struct_1 = EdgeStruct(nothing, getfield(Main, :f), getfield(Main, :mean))
edge_struct_2 = EdgeStruct2(nothing, :f, :mean)
edge_tuple_1 = (nothing, getfield(Main, :f), getfield(Main, :mean))
edge_tuple_2 = (nothing, :f, :mean)
@assert typeof(edge_struct_1) <: EdgeStruct && typeof(edge_struct_2) <: EdgeStruct2 &&
        typeof(edge_tuple_1) <: EdgeTuple && typeof(edge_tuple_2) <: EdgeTuple2

println("Time execution of f")
@btime f(t, values, x, p)

println("Time execution of f with edges")
println("- Structs")
@btime getfield(edge_struct_1, :func1)(t, values, x, p)
@btime getfield(Main, getfield(edge_struct_2, :func1))(t, values, x, p)

println("- Tuples")
@btime edge_tuple_1[2](t, values, x, p)
@btime getfield(Main, edge_tuple_2[2])(t, values, x, p)

println("Time access of variables")
println("- Structs")
@btime getfield(edge_struct_1, :func1)
@btime getfield(edge_struct_2, :func1)
println("- Tuples")
@btime edge_tuple_1[2]
@btime edge_tuple_2[2]

which gives the output:

Time execution of f
  4.466 ns (0 allocations: 0 bytes)
Time execution of f with edges
- Structs
  24.571 ns (0 allocations: 0 bytes)
  46.807 ns (0 allocations: 0 bytes)
- Tuples
  262.293 ns (0 allocations: 0 bytes)
  275.926 ns (0 allocations: 0 bytes)
Time access of variables
- Structs
  10.826 ns (0 allocations: 0 bytes)
  10.826 ns (0 allocations: 0 bytes)
- Tuples
  235.832 ns (0 allocations: 0 bytes)
  237.547 ns (0 allocations: 0 bytes)

I’ve considered four ways of storing a function label in a variable. The best I’ve found so far, according to the BenchmarkTools package, is to create a new struct type, store the function in a field func1::Function, and access it via getfield(edge_struct_1, :func1).

However, one can see that the execution time of f() plus the memory-access time for the edge_struct_1 variable is about 15 ns, whereas the call getfield(edge_struct_1, :func1)(t, values, x, p) takes about 25 ns.

Where are the remaining 10 ns? I pursued my investigation and found, via the --track-allocation=user option and the Coverage.jl package, that a memory allocation is made each time getfield(edge_struct_1, :func1)(t, values, x, p) is executed in my code. This contradicts the result of @btime.

10 ns and one allocation (actually two allocations, because two functions are called this way) is not that much, but this function is executed a huge number of times, and I think it has a significant impact on the performance of my methods, especially the memory allocation.

Did I miss something or make an error in my conclusions? Does anyone have an idea for avoiding this allocation by approaching the problem from another point of view?

Thank you for your attention!

Ok, lots to unpack here, so I’ll start with some easy things in the hope that it will help simplify the problem.

Doing getfield(edge_struct_1, :func1) is pretty unusual, and there’s rarely any reason to do so. It’s certainly not necessary in your case, and it should be easier to write and faster (or at least equally fast) to do edge_struct_1.func1. Using getfield in this way offers no benefit, only more complexity.

Likewise, storing functions by Symbol and then looking them up via getfield(Main, func_name) is adding unnecessary complexity and slowing down your code. Instead of storing the name :mean and then looking it up by name, just store the actual function mean. Aside from being more performant, this will work with any function, where the getfield(Main, ...) approach falls apart as soon as you want something which is not named in Main.
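For instance, here is a minimal sketch of what I mean, reusing your names (the struct name EdgeStructPlain is just for illustration):

```julia
import Statistics: mean

# Same shape as EdgeStruct, but accessed with plain dots, and with the
# functions stored directly rather than looked up by Symbol.
struct EdgeStructPlain
    tr::Union{Nothing,Vector{Symbol}}
    func1::Function
    func2::Function
end

f(t, values, x, p) = t <= 0.025 && (values[1] < 50.0 || values[1] > 75.0)

edge = EdgeStructPlain(nothing, f, mean)

t = 0.1; values = [100.0, Inf, 0.0]; x = [99, 99, 1, 0]; p = [1.0, 1.0]

edge.func1(t, values, x, p)  # plain field access; no getfield needed
```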

So with that in mind, I would suggest getting rid of the EdgeStruct2 approach entirely, since it requires extra computation to look up the functions by name, offers no benefits, and prevents you from working with anonymous functions or any other functions that are not named in Main.

If that seems reasonable, then can you try to rewrite your current example without any usages of getfield? It should help clarify what’s going on, and might be a useful exercise on its own.

Beyond that, the right answer may depend on the specifics of your problem. Function is an abstract type, and we generally recommend against having struct fields with abstract type for performance reasons. Sometimes it can be unavoidable, though (e.g. if you want to store lots of different EdgeStructs all with different functions). One alternative approach is to parameterize the EdgeStruct on the type of its individual functions:

struct EdgeStruct{F1 <: Function, F2 <: Function}
  tr::EdgeTransition
  func1::F1
  func2::F2
end

This allows each field to have a specific concrete type corresponding to the particular function being stored, but it can make storing lots of EdgeStructs with different function types a bit more difficult.
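For example (with hypothetical names), two instances storing different functions get different concrete struct types, and each field is itself concretely typed:

```julia
# Sketch: the type parameters are inferred from the stored functions,
# so each field has a concrete type the compiler can specialize on.
struct EdgeStructP{F1 <: Function, F2 <: Function}
    tr::Union{Nothing,Vector{Symbol}}
    func1::F1
    func2::F2
end

g(a) = a + 1
h(a) = 2a

e1 = EdgeStructP(nothing, g, h)
e2 = EdgeStructP(nothing, h, g)

typeof(e1) === typeof(e2)  # false: different functions, different struct types
```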


Ignore the timings below; as @DNF pointed out, I messed up the interpolation. I will redo this with the Ref trick afterwards.

I am not sure whether there is an artifact in your timing because you did not interpolate the variables in the @btime macro. I interpolated everything (except Main, which I was not sure I should) and got the following on Julia 1.5.3:

Time execution of f
  0.018 ns (0 allocations: 0 bytes)
Time execution of f with edges
- Structs
  0.016 ns (0 allocations: 0 bytes)
  0.018 ns (0 allocations: 0 bytes)
- Tuples
  0.017 ns (0 allocations: 0 bytes)
  0.017 ns (0 allocations: 0 bytes)
Time access of variables
- Structs
  0.018 ns (0 allocations: 0 bytes)
  0.017 ns (0 allocations: 0 bytes)
- Tuples
  246.146 ns (0 allocations: 0 bytes)
  236.981 ns (0 allocations: 0 bytes)
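For reference, the Ref trick looks roughly like this (a sketch): interpolate a Ref and dereference it inside the benchmarked expression, so the compiler cannot constant-fold the call away:

```julia
using BenchmarkTools

f(t, values) = t <= 0.025 && (values[1] < 50.0 || values[1] > 75.0)

t = 0.1
values = [100.0, Inf, 0.0]

# With plain interpolation, @btime f($t, $values) can be folded away at
# compile time; wrapping each value in a Ref hides it from the compiler:
@btime f($(Ref(t))[], $(Ref(values))[])
```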

Hello,

Thank you, it is a very interesting answer!

For the getfield syntax, my reasoning was to bypass the call to getproperty. Indeed, getproperty can be overridden, and it was overridden for some other types in my code. In the end, I haven’t overridden it for this specific type, so I can change it.
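For illustration, here is a sketch of the pattern I was guarding against (the type here is hypothetical): dot access goes through getproperty, which can be overridden, while getfield bypasses the override:

```julia
struct Wrapped
    x::Int
end

# Dot access calls getproperty, which a type can override...
Base.getproperty(w::Wrapped, s::Symbol) =
    s === :x ? getfield(w, :x) + 100 : getfield(w, s)

w = Wrapped(1)
w.x              # goes through the override: 101
getfield(w, :x)  # bypasses it and reads the raw field: 1
```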
I ran the benchmark again, adding your solution and changing the getfield syntax to edge.func1, and obtained:

Time execution of f
  4.629 ns (0 allocations: 0 bytes)
Time execution of f with edges
- Structs
  39.582 ns (1 allocation: 32 bytes)
  68.655 ns (1 allocation: 32 bytes)
  46.009 ns (1 allocation: 16 bytes)
- Tuples
  263.109 ns (0 allocations: 0 bytes)
  275.636 ns (0 allocations: 0 bytes)
Time access of variables
- Structs
  25.620 ns (1 allocation: 32 bytes)
  25.942 ns (1 allocation: 32 bytes)
  30.745 ns (1 allocation: 16 bytes)
- Tuples
  245.922 ns (0 allocations: 0 bytes)
  235.720 ns (0 allocations: 0 bytes)

First, there are allocations for the struct types. Maybe the allocation I detected with Coverage.jl and --track-allocation is revealed by @btime when using the dot syntax for field access.
The third line of each “- Structs” part of the output shows the result for your solution. The allocation is divided by two! However, the time is slightly higher, though at this time scale I don’t know how reliable that is.

Do you know how the Function type works? I’ve noticed that each function has its own concrete type, typeof(func), but I don’t know whether some functions share the same concrete type (for example, if they share the same type signature).

I’m going to test your solution with my bigger code, thanks a lot!

UPDATE:

I tested your struct syntax and it actually decreases performance (about +50% runtime, at a scale of milliseconds). I think this may be due to what you anticipated: a lot of different structs are created, and that has an impact on performance.

Hello,

Thank you for your answer. Actually I didn’t interpolate on purpose.

Indeed, my goal is to measure the total computation time, including the access time of the variables. If I interpolate, I discard the access time of the function that is stored in a struct object.

I found with other tools (Coverage.jl) that an allocation (malloc()) is made each time I access the function. That cost is not revealed if I interpolate my variables within the @btime macro.

Sorry to interfere, but in order to avoid an XY problem, can you explain the original problem you are trying to solve? What you are doing looks very unidiomatic, and it is probably possible to change your approach one step before the current issue.


Hello,

I haven’t specified precisely what I’m implementing, in order to limit the length of the topic (it is already quite long). Here is a summary.

I’m implementing a specific type of automaton. An automaton is represented by a struct type; it has nodes and edges, among other things. Each edge contains a predicate function.

These automaton objects are created by metaprogramming, so I don’t know in advance the names of the predicate functions contained in the edges: the edges, including their predicate functions, are generated at that stage.

Sorry to continue derailing this topic, but have you seen Automa.jl, and also this blog post which shows how to use it? I think the authors had the same problem as you do and developed their own approach to solve it; maybe it will be useful for you too.

Also, how much control do you have over the metaprogramming part? For example, instead of generating one struct with various functions, you can generate multiple structs and let multiple dispatch do its job:

struct EdgeStruct1
  tr::EdgeTransition
end

struct EdgeStruct2
  tr::EdgeTransition
end

func1(e::EdgeStruct1, t, values, x, p) = t <= 0.025 && (values[1] < 50.0 || values[1] > 75.0)
func1(e::EdgeStruct2, t, values, x, p) = t > 0.225 && values[1] < 12.0

Something went wrong with the benchmarking here: the compiler was able to do all the work at compile time, leaving nothing for runtime, so this is an example of benchmark interpolation gone wrong. (Timings below 1 ns are generally not meaningful.)

You need the Ref trick here (I’m afraid I can never remember how to do that).


Don’t be sorry I think you’ve perfectly understood the kind of issues I’m facing.

I didn’t know the Automa.jl package but the tutorial you’ve sent indeed seems very useful for my case. I’m going to go deeper into the reading. Also, it looks like a very interesting package, thanks for the share!

I have total control over the metaprogramming part. With your solution, I assume I have to create an abstract type Edge, and all the constructed edges should inherit from this abstract type. After one execution, the Julia compiler should then infer types well. It looks like a very elegant solution…
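Here is a rough sketch of what I understand (the edge names are just placeholders): one abstract type, one generated struct per edge, and one predicate method per struct, letting dispatch pick the right one:

```julia
abstract type Edge end

# In my code these structs and predicates would be generated by
# metaprogramming; here they are written out by hand.
struct Edge1 <: Edge end
struct Edge2 <: Edge end

predicate(::Edge1, t, values) = t <= 0.025 && values[1] < 50.0
predicate(::Edge2, t, values) = t > 0.225 && values[1] < 12.0

edges = Edge[Edge1(), Edge2()]
results = [predicate(e, 0.3, [10.0]) for e in edges]  # dispatch selects the method
```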

Thanks a lot for your answer!


Hello again,

After spending some time considering your answers, I managed to identify what’s going on with my lack of performance. Here is a follow-up.

Things get worse when I have to access functions inside a collection. I wrote a benchmark script to test several of the mentioned methods. It creates a bunch of test functions, then accesses and calls them through different data structures.

using BenchmarkTools

println("Cost of 20 boolean tests:")
f_test(a::Int) = a == 2
@btime for i = 1:20
    f_test(3)
end

# Multiple dispatch
abstract type SuperType end
for i = 1:10
    include_string(Main, "struct Type$i <: SuperType v::Symbol end")
    include_string(Main, "f(item::Type$i, a::Int) = a == $(i)")
end
vec_types = SuperType[]
for i = 1:20
    rand_number_type = rand(1:10)
    @eval item = $(Symbol("Type$(rand_number_type)"))(:test)
    push!(vec_types, item)
end
println("Multiple dispatch:")
function read(vec_types::Vector{SuperType}, a::Int) 
    for item in vec_types
        f(item, a)
    end
end
@btime read(vec_types, 3)

# Parametric struct
struct AnotherSuperType{T<:Function}
    v::Symbol 
    f::T
end
vec_others_types = AnotherSuperType[]
for i = 1:20
    rand_number_type = rand(1:10)
    include_string(Main, "f_other_$(i)(a::Int) = a == $(i)")
    sym_func = Symbol("f_other_$i")
    @eval push!(vec_others_types, AnotherSuperType(:test, $(sym_func)))
end
println("Parametrized struct:")
function read(vec_others_types::Vector{AnotherSuperType}, a::Int)
    for item in vec_others_types
        item.f(a)
    end
end
@btime read(vec_others_types, 3)

# With two vectors
println("Vectors:")
vec_tr = Union{Nothing,Symbol}[]
vec_func = Function[]
for i = 1:20
    rand_number_type = rand(1:10)
    include_string(Main, "f_vec_$(i)(a::Int) = a == $(i)")
    sym_func = Symbol("f_vec_$i")
    push!(vec_tr, :test)
    @eval push!(vec_func, $(sym_func))
end
function read(vec_tr::Vector{Union{Nothing,Symbol}}, vec_func::Vector{Function}, a::Int) 
    for i = eachindex(vec_tr)
        vec_func[i](a)
    end
end
@btime read(vec_tr, vec_func, 3)

It gives the output:

Cost of 20 boolean tests:
  2.207 ns (0 allocations: 0 bytes)
Multiple dispatch:
  803.337 ns (0 allocations: 0 bytes)
Parametrized struct:
  1.833 μs (0 allocations: 0 bytes)
Vectors:
  1.172 μs (0 allocations: 0 bytes)

The first “multiple dispatch” method is the one mentioned by @Skoffer whereas the “parametrized struct” method is the one mentioned by @rdeits. The third test checks if the separation of data in several vectors (a specific vector for functions and another vector for the associated data) improves the performance.

The multiple dispatch method performs best, but all these methods allocate memory each time one of the test functions is called. As you can see, the timings are high compared to what the functions are actually doing.

@Skoffer I read about the Automa.jl package; it was really interesting. However, I don’t think it works in my case: their automata are deterministic, whereas mine are deterministic in a more complex sense that allows several edges between two states of the automaton.

In my mind, it should be possible to improve the performance of such code, because I know exactly the signature of my test functions. I suppose the main reason for the lack of performance is that Function is an abstract type, so the compiler cannot infer the concrete type of the test function at each call.

Maybe I should use a lower-level approach that loads a pointer to each function and calls through that pointer, but I have no idea how to do such a thing in Julia. I’m running out of ideas.

Anyway thanks a lot for the answers!

That’s basically what FunctionWrappers.jl does. If you actually need to store a large number of different functions with the same signature and then call them with very little overhead per function, you might find it useful.
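A minimal sketch of the idea, assuming a (Float64, Vector{Float64}) -> Bool signature (the wrapper alias and predicate names are just for illustration; this requires adding the FunctionWrappers package):

```julia
using FunctionWrappers: FunctionWrapper

# All wrappers for one signature share one concrete type, so a Vector
# of them is concretely typed and each call has low, fixed overhead.
const PredWrapper = FunctionWrapper{Bool,Tuple{Float64,Vector{Float64}}}

f1(t, values) = t <= 0.025 && values[1] < 50.0
f2(t, values) = t > 0.225 && values[1] < 12.0

preds = PredWrapper[PredWrapper(f1), PredWrapper(f2)]
preds[2](0.3, [10.0])  # call through the wrapper
```

Because every wrapper has the same concrete type, the container avoids the abstractly-typed Function fields discussed above.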


Awesome I’m going to check it, thanks!