Speed of dispatch with heterogeneous functions and types

I am planning to implement a continuous-time version of a discrete-time agent-based simulation (see https://github.com/mhinsch/RRGraphs). That means there will be some sort of central scheduler that handles dispatch of actions with, both, the type of the objects and the functions to be dispatched varying dynamically. In the light of this, I thought it would make sense to invest some time to find the fastest method.

Here is some test code I wrote: https://gist.github.com/mhinsch/773b4d34012090c6c551e32ff3f2f576. Basically there are two types of agents and two types of actions where the combination of the two is determined at run-time. I expect actions to be not entirely trivial, therefore the random numbers in step*. Part of the consideration also has to be the time it takes to line up agents and actions to be consumed by the scheduler which is why I am benchmarking the entire process not just the scheduling itself.

The results (Julia 1.1) are interesting, I think, and somewhat unexpected (for me):

static dispatch:
  4.800 ms (49 allocations: 4.00 MiB)
generic function, command array:
  6.521 ms (34 allocations: 4.00 MiB)
generic function:
  8.255 ms (89454 allocations: 5.37 MiB)
plain closures:
  19.733 ms (189371 allocations: 4.89 MiB)
typed closures:
  7.245 ms (189291 allocations: 4.89 MiB)
manual type check:
  6.290 ms (34 allocations: 4.00 MiB)
array of functors:
  8.842 ms (189463 allocations: 4.89 MiB)
FunctionWrapper with functor:
  11.308 ms (300017 allocations: 9.63 MiB)
FunctionWrapper with closure:
  6.596 ms (300017 allocations: 9.63 MiB)

To sum it up, from what I have read (and tested) before I would have expected for generic function calls to perform a lot worse and FunctionWrappers with functors a lot better.

In any case, any suggestions on how to improve the code or even entirely different solutions for the problem will be gratefully received :slight_smile:

An MWE would help. Its an interesting question but that’s too much code.

My first worry is the code that is populating the agents/action structures. Is that part of scheduling or not…If it’s not I’m worried that your “dispatch” numbers are being overshadowed by the initialization. So breaking the population of the arrays into a different function would probably give you better results.

If it is part you can change sched_gen's first two lines to:

    agents = Array{Agent}(undef, n)
    action = Array{Int}(undef, n)

You will probably get better results just because multiple allocations to populate the arrays turn into exactly 2 allocations. It appears all the tests except sched_static can benefit from this type of change, only in sched_static you can’t predict how large the arrays need to be.

Well, the point of the exercise is to test various methods against each other. I can’t make each of them more minimal than they already are.

That’s a good point, but in practice the array of actions/agents will grow and shrink dynamically as a result of what’s going on in the simulation (which therefore has to be part of the benchmark), so filling them like I did seems like a reasonable approximation to what’s going to happen in real life.

Edit: More specifically, the amount of allocation needed to hold the necessary structures (i.e. actions + objects) is part of the problem.

Far enough.

Looking at the numbers for this:

function pop_static(a1, a2, n)
	agents1 = Agent1[]
	agents2 = Agent2[]
	action = Int[]

	for i in 1:n
		if rand(1:2) == 1
			push!(agents1, rand(a1))
			push!(action, rand(1:2))
			push!(agents2, rand(a2))
			push!(action, rand(1:2))

	(agents1, agents2, action)

function sched_static(agents1, agents2, action)
	i = 1
	for a in agents1
		action[i] == 1 ? step1(a) : step2(a)
		i += 1
	for a in agents2
		action[i] == 1 ? step1(a) : step2(a)
		i += 1

const ns = 100_000

agents1, agents2, action = pop_static(arr1, arr2, 1)
sched_static(agents1, agents2, action)


agents1, agents2, action = @btime pop_static(arr1, arr2, ns)
@btime sched_static(agents1, agents2, action)

I get:

  4.367 ms (50 allocations: 4.00 MiB)
  1.211 ms (0 allocations: 0 bytes)

So from an optimization standpoint there is more room to optimize populating the data structures than dispatching the calls…So that is what I would focus on first…

Hmm, you have a point there. I’ll have to think about what that implies for the project…

Just an FYI, I was curious how much “overhead” the rand method was adding to the dispatch process so changing:

@noinline step1(a :: Agent1) = a.x += 1

@noinline step1(a :: Agent2) = a.y -= 2

@noinline step2(a :: Agent1) = a.x += 3

@noinline step2(a :: Agent2) = a.y -= 4

The dispatching time dropped by about 700ms which is also a sizeable chunk of the dispatch time…

Well about by 700ms when I did the “sched_gen” flow as two pieces…when I did that to the sched_static flow the dispatch time hit 615us…

Yes, that’s expected. My idea was to add a bit of minimal “load” to the step functions so that speed differences can be judged in a somewhat realistic scenario. E.g. a difference of 700ms doesn’t matter if the whole process (allocation + dispatch + simulation) takes 10s anyway. The comparison of my results to those of cdsousa’s tests (linked in my OP) shows that this makes a certain amount of sense. That said, I think you are right in that a more microscopic approach, identifying the exact costs of all parts of the algorithm, might be more helpful. That requires, though, that I think a bit more about what actually is going to be needed…

1 Like

This is kind of what I was getting at. Your examples have a lot of code difference so its hard to say where slowdowns come from, or really comment on the relative merits of the strategies without getting out a profiler and comparing line by line.

1 Like

It would be better to benchmark separately the initialization and the actions. Paricularly, with respect dynamic dispatch, there are caches that help to speed it up. Therefore, you should also try to benchmark what happens when methods are not in cache (by having much more agents and actions)