Identifying performant data structures

Hi, I am coding a simulation and have a couple of possibilities for structuring the data. I’d like to sort out which is likely to be the most efficient. I’ve written toy examples for some of the options, but the results have not been super informative.

The model accrues ~100k agents over time, each of which has some mutable and some immutable characteristics; the mutable ones are updated several thousand times during a run. The final number of agents is known prior to the run. I don’t have a ton of experience with Julia, so I am wondering if there are options other than those mentioned below that I should be aware of for structuring the data, or performance-relevant details of these approaches that would be good to know about.

Some of the possible strategies, which could be implemented on their own or in combination:

  • Make Agent a mutable struct, and add a const designator to those values that don’t need to change
  • Make Agent an immutable struct and use @reset to change those values that need to
  • Make an immutable struct for the unchanging values, a mutable one for the others, and link/filter them with a unique id
  • Start the run with an empty array of agents, and push agents into it as they enter the model
  • Start the run with a full array of agents and designate some as active as they enter the model
  • Put the whole thing in a DataFrame or some other kind of tabular structure.
  • Make an immutable struct for the changing values, and simply add a new element/record to the struct array each time an agent’s characteristics change
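To make the first two options concrete, here is a rough sketch (the field names are hypothetical; Option 1 requires Julia 1.8 or later for `const` fields):

```julia
# Option 1: mutable struct, with `const` on the fields that never change
# (requires Julia >= 1.8).
mutable struct AgentM
    const id::Int     # immutable characteristic
    wealth::Float64   # mutable characteristic
end

# Option 2: fully immutable struct; an "update" builds a replacement value.
# Accessors.jl's @reset automates the replacement shown manually below.
struct AgentI
    id::Int
    wealth::Float64
end

a = AgentM(1, 10.0)
a.wealth += 5.0                   # mutate in place; a.id cannot change

b = AgentI(1, 10.0)
b = AgentI(b.id, b.wealth + 5.0)  # rebind b to a new value
```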

Thanks for any knowledge you can share!

I only have experience with very small systems with a handful of objects. But it’s the next day and still no responses, so I’ll offer my advice.

I find that immutable objects are much (much!) easier to work with than mutable. Replacing objects is much easier to organize and reason about than mutation. A Vector, Dict or other suitable container of immutable objects is generally a nice way to keep a lot of things around.

I also find getter/setter functions to be nicer than object.field access. They make it much easier to make changes later (especially changes related to polymorphism). So while @reset is very nice, I would recommend you wrap it in setter functions in most cases.
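For instance, a minimal sketch of that pattern (the Agent fields and function names here are made up):

```julia
struct Agent
    id::Int
    wealth::Float64
end

# Getter and setter hide the field access behind functions. The setter
# returns a replacement value rather than mutating; with Accessors.jl
# its body could be `@set a.wealth = w`, which scales better when the
# struct has many fields.
wealth(a::Agent) = a.wealth
set_wealth(a::Agent, w) = Agent(a.id, w)

a = Agent(1, 10.0)
a = set_wealth(a, wealth(a) + 5.0)
```

If you later change how wealth is stored (or add polymorphic agent types), only the getter/setter pair needs to change, not every call site.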

Mutable objects are best for when you actually require mutable semantics (mostly in data structures) or when you have very large objects (hundreds of bytes, at least) where you frequently change only a small fraction of the data (e.g., matrix factorization algorithms), because replacing the whole object can sometimes be noticeably slower than mutation. Hopefully, the compiler continues to get better at the @reset pattern and can efficiently update larger immutable objects as time goes on.

I’ve done some benchmarks: if your code is type-stable and you need to modify some fields of a struct, then using a mutable struct is the fastest. An immutable struct combined with @reset is a bit slower.
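Here is a sketch of the two update patterns being compared (names are hypothetical, and this only checks correctness; for real timings use BenchmarkTools.jl’s @btime on each function):

```julia
mutable struct MAgent
    wealth::Float64
end

struct IAgent
    wealth::Float64
end

# Pattern 1: mutate a field of each element in place.
function update_mutable!(agents::Vector{MAgent}, n)
    for _ in 1:n, a in agents
        a.wealth += 1.0
    end
end

# Pattern 2: replace each element with a rebuilt immutable value
# (this is what @reset expands to for a single field).
function update_immutable!(agents::Vector{IAgent}, n)
    for _ in 1:n, i in eachindex(agents)
        agents[i] = IAgent(agents[i].wealth + 1.0)
    end
end

mut = [MAgent(0.0) for _ in 1:100]
imm = [IAgent(0.0) for _ in 1:100]
update_mutable!(mut, 1_000)
update_immutable!(imm, 1_000)
```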

As mikmoore mentioned, if the fields you need to modify are themselves mutable objects, you can simply use an immutable struct. If the fields you need to modify are immutable objects, then I think this post might help you choose an appropriate data structure.

IMO: write your simulation in the most natural way first. Often that will help you get the data structures right. Then benchmark and see if/where it’d make the most sense to optimize.

6 Likes

If you have lots of different objects, you may want to consider an entity component system (ECS) architecture. But this is probably not the first thing to try if you want to get a working prototype.

Another option is to make Agent an immutable struct and store it in a StructVector from StructArrays.jl to be able to easily modify fields.

If you use sizehint! on the StructVector, push! will be very fast.
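A small sketch of that combination, assuming StructArrays.jl is available (the Agent fields are hypothetical):

```julia
using StructArrays

struct Agent
    id::Int
    wealth::Float64
end

# Start empty, but reserve capacity up front since the final number
# of agents is known; push! then rarely needs to reallocate.
agents = StructVector{Agent}(undef, 0)
sizehint!(agents, 100_000)

for i in 1:1000
    push!(agents, Agent(i, 0.0))
end

# Each field is stored as its own contiguous column, so you can
# modify a field across all agents without rebuilding any structs:
agents.wealth .+= 1.0
```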

A tricky thing is deleting agents if you want the agent data to stay contiguous without any holes (depending on what you are doing, this can be important for performance). @Tortar and I are working on some packages to make that more convenient. These aren’t production-ready yet, but they may give you some ideas. https://github.com/nhz2/UniqueIDs.jl
https://github.com/Tortar/KeyedTables.jl (tables with stable identifiers in Julia)
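One common trick for deleting while staying contiguous is swap-and-pop. This is a generic sketch, not the API of those packages, and note that it changes the index of whichever element was last:

```julia
# Remove element i in O(1) without leaving a hole: move the last
# element into the vacated slot, then shrink the vector by one.
function swap_remove!(v::Vector, i::Integer)
    v[i] = v[end]
    pop!(v)
    return v
end

v = [10, 20, 30, 40]
swap_remove!(v, 2)   # removes 20; 40 takes its slot
```

Because indices are not stable under this scheme, you need an id-to-index mapping (which is roughly the problem UniqueIDs.jl and KeyedTables.jl are aimed at).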

1 Like

I disagree: You need to anticipate early what kind of refactorings may become necessary, and accommodate them beforehand.

To give an example, suppose you start out with a natural

mutable struct Agent
    # ...
end

and then have lots of internal APIs that operate on agents, very OOP style.

Then you figure out that this sucks: your Agent objects are splattered all over the heap, each has a big fat object header, etc., and you’d like to move to a setting where you have

struct Agent
    # ...
end
agents::Vector{Agent}

Now you are moving towards ECS-style. And you are in for a very very painful refactoring, because all functions

function do_something!(agent::Agent, ...)
    # ...
    agent.someField = someValue
    # ...
end

must become

function do_something!(world::Vector{Agent}, idx, ...)
    agent = world[idx]
    # ...
    world[idx] = modified_agent
    # ...
end

In other words, the entire philosophy of all your internal APIs changes from “do something with an agent that conceptually exists independently of the world” to “do something with the world, especially with respect to the agent with idx=…”.

This is one of the most painful of all refactors – anticipate whether you will need it, and maybe plan for it.

PS. Another very important consideration that is also incredibly painful to change is whether you want to buffer/batch updates, or do them immediately. The canonical example is sparse matrices: updating the sparsity structure fundamentally needs batching of updates. Turning a codebase that assumes it can write willy-nilly to agents / the world into one that batches all updates, i.e. fundamentally operates on snapshots of the world-state, is a refactoring nightmare!

And if each agent has lots of ::Vector{T} properties that each are very small vectors, then it may well become necessary to use SparseMatrixCSC-style storage which requires batched updates. In database lingo, this is the difference between columnar and row-based storage; switching between these approaches is not seamless.

PPS. mbaumann is of course correct for the first prototype that you plan to throw away. But after the throwaway stage, some things are just nasty to change.

2 Likes

My point is that I often don’t even know what APIs I will need my objects to provide when I start, let alone worry about the micro-optimizations to the data structures that power them.

Figuring out the APIs first — yes, definitely with an eye towards avoiding quadratic/exponential/terrible computational complexities — can actually help you understand what the backing data structures need to be.

And yes, of course that’s not always true.