Why it would be better to have data related to nodes far away in memory from the nodes themselves?
There are really two primary reasons for this:
If you’re not using the “data related to nodes” (let’s call this metadata for now) as part of your graph operations, then yes - as you suggest, cache locality can be improved in many cases by simplifying the data passed around within the operation. As an example, consider an algorithm that processes outneighbors of a given vertex. For
LightGraphs.SimpleGraph, this is simply a (sorted) vector of
Int (by default; the eltype can be as small as
UInt8) which can fit into cache in all but the largest / most pathological cases, so when you’re accessing the elements in the vector it’s very quick. If you’re using a larger data structure that can’t fit into cache, you will take a performance hit.
The other reason is historical: that is,
LightGraphs started with only one graph type, a
SimpleGraph that could not store any metadata for nodes or edges at all. This approach results in very fast memory-efficient graph operations at the expense of flexibility. We justified this for the performance gains and the facts that 1) it’s fairly easy to create external data structures to hold the metadata you need, and 2) metadata generally plays no part in the graph operations folks tend to use most often. With
MetaGraphs.jl, we’ve created a generic way to associate metadata with graphs while keeping the graph performance of
SimpleGraphs. The current significant drawbacks are 1) increased complexity, 2) type instability in the metadata dictionaries, 3) increased memory usage, and 4) the likelihood of bugs due to 1). There’s also been an attempt to interface with
JuliaDB which may be of interest.
Is this related to the current julia status about supporting mutation of immutable types in mutable containers?
Not directly, as far as I’m aware. At least we haven’t consciously worried about this issue when designing the graph structures.