How to represent a DAG?

data_structures

#1

What is the best way to represent a DAG (Directed Acyclic Graph) in Julia?

I need to have mutable nodes that know who their children and their parents are, so that I can propagate information up and down them. The use case is constraint propagation, which has the same basic structure as reverse-mode automatic differentiation.

What is the best structure for this? The calculations should be as fast as possible.

So far I have something like

mutable struct Node
    value
    left::Union{Void, Node}
    right::Union{Void, Node}
    parent
end

Nodes can have values which are operations like +, variables like x or constants like 2.

Obviously I would like to concretely type the nodes, but I don’t see a way to do that.

Probably the answer is “use Cassette.jl”, but I would like to try doing it by hand…


#2

If you want it to be fast, definitely don’t do the recursive thing. Lay out the values in a flat vector (probably toposorted in the case of a DAG). I’d recommend storing the vertex/edge relationships separately as integer indices, again in vectors.

The real experts are probably the LightGraphs people; I’m sure they can help you out with this. In fact you could just use LightGraphs straight up (perhaps with MetaGraphs to store associated data).

Personally, I rolled my own in an effort to keep dependencies for my package to a minimum (which was more of an issue in the past, when LightGraphs was anything but light in terms of dependencies) and because of some missing features at the time. See https://github.com/JuliaRobotics/RigidBodyDynamics.jl/blob/master/src/graphs/directed_graph.jl and https://github.com/JuliaRobotics/RigidBodyDynamics.jl/blob/master/test/test_graph.jl. It’s reasonably fast, but I actually stopped using it in performance-sensitive code, partly to improve performance.

If I were to reimplement it right now, I’d do things slightly differently. I thought it would be nice to be able to really treat the data associated with vertices and edges as being identical to the vertices and edges, but that convenience has a performance cost associated with it. In my design for DirectedGraph{V, E}, the vertex type V and edge type E each store an integer index along with the data (your value). Relationships between edges and vertices are stored in types like Vector{V} and Vector{Vector{E}}, which are indexed by these integer vertex and edge indices. This makes it so that graph traversal requires doing a lot of lookups of the integer indices in the (in my case big, heap-allocated) V's and E's, which is terrible for cache locality. So I’d store the vertex/edge relationships using integer vertex indices and edge indices if I were to do a second pass. This would make it closer to MetaGraphs.MetaDiGraph, but instead of storing the data associated with edges and vertices in Dicts, I’d still use flat Vectors indexed by integer vertex index and edge index.

Sorry about the wall of text, I hope that made some sense.


#3

If I remember correctly https://github.com/mlubin/ReverseDiffSparse.jl uses DAGs.


#4

Thanks, that will take a while to process!


#5

I forgot to ask the related question: how to extract the DAG form using Dataflow.jl:

ps://github.com/MikeInnes/DataFlow.jl/issues/14


#6

I’d second the suggestion to see whether LightGraphs could help you here. The advantage is that if you have to do anything graph-like with the DAG, it’s probably already in the package, and (reasonably) optimized.

If you have one value you need to store per node (or edge), then you might just use a SimpleGraph with an external vector mapping vertex to value (and the reverse as a Dict if you need it); otherwise, check out MetaGraphs.jl.


#7

I am interested too. I am unable to get where the trade off is about this (I am not a CS guy). Why it would be better to have data related to nodes far away in memory from the nodes themselves? Let’s assume I access all the data of a node at each node selection and then i pick another one, isn’t it going to always happen more cache misses, by having the data at another address?
Is the trade off behind how cache is associative and how addresses are virtualized?

Is this related to the current julia status about supporting mutation of immutable types in mutable containers?


#8

Why it would be better to have data related to nodes far away in memory from the nodes themselves?

There are really two primary reasons for this:

If you’re not using the “data related to nodes” (let’s call this metadata for now) as part of your graph operations, then yes - as you suggest, cache locality can be improved in many cases by simplifying the data passed around within the operation. As an example, consider an algorithm that processes outneighbors of a given vertex. For LightGraphs.SimpleGraph, this is simply a (sorted) vector of Int (by default; the eltype can be as small as UInt8) which can fit into cache in all but the largest / most pathological cases, so when you’re accessing the elements in the vector it’s very quick. If you’re using a larger data structure that can’t fit into cache, you will take a performance hit.

The other reason is historical: that is, LightGraphs started with only one graph type, a SimpleGraph that could not store any metadata for nodes or edges at all. This approach results in very fast memory-efficient graph operations at the expense of flexibility. We justified this for the performance gains and the facts that 1) it’s fairly easy to create external data structures to hold the metadata you need, and 2) metadata generally plays no part in the graph operations folks tend to use most often. With MetaGraphs.jl, we’ve created a generic way to associate metadata with graphs while keeping the graph performance of SimpleGraphs. The current significant drawbacks are 1) increased complexity, 2) type instability in the metadata dictionaries, 3) increased memory usage, and 4) the likelihood of bugs due to 1). There’s also been an attempt to interface with JuliaDB which may be of interest.

Is this related to the current julia status about supporting mutation of immutable types in mutable containers?

Not directly, as far as I’m aware. At least we haven’t consciously worried about this issue when designing the graph structures.


#9

OH, one other thing I forgot to mention (and I think it deserves its own post which is why I didn’t edit my previous one): given how easy it is to construct your own graph types now in LightGraphs, you might consider doing just that if MetaGraphs is too unwieldy for you.