Performance: Building directed graph of game scores (0,0) -> (0,1) etc

I’m building a directed graph of possible game scores from (0,0) up to a score of N, together with a dictionary pointing from the score to the corresponding node.

I’d like to make it as efficient as possible. Currently for N = 1000 it’s taking about 1.8s to run on my i7-8550U (1.8GHz) machine.

I’d like to know if this is a reasonable speed, or I could squeeze more out of it, and if so, how.

I’m using recursion to create this; using a while-loop together with a queue slightly slows it down. Here’s the complete code:

using StaticArrays
using BenchmarkTools

const NEXT_SCORE_CONSTANT = [SVector(0,1), SVector(1,0)]
const END_SCORE = 1000

struct MyNode{T}
       data::T
       children::Vector{MyNode{T}}
       parents::Vector{MyNode{T}}
       MyNode(data::T,children=MyNode{T}[],parents=MyNode{T}[]) where T= new{T}(data,children,parents)
end

function constructorProcessNode!(score::SVector{2,Int64}, node::MyNode{SVector{2,Int64}}, nodeDict::Dict{SVector{2,Int64},MyNode{SVector{2,Int64}}})
    if (score[1] < END_SCORE) & (score[2] < END_SCORE)
        for j in NEXT_SCORE_CONSTANT
            nextScore = score+j
            if haskey(nodeDict, nextScore)
                previouslyCreatedNode = nodeDict[nextScore]
                push!(previouslyCreatedNode.parents, node)
                push!(node.children, previouslyCreatedNode)
            else
                newNode = MyNode(nextScore,[],[node])
                push!(node.children, newNode)
                nodeDict[nextScore] = newNode
                constructorProcessNode!(nextScore, newNode, nodeDict)
            end
        end
    end
end

function constructTree()
    nodeDict = Dict(SVector(0,0) => MyNode(SVector(0,0)))
    constructorProcessNode!(SVector(0,0),nodeDict[[0,0]], nodeDict)
    return nodeDict
end

@benchmark constructTree()
BenchmarkTools.Trial: 
  memory estimate:  477.79 MiB
  allocs estimate:  6006056
  --------------
  minimum time:     1.646 s (23.85% GC)
  median time:      1.765 s (29.88% GC)
  mean time:        1.789 s (30.73% GC)
  maximum time:     1.958 s (37.29% GC)
  --------------
  samples:          3
  evals/sample:     1

My initial reaction to this is that it’s really not clear why you need to do this at all. After all, if the goal is just to be able to figure out what the parents and successors of a given score is, why not compute those on-demand? The computation is trivial, since the successors of score (n, m) are just [(n + 1, m), (n, m + 1)], so why are you trying to store all of those ahead of time? In doing so, you end up needing to allocate literally millions of Vectors, so it’s not surprising that that takes some time.

If you actually need this data structure for some reason, then you could change the children and parents fields to be fixed-size vectors, since there can never be more than two of either (although you’d need to figure out what to put in those vectors for nodes with only one parent or child). But it’s hard to recommend a clear way to improve the performance of the code when it feels like the computation you’re doing isn’t necessary to solve the problem at hand.

@rdeits I’ve built this as a toy example for more complicated games, where the scoring is complicated enough for it to be simpler to do it this way (i.e. tennis, with all its rules about who serves when and tiebreaking sets).I know I’m allocating millions of vectors, but my processor can do billions of operations a second, which is why I’m wondering if I can get more performance out of it.

Hm I’ve translated to C++ and only got a 50% speed boost, I guess my Julia code should be OK then? …

From a performance standpoint what is killing you is the 6 million memory allocates. So anything you can do to reduce the number of allocations is a win. Since your scores go from (0-999, 0 - 999) my initial reaction would be to allocated an array from 1 - 1000000 to store the MyNode objects in then reference them by (score[0] * 1000 + score[1] + 1) you can remove a million allocation at least, you may need a second temporary array to indicate which have been initialized and which have not.

Next if you could predict the number of children and/or parents and allocate those arrays when MyNode is created, that would reduce the allocations more…not sure if it’s possible…but if it is, you could get the allocations down to the minimum of about 2 million allocations.

Not sure if the recursion is causing allocations but you could remove the recursion and create 2 loops, that should remove the need for a temporary array, since you can calculate if an element has been allocated or not…

Edit: Sorry you could use a 2d array also, I come from a background where 2d arrays meant an allocation for each row, so I would “fake” the 2d for a single allocation…

One more edit, if the result is always the same each run, then you could calculate it once, safe it to disk and reuse it each time…that’s probably the fastest :slight_smile: and has the benefit that you don’t really care how long it takes to calculate…

It looks like what you’re trying to build here is a graph where the graph vertices are game scores and the graph edges are game moves. Technically this would be a DAG (directed acyclic graph) rather than a tree.

As you’ve correctly observed something seems wrong about the performance here! Basically you should generally avoid these kind of textbook linked data structures if you care about performance. As others have observed they’re really hard on the allocator because they allocate so many small objects; they can also be pretty bad for memory locality and total size.

Instead, you can record your graph structure (vertices and edges) using arrays and indices, and keep track of the data attached to each node in a separate array. This probably seems complicated and confusing, but luckily there’s LightGraphs.jl which lets you do this in a fairly straightforward way.

To avoid nearly all allocation I tried constructing flat lists of vertices and edges (which you can later turn into a graph). I was a bit disappointed with the performance though, so I’m starting to think my guesses weren’t correct here.

Here’s what I came up with so far (runs in about 0.5 s on my machine).

using StaticArrays

const Score = SVector{2,Int16}

struct NodeData
    score::Score
    # add more data here ...
end 

function construct_score_graph()
    initial_score = Score(0,0)
    
    score_increments = [SVector(0,1), SVector(1,0)]
    end_score = 1000
    VertexIndex = Int32
    
    vertex_data = NodeData[NodeData(initial_score)]
    edges = Pair{VertexIndex,VertexIndex}[]
    #sizehint!(edges, 2000_000)
    #sizehint!(vertex_data, 2000_000)
    
    initial_vertex = VertexIndex(1)
    score_to_vertex = Dict(initial_score => initial_vertex)
    sizehint!(score_to_vertex, 2000_000)

    worklist = [(initial_score, initial_vertex)]
    while !isempty(worklist)
        score, vertex = pop!(worklist)
        if (score[1] < end_score) && (score[2] < end_score)
            for inc in score_increments
                next_score = score + inc
                next_vertex = get(score_to_vertex, next_score, nothing)
                if next_vertex === nothing
                    push!(vertex_data, NodeData(next_score))
                    next_vertex = length(vertex_data)
                    push!(score_to_vertex, next_score=>next_vertex)
                    push!(worklist, (next_score, next_vertex))
                end 
                push!(edges, vertex => next_vertex)
            end 
        end
    end 
    (edges, vertex_data)
end

edges, vertex_data = construct_score_graph()
@time construct_score_graph()

Edit: the dictionary which is used here is length ~1000_000, so the following is a useful performance baseline:

function dictsum(N)
    d = Dict{Tuple{Int,Int},Tuple{Int,Int}}()
    for i=1:N
        d[(i,i)] = (i,i)
    end
    s = 0
    for i=1:N
        v = d[(i,i)]
        s += v[1] + v[2]
    end
    s
end

@time dictsum(1000_000)

This shows that dictsum takes ~0.23 s on my machine, so we’re actually not far away from optimal already, unless there’s a good alternative to using the dictionary of pairs.

It turns out that a lot of the cost here is in the dictionary lookup, particularly hashing. Taking some of that away by using a custom key for the dictionary makes it run in 0.2 s or so, and further using an array + perfect hash, the following version runs in ~0.085 seconds:

using StaticArrays

const Score = SVector{2,Int16}

struct NodeData
    score::Score
    # add more data here ...
end

function construct_score_graph()
    initial_score = Score(0,0)

    score_increments = [SVector(0,1), SVector(1,0)]
    end_score = 1000
    VertexIndex = Int32
    
    # perfect hash of score, given limited size end_score
    score_id(score) = 1 + Int(score[1]) + Int(score[2])*(end_score+1)
    
    vertex_data = NodeData[NodeData(initial_score)]
    edges = Pair{VertexIndex,VertexIndex}[]
    #sizehint!(edges, 2000_000)
    #sizehint!(vertex_data, 2000_000)

    initial_vertex = VertexIndex(1)
    score_to_vertex = zeros((end_score+1)^2)
    score_to_vertex[score_id(initial_score)] = initial_vertex

    worklist = [(initial_score, initial_vertex)]
    while !isempty(worklist)
        score, vertex = pop!(worklist)
        if (score[1] < end_score) && (score[2] < end_score)
            for inc in score_increments
                next_score = score + inc
                id = score_id(next_score)
                next_vertex = score_to_vertex[id]
                if next_vertex == 0
                    push!(vertex_data, NodeData(next_score))
                    next_vertex = length(vertex_data)
                    score_to_vertex[id] = next_vertex
                    push!(worklist, (next_score, next_vertex))
                end 
                push!(edges, vertex => next_vertex)
            end
        end 
    end 
    (edges, vertex_data)
end 

edges, vertex_data = construct_score_graph()
@time construct_score_graph()

At this point I may have used too much of the structure of the toy problem, so it’s probably time to stop :slight_smile:

7 Likes

@pixel27 @c42f Thanks a lot guys! The main reason I wanted to know how to optimize it is because I want to learn Julia properly and I had a nagging feeling I was doing something wrong, but I guess nothing massively. But yeah, it’s a one time operation, so even if it takes longer it’s OK, but the next parts of my program will be more computationally intensive but will use what I’ve built here, so as long I’m not doing anything wrong, then the next part will make use of the linked nodes, which should be really fast (fingers crossed).

But it’s good to know that the dictionary lookup eats a lot of time. Also that using Int16 and Int32 instead of Int64 actually speeds up things (that I found surprising), and also that sizehint! is helpful.

I also tried looping with a queue instead of recursion, but it was actually slightly faster with the recursion, so I left it like that.

To be clear, Dict is quite amazingly fast when used with types which are cheap to hash; it’s a great simple and robust default. Replacing it in the code above was interesting but I’d say it’s a questionable performance hack unless you really need the speed. You might find the following thread interesting:

The speedup here was relatively minor, but if you have a large dataset and you’re limited by memory bandwidth it can make a huge difference to use smaller width integers. Another related reason to use narrow types would be if your inner loops are naturally data-parallel. In that case the compiler’s SIMD vectorizer might kick in, giving you a speed boost depending on how many elements can fit into your SIMD register at once.

IIRC I got a small benefit out of removing the recursion once I’d optimized some of the other things. I didn’t check in the final version though.

Using sizehint! to avoid rehashing the Dict helped quite a lot. You can sizehint! the arrays as well but that made less of a difference (hence being commented out).

So actually this is the part which you really want to optimize :slight_smile: I’d suggest giving LightGraphs.DiGraph a go by feeding it an edge list approximately like the one I’ve built above. I peeked at the light graphs internals and it doesn’t seem to lay out the data quite how I was imagining (ie, in a few big contiguous arrays). Perhaps there’s a different “immutable graph” type you could use which does that. I don’t know a lot about that part of the ecosystem mind you.