Trying to identify possible optimizations (or errors) in a graph algorithm

I am trying to re-implement (in Julia) some old C++ code I wrote a while ago to do certain operations on graphs; however, I’m not confident the performance is as good as I remember it being… I was hoping someone might be able to take a look at the MWE below and let me know if they see anything out of place, bad form, etc.

To briefly describe the code: I have an algorithm that is dominated by accesses, searches, edge additions, and edge removals on a graph. I need to loop over every single dyad in the graph multiple times, each time performing a set of calculations that compare the graph with a given tie to the graph without it.

using BenchmarkTools, LightGraphs


# Change in number of edges when toggling a particular edge
function delta_edge(G, i, j)
    @inbounds if i == j
        return 0.0
    elseif G[i,j]
        return -1.0
    else
        return 1.0
    end
end


# Change in number of mutual dyads when toggling an edge
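# (The case analysis: if the reverse tie g[j, i] is absent, toggling (i, j)
# can't change the mutual count; otherwise adding (i, j) completes a mutual
# dyad, and removing it breaks one.)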
function delta_mutual(g, i, j)
    @inbounds if i == j
        return 0.0
    elseif !g[j, i]
        return 0.0
    elseif !g[i, j]
        return 1.0
    else
        return -1.0
    end
end


# Change in number of transitive triples when toggling an edge
# For a given dyad (i,j), loop over all third parties and see how many
# such triples are created/destroyed.
# This scales badly with graph size (expected), but should still be a bit better.
function delta_ttriple(g, i, j)
    @inbounds if i == j
        return 0.0
    else
        c = 0
        for k in 1:size(g)[1]
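            # Each term is one way the dyad (i, j) can close a transitive triple
            # with third party k: (i->j, j->k, i->k), (i->k, k->j, i->j),
            # and (k->i, i->j, k->j).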
            c += (g[j, k] && g[i, k]) + (g[k, j] && g[i, k]) + (g[k, j] && g[k, i])
        end

        if g[i,j]
            return -convert(Float64, c)
        else
            return convert(Float64, c)
        end
    end
end


# Mutating function to toggle an edge and return a small vector of graph change statistics
function toggle!(x, g, i, j)
    @inbounds x[1] = delta_edge(g, i, j)
    @inbounds x[2] = delta_mutual(g, i, j)
    @inbounds x[3] = delta_ttriple(g, i, j)
end


# Function to just iterate over all dyads in the graph, but only do "heavy" 
# calculations for ones that have edges. This doesn't do anything, but uses 
# the same basic operations as the "real" function.
function partial_single_loop!(x, g)
    n = size(g)[1]
    @inbounds for j in 1:n
        for i in 1:n
            !g[i, j] && continue
            toggle!(x, g, i, j)
            if rand() < 0.5
                g[i, j] = !g[i, j]
            end
        end
    end
end


# Iterate over all pairs of vertices in the graph
# Since the graph has a density of ~0.05, this means performing
# about 20x the number of operations as the previous function,
# on average.
function full_single_loop!(x, g)
    n = size(g)[1]
    @inbounds for j in 1:n
        for i in 1:n
            toggle!(x, g, i, j)
            if rand() < 0.5
                g[i, j] = !g[i, j]
            end
        end
    end
end


# Loop over the graph multiple (K) times
# If K = 5, I'd expect to see ~100x slower speed
# than the first function, on average, and ~5x
# slower than the second function.
function multi_loop!(x, g, K)
    n = size(g)[1]
    @inbounds for k in 1:K
        for j in 1:n
            for i in 1:n
                toggle!(x, g, i, j)
                if rand() < 0.5
                    g[i, j] = !g[i, j]
                end
            end
        end
    end
end


# Generate some test data and benchmark
g1 = erdos_renyi(200, 0.05; is_directed = true)
a1 = convert(Array{Bool, 2}, collect(adjacency_matrix(g1)))

x1 = zeros(3)
@btime partial_single_loop!($x1, $a1)
x1 = zeros(3)
@btime full_single_loop!($x1, $a1)
x1 = zeros(3)
@btime multi_loop!($x1, $a1, 5)

#= 
Timings:
(1) ~18us to cover ~2000 edges
(2) ~14ms to cover ~40k dyads
(3) ~72ms to cover 40k dyads 5 times
=#

None of the functions allocate.

My main interest is in getting the “multi_loop” function much, much faster - if possible.

When I benchmark the version where I just loop over the edges of the graph, it’s very, very fast (exactly as expected). However, moving to cover the full set of dyads scales badly. Naively, if the first function takes ~18 µs to do 2k operations (~9 ns per toggle), it seems odd that the last function should take 72 ms to do 200k of the same operations (~360 ns per toggle, roughly 40x the cost). (Although at least the scaling seems about right going from the second to the third function.)

Anyway, if anyone has suggestions for this sort of code, let me know. It’s not slow, but it’s approaching the limits of usability on what is a pretty small graph; I want to make sure I’m not doing something wrong.

(FWIW, I tried using LightGraphs first; it was a bit slower for small graphs and much faster for large graphs, as expected… as long as I didn’t need to remove an edge in the algorithm. As soon as I did that, performance tanked by a couple of orders of magnitude…)

I can’t comment on your code, but if you’re looking for the fastest removals in the JuliaGraphs ecosystem, I’m afraid SimpleGraphs already provides them.

In general, the tradeoff we’ve made is for memory efficiency and fast accesses over fast modifications. This is generally because for small graphs, even slower things are “fast enough”, and for large graphs, you need to be able to fit the graph into memory, and usually graph analysis (as opposed to graph modification) is the order of the day. Yours might be an edge case, as are cases where you’re doing many modifications to a small graph.

If you’d like to build a graph backed by a dense matrix (which will give ~O(1) edge additions but O(|V|^2) memory), all you’d need to do is implement the dozen or so functions specified in interfaces.jl and everything else should “just work”. You could probably make some trivial modifications to SimpleWeightedGraphs to do just this.
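For concreteness, here’s a minimal sketch of what that might look like (the type name is made up, and the exact list of required methods may be slightly off; interfaces.jl is authoritative):

using LightGraphs

# Hypothetical dense-matrix-backed digraph (not an existing package type).
struct MatrixDiGraph <: LightGraphs.AbstractGraph{Int}
    m::Matrix{Bool}            # m[i, j] == true  <=>  edge i -> j
end

LightGraphs.nv(g::MatrixDiGraph) = size(g.m, 1)
LightGraphs.ne(g::MatrixDiGraph) = count(g.m)
LightGraphs.vertices(g::MatrixDiGraph) = Base.OneTo(nv(g))
LightGraphs.has_vertex(g::MatrixDiGraph, v) = 1 <= v <= nv(g)
LightGraphs.has_edge(g::MatrixDiGraph, s, d) = g.m[s, d]
LightGraphs.outneighbors(g::MatrixDiGraph, v) = findall(view(g.m, v, :))
LightGraphs.inneighbors(g::MatrixDiGraph, v)  = findall(view(g.m, :, v))
LightGraphs.is_directed(::Type{<:MatrixDiGraph}) = true
Base.eltype(::MatrixDiGraph) = Int
# ...plus edgetype, edges, and zero to round out the interface.

# The payoff: O(1) edge addition and removal.
LightGraphs.add_edge!(g::MatrixDiGraph, s, d) = (g.m[s, d] = true; true)
LightGraphs.rem_edge!(g::MatrixDiGraph, s, d) = (g.m[s, d] = false; true)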


Thanks - as a matter of fact, that’s what I’ve done in my “real” code (i.e., built a SimpleGraph-style type backed by a dense matrix), using the example from the recent LightGraphs presentation (thanks!). I was just keeping things simple to provide as “minimal” an MWE as possible. :wink:

My main issue, captured in the code above, is that performing my set of calculations over just the pre-existing edges is really fast (for both the dense matrix class and the standard simplegraph class). Since it’s really fast looping over 5% of the dyads (the network has a density of .05), I would expect that doing a single loop over all the dyads would make it maybe ~20x slower.

I did a little more testing, and that is exactly what I see when I just do the “delta_edge” function (which is a simple per-dyad lookup): increasing the number of toggles by 20x also requires about 20x the time. I see the same pattern (a linear increase in time) when I add the “mutual” term to the toggle function.

Some additional testing of the code above (and my real code) seems to show that where everything goes to crap is specifically the inclusion of the triple term. When I add that, instead of 20x (linear) scaling, it takes over 1000 times longer (in a 500-node network of density 0.05). I also confirmed this in my original code: something about the triple term is scaling VERY badly.

Now, that’s the most “expensive” function call I have. And I expect that term to scale very very badly with the size of the network. But for a network of fixed size (my case here), the cost of any given call to the “triple” function should be about the same for any particular dyad, and I’d expect linear scaling with the number of dyads it’s calculated across. And FWIW, I do see linear scaling for triples for multiple full passes over the network, relative to a single full pass.

But there’s something specifically with the triple function that causes the move from a partial loop (over dyads with existing ties) to a single loop (over all dyads) to have that nasty superlinear slowdown. And I can’t see why…

At this point, I’ve played with multiple alternative algorithms for the triple function, using both standard arrays and adjacency lists (i.e., standard simplegraph types), and I see the same effects no matter what. Everything is very fast for a single dyad, or the set of dyads that have edges, but then there’s a dramatic slowdown when applied to all dyads.

I’ve managed to get a speedup of about 30% on your data with the following code:

function delta_ttriple(g, i, j)
    if i == j
        return 0.0
    else
        n = size(g)[1]
        c = zero(Int)
        @inbounds for k in Base.OneTo(n)
            x1 = g[j, k]
            x2 = g[i, k]
            x3 = g[k, j]
            x4 = g[k, i]
            if x1 && x2
                c += 1
            end

            if x3 && x2
                c += 1
            end

            if x3 && x4
                c += 1
            end
        end

        cf = Float64(c)
        if g[i, j]
            cf *= -1
        end
        return cf
    end
end

As for the overall slowness of this function, my suspicion is that it’s cache-locality issues, since you’re accessing different parts of a large array (Julia arrays are column-major, so the g[j, k] and g[i, k] reads stride across rows, while g[k, j] and g[k, i] walk down contiguous columns).

First, consider whether you really want to use floating point numbers, or rather integers.

Then: How do you store your graph? What are the kind of problems you have? Many small and dense graphs with easy questions? Few small and dense graphs with NP-complete questions? Large and sparse graphs?

For medium-small dense graphs, use two bitmaps: one for g, one for gT (slightly redundant, but faster operations). You need to add count(gT[:, j] .& gT[:, i]) + count(g[:, j] .& gT[:, i]) + count(g[:, j] .& g[:, i]); thanks to the packing, you can handle 64-256 entries in the same instruction.
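To make the counting identity concrete, here is a minimal (allocating) sketch of mine using plain BitMatrix columns; the chunked version later in the thread avoids the temporaries:

# g is the packed adjacency matrix, gT its explicitly stored transpose.
# count() over the packed results is a popcount over 64-bit chunks.
function delta_ttriple_bitmap(g::BitMatrix, gT::BitMatrix, i, j)
    i == j && return 0
    c = count(gT[:, j] .& gT[:, i]) +   # i->j, j->k, i->k triples
        count(g[:, j]  .& gT[:, i]) +   # i->k, k->j, i->j triples
        count(g[:, j]  .& g[:, i])      # k->i, i->j, k->j triples
    return g[i, j] ? -c : c
end

a  = rand(Bool, 200, 200)
g  = BitMatrix(a)
gT = BitMatrix(a')                      # store the transpose explicitly
delta_ttriple_bitmap(g, gT, 3, 7)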

Afaik there is no sensible package for bitmap-represented di-graphs out there. SimpleGraphs is optimized for large sparse immutable graphs.

I am personally only dealing with undirected bitmap graphs; that is simply a Matrix{UInt64}(undef, Base.num_bit_chunks(n), n+1), where I use the last column for a mask of active vertices (sometimes I need to delete vertices), and a BitMatrix that is wrapped around this (reshape the matrix into a Vector; that’s your chunks). The BitMatrix is for convenient access functions, while the Matrix is for convenient chunked access. Depending on context, also make a wrapper that encodes the number of chunks (so that b[:, i] can return an SVector or NTuple of UInt64). I need to process a lot of cliques with certain properties; hence, the occasional type instability can pay off by specializing all loops on NV = Val(Base.num_bit_chunks(n)), especially since NV < 5 is not uncommon for my data.

Unfortunately, BitMatrix does not align its columns to chunk boundaries; hence, bm[:, i] .& bm[:, j] cannot be computed quickly in the Base.BitArray layout.

If there are other people here interested in bitmap-based graph algorithms, maybe we could join forces?

You might take a look at StaticGraphs, which does precisely this but with a pair of sparsevecs as opposed to bitmaps. A bitarray-backed graph would be horribly slow both on inserts and on accesses.

What?

Accessing bm[i, j] for bm::BitArray is extremely fast. Sure, you need to fix the max number of vertices a priori, and set bm[j, end] if and only if slot j is filled (the last column is a mask of active vertices). Deleting a vertex, toggling an edge, or viewing subgraphs is now very fast (change the mask).

However, if the graph is large and sparse, then the bitmap won’t fit into memory. In an adjacency list you pay at least 32 bits per entry (if you judiciously use Int32 everywhere). If your density is >0.01, then a bitmap is certainly a good contender.

In order to enumerate neighbors, I use an analogue of the new logical indexing code for bitarrays, brewed around the BLSR instruction, and compute bm.chunks[i, j] & bm.chunks[j,end] on the fly.
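Something like the following sketch (my own illustration of the pattern; w &= w - 1 is the BLSR idiom that clears the lowest set bit):

# Enumerate neighbors of vertex j from one packed column, masked by the
# active-vertex column. Assumed layout: M[k, j] = k-th 64-bit chunk of column j,
# with the last column of M holding the active-vertex mask.
function foreach_neighbor(f, M::Matrix{UInt64}, j::Int)
    nchunks, ms = size(M, 1), size(M, 2)
    @inbounds for k in 1:nchunks
        w = M[k, j] & M[k, ms]
        while w != 0
            t = trailing_zeros(w)         # position of lowest set bit
            f((k - 1) * 64 + t + 1)       # 1-based vertex index
            w &= w - 1                    # BLSR: clear lowest set bit
        end
    end
end

# e.g., collect the out-neighborhood of vertex j:
# nbrs = Int[]; foreach_neighbor(v -> push!(nbrs, v), M, j)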

PS. I am using this to compute homology of clique-complexes of small dense graphs that are represented as bitmap, with Base.num_bit_chunks(num_vertices) encoded into the type (use function boundary there). This theoretically includes a horrible, horrible recursion over all cliques, and hence lots of viewing of induced subgraphs. The recursion becomes less horrible (actually tractable, even) because many graphs permit Morse-theoretic shortcuts (deal with all cliques in our current view that contain vertex x in one fell swoop; next, deal with cliques in our current view that don’t contain vertex x, which is again a view).

I modified the above code to work on an adjacency matrix backed by a BitArray, and it was ~75x slower than the Matrix{Bool} version. This is consistent with our previous work using BitVectors to represent data that could also be represented by Vector{Bool}: the BitVector saves memory, of course, but slows down accesses and inserts, even with the improvement in cache locality.

I’m also surprised it’d be so slow. I would have expected better performance with BitArray. You don’t happen to have a minimal code example?

Ok, let’s compare:

julia> function delta_triple(g::Matrix{Bool}, i, j)
           if i == j
               return 0
           else
               n = size(g)[1]
               c = zero(Int)
               @inbounds for k in Base.OneTo(n)
                   x1 = g[j, k]
                   x2 = g[i, k]
                   x3 = g[k, j]
                   x4 = g[k, i]
                   if x1 && x2
                       c += 1
                   end

                   if x3 && x2
                       c += 1
                   end

                   if x3 && x4
                       c += 1
                   end
               end

               if g[i, j]
                   c *= -1
               end
               return c
           end
       end

julia> function delta_triple(M::Matrix{UInt64}, MT::Matrix{UInt64}, i, j)
           i == j && return 0
           n_chunks = size(M, 1)
           ms = size(M, 2)
           c = 0
           checkbounds(M, 1, i)
           checkbounds(M, 1, j)
           @inbounds @simd for k = 1:n_chunks
               x1 = MT[k, j]
               x2 = MT[k, i]
               x3 = M[k, j]
               x4 = M[k, i]
               msk = M[k, ms]
               c += count_ones(x1 & x2 & msk)
               c += count_ones(x3 & x2 & msk)
               c += count_ones(x3 & x4 & msk)
           end
           i1, i2 = Base.get_chunks_id(i)
           Mij = !iszero(M[i1, j] & (1 << i2))
           return ifelse(Mij, -c, c)
       end

julia> function mk_chunks(m::Matrix{Bool})
           n = size(m, 1)
           n == size(m, 2) || error("matrix wants to be square")
           nchunks = Base.num_bit_chunks(n)
           bm_M = falses(nchunks << 6, n + 1)
           bm_MT = falses(nchunks << 6, n + 1)
           bm_M[1:n, end] .= true
           bm_MT[1:n, end] .= true
           bm_M[1:n, 1:n] .= m
           bm_MT[1:n, 1:n] .= m'
           M = reshape(bm_M.chunks, (nchunks, n + 1))
           MT = reshape(bm_MT.chunks, (nchunks, n + 1))
           return (bm_M, bm_MT), (M, MT)
       end


julia> function testfun(args, pairs)
           c = 0
           for (i, j) in pairs
               c += delta_triple(args..., i, j)
           end
           return c
       end
julia> using BenchmarkTools
julia> n=1000; m=rand(Bool, n, n); pairs = [(rand(1:n), rand(1:n)) for i=1:10_000];
julia> _, (bm, bmT) = mk_chunks(m);

julia> testfun((bm,bmT), pairs) == testfun((m,), pairs)
true

julia> @btime testfun((m,), pairs);
  50.432 ms (2 allocations: 32 bytes)

julia> @btime testfun((bm,bmT), pairs);
  455.779 μs (2 allocations: 48 bytes)

By chunking, we gained a factor of ~100 in speed, saved a factor of 4 in space, and obtained a new feature (masking of vertices).

I’d like to stress that, in the case of directed graphs, it is absolutely essential for speed to store both M and its transpose. No lazy transposes: a lazy transpose merely relabels indices, so its “columns” stride across memory instead of streaming contiguous chunks.


That’s interesting – I have a project where being able to easily subsample sets of nodes from a larger network could potentially be useful, and a common task in the sorts of models I’m working with is doing separate calculations across nodes with different colors (e.g., with a common attribute). I suppose your masking approach might be able to do that efficiently? Something like the sketch below is what I imagine.
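(This is just my guess at the idiom, reusing the bitmap column-count trick from above with a per-color mask; the function name is made up:)

# colormask is a BitVector with colormask[k] == true iff node k has the color
# of interest; intersecting with it restricts the count to third parties of that color.
red_triples(g::BitMatrix, gT::BitMatrix, colormask::BitVector, i, j) =
    count(gT[:, j] .& gT[:, i] .& colormask) +
    count(g[:, j]  .& gT[:, i] .& colormask) +
    count(g[:, j]  .& g[:, i]  .& colormask)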

I need to do more benchmarking… but generally:
(1) I re-ran some of my old C code, and it’s not quite as fast as I remembered for largish (>200) graphs. The speed I’m getting for “atomic” graph operations (e.g., the toggles) is broadly in line with what I was getting for graphs of the same size in my previous code. That’s both reassuring and a bit disappointing (because I really thought I remembered it being pretty fast up to 200 nodes or so).

(2) I am still a bit confused by some of the benchmarks I run (I’ll try to post a better MWE later). Sometimes it scales properly with the number of dyads (for a fixed graph size), sometimes it doesn’t. I’m having trouble narrowing down the issues. Overall, even when I get good speed on basic ops over graph edges, when I move to iterating over the entire graph multiple times, performance is just… not good.

(3) As you can probably tell, I’m trying to implement an ERGM in Julia; it’s one of the things I use in my research. I’d love to be able to at least get close to statnet/R’s performance in terms of generating random graphs. So far (using the R microbenchmark package), I beat statnet easily in getting the total number of subgraph statistics (which involves a single run across actual edges), but I lose in generating random graphs, sometimes badly and sometimes by a little.

But I have trouble determining when that happens. My testing so far (in my real code) is inconclusive; sometimes the operations over the graph take ms, sometimes µs, and it’s hard to see why I get one or the other.

So, more testing… argh!

Still, I REALLY appreciate the discussion so far! I don’t know that I’m super-qualified, but I’d love to have a package that handles various graph motifs (foobar, it sounds like this is an area you may be working in a bit). If anyone has advice, or if I can help contribute, please let me know. I think that, long term, Julia would be a much better platform for statistical network research than R.

Matrix{Bool} is not the same as BitMatrix. The former stores 8 bits per element; the latter, 1 bit packed. What am I missing here?

Earlier in the thread, I bemoaned the absence of graph-theory packages backed by bitmaps (one bit per element, packed, with each row padded to a multiple of 8 bytes), and you appeared to argue that Matrix{Bool} tends to fare better in practice.

I am arguing that properly aligned bitmatrices are better than Matrix{Bool}, and demonstrated that by showing how delta_triple gets 100x faster with 4x less space consumption this way (for directed graphs, storing both the adjacency matrix and its transpose).

Obviously such speedups are pretty situational: this works only for questions that want to be a circuit instead of code, because circuits can be evaluated in 64-bit chunks (which get SIMD’d up to 128 or 256 bits). However, quite a lot of questions can be worded as broadcasts over the chunks of the bit-packed adjacency matrix.

sbromberger is the master of the Julia graph theory ecosystem.

My experience with graph motifs and graph-isomorphism solvers is approximately nil.

However, if there existed a sensible way/algorithm for a Dict keyed on small bitmapped graphs modulo isomorphism, then I’d be super happy. And by small I mean tiny, <65 vertices.

I would use this for memoization: in my terrible, terrible clique/homology problems, I use a mix of reduction/decomposition (nice, polynomial time) and, if that fails, recursion (exponential time). Now the same irreducible subgraphs appear again and again; it might even be possible to have a full table up to ~20-30 vertices. Nobody else has looked at my notion of “irreducibility” yet, but I can’t really run numerical experiments without a reasonably fast graph-isomorphism solver / lookup table for tiny graphs.

And it does, at least in my experience. Here it is on the code above.

Original code on my machine (dense matrix of Bool):

  31.260 μs (0 allocations: 0 bytes)
  20.719 ms (0 allocations: 0 bytes)
  121.067 ms (0 allocations: 0 bytes)

changing the struct to a BitMatrix via a1 = BitMatrix(adjacency_matrix(g1, Bool)):

  60.865 μs (0 allocations: 0 bytes)
  25.277 ms (0 allocations: 0 bytes)
  130.481 ms (0 allocations: 0 bytes)

This comports with our experience in using BitMatrices and BitVectors in LightGraphs: where memory usage is a concern, they provide an advantage, but at the expense of slower accesses/modifications.

Edited to add: the performance speedup you achieved is from custom implementations of packed data structures, as opposed to the native structures in Base. This, too, mirrors my experience when I implemented graphs in Go: we made a custom bitvector implementation faster than a vector of booleans (primarily due to the cache-locality advantage, I think, but I could be wrong here). But this is not what’s been implemented in Base, and we have not yet tried implementing our own version of BitMatrices in LightGraphs.

Edit #2: my guess (untested) is that your speedup may be the result of using the transpose where it makes sense to do so. But this is just conjecture at this point.


This is typical of data accesses that do not take advantage of cache locality, especially when this happens after the data structures have grown to a particular size.

So we are in agreement then: Naive use of BitArray instead of Matrix{Bool} is useless.

Instead, somebody would need to write an entire package that implements both a custom bitmap data structure with convenience functions (like random access to entries) and a standard suite of algorithms, such that they make use of branchless 64-512-entry-wide operations on the adjacency matrices (g and transpose(g)). There is no generic auto-vectorization that can get us from Bool to packed UInt64; every single algorithm that wants to be fast on bitmaps must be rewritten to make use of the data structure.

Hence the need for a package that unfortunately does not exist yet :frowning:

But lots of stuff is possible. For example, a mostly branchless BFS iteration is super easy and fast (sketch below)! Likewise, bipartite max-matching should be pretty fast as well, but I have no idea how to do non-bipartite max-matching via SIMD instructions. Well, the hypothetical BitmappedGraphs.jl would not need to supersede the Matrix{Bool} approach, but instead complement it for problems that support SIMD, just like Matrix{Bool} complements the sorted adjacency list for medium-small dense graphs.
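A sketch of what I mean (my code, using the same chunked layout as above but without the mask column, for brevity): the frontier and visited sets are chunk vectors; expanding the frontier ORs whole columns, and filtering is one branchless AND-NOT pass.

# Frontier-based BFS over a packed adjacency matrix; M[k, v] is the k-th
# 64-bit chunk of column v (the out-neighbors of v). Returns the visited bitmap.
function bfs_bitmap(M::Matrix{UInt64}, src::Int)
    nchunks, n = size(M)
    visited  = zeros(UInt64, nchunks)
    frontier = zeros(UInt64, nchunks)
    next     = zeros(UInt64, nchunks)
    frontier[(src - 1) >> 6 + 1] |= UInt64(1) << ((src - 1) & 63)
    visited .|= frontier
    while any(!iszero, frontier)
        fill!(next, 0)
        for v in 1:n  # expand: OR together the columns of all frontier vertices
            if frontier[(v - 1) >> 6 + 1] & (UInt64(1) << ((v - 1) & 63)) != 0
                @inbounds for k in 1:nchunks
                    next[k] |= M[k, v]
                end
            end
        end
        next .&= .~visited    # drop already-visited vertices, 64 at a time
        visited .|= next
        frontier, next = next, frontier
    end
    return visited
end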

PS. I think this specific case is not just about cache-locality, but instead about handling 64 entries with a single branchless bitwise instruction.


We appear to be in violent agreement.

I haven’t worked with these structures enough to comment on the accuracy of this statement, but if it’s true it makes me sad.

I think it’s safe to say we would heartily welcome such a package in the JuliaGraphs ecosystem. :wink:

That’s a much more reasonable explanation for the speedup.

OK, although I suspect I may be at the limits of what I can realistically expect, I still have one nagging question about the performance of my code, and I’ve identified a peculiar performance case.

Basically, when I run my code doing ops over JUST the edges, it’s very nice and fast in all cases. Wonderful! However, as soon as I alter the loop so that it either (a) runs over ALL dyads (not just the edges) or (b) does some sort of test (like rand() < p) to give each dyad some probability of receiving the op, the performance is quite a bit worse than I would expect.

In case (a), it’s doing the same ops, but over 20 times the number of dyads as the baseline test (for a graph with 5% density). In case (b), it’s doing the same number of ops as the baseline (in my testing, on 5% of the dyads), obviously with the overhead of the rand() call and “<” test, which are both minuscule. However, both of those cases produce VASTLY worse results than just testing over edges.

The first case (a) is doing 20x the number of ops, and I’d expect it to scale roughly linearly. The second case (b) is doing the same number of graph ops, plus the slight overhead of the rand() call and comparison test. I’d expect (b) to be about the same order of magnitude as the baseline edge test, plus some extra overhead.

This is exactly the pattern I see when I test the “delta_edge” or “delta_mutual” operations. However, as I mentioned before, the triadic operations show some sort of really bad scaling.

For example, in the code below, doing the triple operation over just the edges takes about 0.12 ms in a 500-node graph with density 0.05. That’s perfectly fast. However, if instead of testing a dyad to see whether an edge exists I select each dyad to toggle with some probability p, the time goes to 156 ms, which is 1000x worse. A function that does NO testing but simply loops over all dyads scales linearly from that: 760 ms.

At this point, I just have to guess it’s some cache or compiler optimization issue, but it’s frustrating when I expect to see performance (for doing ops over all dyads in a 500-node network) in the sub-30 ms range, but instead I’m seeing 750+ ms.

using BenchmarkTools, LightGraphs

function delta_edge(G, i, j)
    @inbounds if i == j
        return 0.0
    elseif G[i,j]
        return -1.0
    else
        return 1.0
    end
end


# I have no scaling issues with 
# getting the number of edge changes
# or reciprocal dyad changes, or both.
# They scale exactly as expected.

function delta_mutual(g, i, j)
    @inbounds if i == j
        return 0.0
    elseif !g[j, i]
        return 0.0
    elseif !g[i, j]
        return 1.0
    else
        return -1.0
    end
end


# FWIW, I see the same issues with different versions of
# the code for transitive triples. Doing op over just edges, 
# inside "toggle" function is FAST. Anything else scales 
# badly.

function delta_ttriple(g, i, j)
    @inbounds if i == j
        return 0.0
    else
        c = 0
        for k in 1:size(g)[1]
            c += (g[j, k] * g[i, k]) + (g[k, j] * g[i, k]) + (g[k, j] * g[k, i])
        end

        if g[i,j]
            return -convert(Float64, c)
        else
            return convert(Float64, c)
        end
    end
end


# For a given dyad, record how the graph changes when 
# the tie is toggled on/off. Comment/uncomment to alter 
# which change statistics are being calculated for 
# benchmarking.

function toggle!(x, g, i, j)
    # @inbounds x[1] -= delta_edge(g, i, j)
    # @inbounds x[2] -= delta_mutual(g, i, j)
    @inbounds x[3] -= delta_ttriple(g, i, j)
end


# This loops over all dyads, and if it's an edge, 
# we toggle it off and record how the graph statistics 
# defined in the "toggle" function change.

function edge_loop!(x, g)
    n = size(g)[1]
    @inbounds for j in 1:n
        for i in 1:n
            !g[i, j] && continue
            toggle!(x, g, i, j)
            g[i, j] = !g[i, j]
        end
    end
end


# This loops over all dyads, and with prob p,
# toggles the edge/on/off and calculates
# the change in graph stats.

function prob_p_loop!(x, g, p)
    n = size(g)[1]
    @inbounds for j in 1:n
        for i in 1:n
            (rand() > p) && continue
            toggle!(x, g, i, j)
            g[i, j] = !g[i, j]
        end
    end
end


# This loops over all dyads, with no exceptions.

function full_single_loop!(x, g)
    n = size(g)[1]
    @inbounds for j in 1:n
        for i in 1:n
            toggle!(x, g, i, j)
            g[i, j] = !g[i, j]
        end
    end
end


# This loops over all dyads K times.

function multi_loop!(x, g, K)
    n = size(g)[1]
    @inbounds for k in 1:K
        for j in 1:n
            for i in 1:n
                toggle!(x, g, i, j)
                g[i, j] = !g[i, j]
            end
        end
    end
end


# Generate some test data and benchmark
g1 = erdos_renyi(500, 0.05; is_directed = true)
a1 = convert(Array{Bool, 2}, collect(adjacency_matrix(g1)))


# Speed of doing graph op & toggles JUST over 
# the edges of graph

x1 = zeros(3)
a2 = deepcopy(a1)
@btime edge_loop!($x1, $a2) 


# Speed of ops when choosing a dyad with prob p
# When p == density, should be the same order of
# magnitude as the first test above (plus some overhead)

x1 = zeros(3)
a2 = deepcopy(a1)
@btime prob_p_loop!($x1, $a2, $0.05) 


# Loop over all dyads, with no conditional tests
# in a graph of density .05, *should* be ~20x slower
# than the edge_loop function

x1 = zeros(3)
a2 = deepcopy(a1)
@btime full_single_loop!($x1, $a2)


# Iterate over the graph K= 5 times
# *Should* be approx K times slower than the 
# full_single_loop function, and about 
# 20*K times slower than the edge_loop function (assuming
# density of 0.05)

x1 = zeros(3)
a2 = deepcopy(a1)
@btime multi_loop!($x1, $a2, $5) 

Now, the one other data point I have is this: if instead of putting the “delta_” ops inside the toggle function, I just directly code them into the loop, e.g.:

function edge_loop!(x, g)
    n = size(g)[1]
    @inbounds for j in 1:n
        for i in 1:n
            !g[i, j] && continue
            x[3] -= delta_ttriple(g, i, j)
            g[i, j] = !g[i, j]
        end
    end
end

… then everything scales “properly” with the number of dyads, but the baseline edge_loop! gets slow… but ONLY when the transitive-triple count is being done. Somehow, wrapping the delta_ttriple call inside the toggle! function gives a huge speed boost for getting changes in transitive triads (and I’ve verified the output is correct), but ONLY for loops over edges. It’s very, very confusing.

I may just make do with what I have now and keep to <200-node graphs, but not understanding the causes of some of these patterns really makes me wonder if I’m doing something wrong. If I could get the speed of the basic edge_loop function to scale linearly to all dyads, that would be excellent, but I can’t tell whether that’s unrealistic.

And foobar, I am very curious about the method you described (although I’m afraid I’m not qualified to implement it!). Is the masking you describe pretty fast? I’m trying to think ahead to possible ways of efficiently subsampling/sectioning dyads from large or moderate-sized graphs for some projects down the line…

I haven’t had time to mess with the cool suggestion by foobar, but I did benchmark a crude version of a graph struct that includes the transpose (roughly the sketch below). Not worrying about anything else (e.g., chunking), it was still over 2x faster at that part of the algorithm (and about 50% faster in the full algorithm). In my use case, that seems worth the tradeoff in memory.
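(A reconstruction of what I mean; the names are made up:)

# Keep the adjacency matrix and its transpose in sync, so delta_ttriple can
# read only columns, which are contiguous in Julia's column-major layout.
struct DualAdj
    m::Matrix{Bool}     # m[i, j]: edge i -> j
    mt::Matrix{Bool}    # mt[j, i] == m[i, j]
end

function toggle_edge!(g::DualAdj, i, j)
    v = !g.m[i, j]
    g.m[i, j] = v
    g.mt[j, i] = v
    return v
end

function delta_ttriple(g::DualAdj, i, j)
    i == j && return 0
    c = 0
    @inbounds for k in 1:size(g.m, 1)
        c += (g.mt[k, j] & g.mt[k, i]) +   # same three terms as before,
             (g.m[k, j]  & g.mt[k, i]) +   # rewritten as column reads
             (g.m[k, j]  & g.m[k, i])
    end
    return g.m[i, j] ? -c : c
end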

When I have time, though, I’d like to play with some of these other ideas further. Also, although the basic algorithm I’m using works by toggling ties, you don’t (strictly speaking) have to do them one at a time: you can propose blocks of toggles, paired toggles, etc. That normally doesn’t do anything for efficiency (and more often hurts it), so nobody does it, but with a paired-transpose data structure, I could see toggling pairs actually giving some benefit.