Improving performance of network sampling/inference algorithm

Hi,

I’m trying to write a Julia package that does roughly what R’s ergm package does: fit, simulate and diagnose exponential-family models of networks. R’s ergm is an excellent and a very mature package, and I’m definitely not trying to port all of it; I’m aiming for a blend of different functionalities from the ergm ecosystem that would suit my needs (and hopefully the needs of other Julia users interested in network modelling). The main motivation is “hackability” - adjusting ergm to my needs was difficult, and I want something that would allow me to extend/tweak existing functionality in a high-level language without compromising performance.

I have an initial working version here, and so far I wasn’t able to get close to the performance of R’s ergm sampling/inference algorithms (written mostly in C); I’ve read several performance guides that were previously suggested on this forum, and tried to use different macros to get additional speedups, but I’ve reached a point where I would really appreciate feedback from more experienced Julia users (I’m quite new to the language).

In case anyone’s interested:

  • base.jl contains the main data structure, and various functions to compute graph and ERGM change statistics.
  • inference.jl contains an implementation of the double MH sampler from Liang, 2010.
  • sa.jl contains an implementation of the Stochastic Approximation algorithm from the appendix of Snijders, 2002.
  • Notebooks contain various experiments and comparisons with R’s ergm using RCall.

Most of this work was already done by @opera_malenky, who previously discussed the data structure on this thread; I’ve added the sa.jl file and other small changes.

I’d be happy to provide further details regarding the code/algorithms/anything else that might be helpful.

2 Likes

Although I wrote some of that early code for research, and didn’t focus on optimization too much, I did plenty of benchmarking while testing different things.

One thing I do recall is that both that code and statnet use the same approach (IIRC) for getting the overall subgraph counts (toggling a tie “off”, counting how the network statistics change, going to the next tie, repeat). I found that the Julia code was much faster than statnet at just getting a simple count of graph statistics (see the subgraphcount function here).

But for some reason, my original code was always significantly slower than statnet at generating new random graphs (which are an important part of most of the ergm estimation algorithms out there), which would be the rgraph function in the repository.

But, of course, the algorithms for that original code and what statnet uses (by default) for generating random graphs are different – statnet uses a TNT (tie-no-tie) sampler, as opposed to just naively iterating over each edge – and I never quite compared how long it took the two sets of code to generate new graphs that were sufficiently uncorrelated with the original graph, so it may be that simply comparing the raw time to generate X number of new graphs was a misleading metric… however, I never really dug into that.

Another aspect is that statnet’s network objects use (IIRC) an edgelist representation, instead of a matrix-based representation. I did that because the code is dominated by finding/changing specific edges. Those operations are very fast in a matrix, but a lot slower in an edgelist; however, it could be that something about the way statnet generates random graphs is much more amenable to an edgelist format (I tend to doubt it, but it’s a possibility).

Anyway, I’m especially curious if anyone sees ways to speed up either the rgraph or change_scores functions.

I’m reasonably sure that most of the various delta_ functions for particular subgraph change scores are about as fast as they could be. However, this is where the algorithm spends most of it’s time, so if anyone does see a way to speed those up – especially the _ttriple and _ctriple functions, which are the most computationally expensive – it would be a big boost.

In case you don’t know it yet, I recently stumbled over this paper, which has some information on the internal graph representation used by the ergm package (section 4). Apparently they use nodewise edgelists separated for incoming and outgoing edges in some binary tree format. I’m not sure how that in theory compares on computing statistics and edge access to something like LightGraph’s adjacency list model.