What is the Pkg.jl package dependency resolver (solver algorithm)?

Yes, I’ve always wanted to write down some detailed notes about this but never got around to doing it. I’ll try to sketch some basics here. Please forgive the unpolished result…

MaxSum is a limiting case of a better-known message-passing algorithm for statistical inference called Belief Propagation, also known as SumProduct. To derive this limiting case, write a probability p(x) as \exp(\beta \phi(x)) and send \beta \to \infty. The new quantities \phi can be interpreted as “negative energies” in a physical analogy, or as “negative costs”/“utilities” in an optimization context. After taking the limit, the Belief Propagation (BP) equations for the probabilities become the MaxSum equations for the log-probabilities divided by \beta, i.e. for the costs. The BP equations (which you can look up on Wikipedia or in many other places) change in 3 ways:

  • sums are replaced by maximums
  • products are replaced by sums
  • normalization is performed by shifting (so, by an additive term) rather than by scaling (a multiplicative term)

You can see then that SumProduct becomes MaxSum.
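The key identity behind all three changes is the standard \beta \to \infty limit of a log-sum-exp, written out here to make the correspondence concrete:

\lim_{\beta \to \infty} \frac{1}{\beta} \log \sum_x e^{\beta \phi(x)} = \max_x \phi(x), \qquad \frac{1}{\beta} \log \left( e^{\beta \phi_1} \cdot e^{\beta \phi_2} \right) = \phi_1 + \phi_2

so sums over probabilities turn into maximums over utilities, products turn into sums, and dividing by a normalization constant becomes subtracting a constant.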

The MaxSum algorithm is not used for statistical inference, but rather for optimization. The BP equations are iterated until a fixed point is reached, and from the fixed point one can obtain a parameterization of a probability distribution. In MaxSum, the equations are also iterated, but if convergence is achieved the resulting quantities describe the mode of the probability distribution (or the modes if there is more than one, but one normally tries to ensure that a single mode is present, if necessary by breaking ties arbitrarily by adding some tiny random noise to the scores).
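To make this concrete, here is a minimal sketch in Julia of a single MaxSum message update for a pairwise model. All names here are made up for illustration, this is not Pkg.jl’s actual code: `u_j` is the total utility of each state of node j (its external field plus all incoming messages except the one from i), and `T_ij` encodes the hard pairwise constraint as 0.0 (compatible) or -Inf (incompatible):

```julia
# Sketch of one MaxSum message update (illustrative, not Pkg.jl's code).
# T_ij[vi, vj] = 0.0 if state vi of node i is compatible with state vj
# of node j, and -Inf otherwise; u_j[vj] is node j's current utility.
function maxsum_message(T_ij::Matrix{Float64}, u_j::Vector{Float64})
    # for each state of i, maximize over the states of j (max replaces sum)
    m = [maximum(T_ij[vi, :] .+ u_j) for vi in axes(T_ij, 1)]
    return m .- maximum(m)  # normalize by shifting, not by scaling
end
```

Iterating such updates over all edges until the messages stop changing is the fixed-point search described above; utilities add along the way where probabilities would have multiplied.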

Just like BP, MaxSum is only exact when the graph topology on which it runs is a tree. In that case, MaxSum is basically a form of Dynamic Programming. However, BP and MaxSum can also be run over graphs with loops. In that case they are approximate, but if the loops are not “too structured” the approximation may still be excellent, and the algorithm fast (see below). In that case, however, neither BP nor MaxSum is guaranteed to converge. There are heuristic techniques to work around this and drive the algorithm to produce a configuration. The one used in Julia’s resolver is called “decimation”, and it works like this: perform a number of MaxSum iterations, then pick a few variables of the graph, those which are most “polarized” (those that exhibit the largest difference between their favorite state and their second-most-favorite state), and fix their values. This effectively simplifies the graph (and generally breaks some loops). Then start over on the simplified graph, until all the variables are fixed or a contradiction is detected. A contradiction manifests itself in the algorithm as a violated hard constraint.

In the Julia’s resolver, this is augmented by a back-tracking mechanism: if a contradiction is detected, we try to revert the last decimation step and pick a different choice for the variables to decimate, and their values.

The representation of the problem in Julia’s solver is as follows: each package is a node of the graph, and it has as many possible states as there are package versions, plus there is one special state which is “uninstalled”. Each pair of packages may be related by dependency constraints: this is just a table T_{ij} with n_i \times n_j entries (where n_i and n_j are the number of states of nodes i and j), and T_{ij}[v_i,v_j]=1 if version v_i of package i is compatible with version v_j of package j, and 0 otherwise. Furthermore, constraints about required packages are implemented as “external fields”. These are a priori “utilities”, and for example we may say that the utility of the “uninstalled” state of a required package is -\infty.
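In code, a toy version of this encoding could look like the following (the package names, sizes, and the example constraint are all made up for illustration):

```julia
# Toy encoding of the representation (illustrative, not Pkg.jl's code).
# Each package gets one node; state 1 is "uninstalled" and states
# 2..n+1 correspond to its n versions, in order.
nA, nB = 3, 4            # A: uninstalled + 2 versions, B: uninstalled + 3

# Compatibility table: T_AB[vA, vB] == 1 iff state vA of package A is
# compatible with state vB of package B. Say A@v2 (state 3) requires
# B@v2 or later:
T_AB = ones(Int, nA, nB)
T_AB[3, 1] = 0           # A@v2 is incompatible with B uninstalled
T_AB[3, 2] = 0           # A@v2 is incompatible with B@v1

# External field: package A is required, so its "uninstalled" state
# gets utility -Inf (hard constraint); the other entries stay neutral.
ϕ_A = zeros(nA)
ϕ_A[1] = -Inf
```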

On top of this, there are preferences for some versions of a package compared to others. In particular, we prefer the uninstalled state, if available, and higher versions over lower ones. This is also implemented via the same “external field” mechanism. In order to look at major versions first, then minor versions, then patch, we define a new numeric type. Another level of preference (e.g. giving precedence to maximizing the versions of required packages over non-required ones) is implemented with yet another numeric type.
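A simplified sketch of such a numeric type (the name and the details here are made up; the real type in the resolver is more elaborate): addition is componentwise, while comparison is lexicographic, so when maximizing a sum of these weights any gain in the major component dominates any gain in minor or patch.

```julia
# Sketch of a lexicographic "version weight" type (illustrative).
struct VWeight
    major::Int
    minor::Int
    patch::Int
end

# componentwise addition: utilities of different packages accumulate
Base.:+(a::VWeight, b::VWeight) =
    VWeight(a.major + b.major, a.minor + b.minor, a.patch + b.patch)

# lexicographic comparison: major first, then minor, then patch
Base.isless(a::VWeight, b::VWeight) =
    (a.major, a.minor, a.patch) < (b.major, b.minor, b.patch)

Base.zero(::Type{VWeight}) = VWeight(0, 0, 0)
```

The extra level of preference (required vs. non-required packages) can be obtained by nesting the same trick once more, i.e. pairing two such weights and again comparing lexicographically.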

Finally, the result of MaxSum may be suboptimal. We thus push the solution towards a local optimum via a local greedy search (basically, a hill-climbing algorithm).
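Schematically, this final refinement is plain hill climbing over single-variable moves (again an illustrative sketch with invented names, where `utility(x)` is assumed to return the total utility of a full assignment, -Inf if any hard constraint is violated):

```julia
# Greedy local search (illustrative sketch): starting from the MaxSum
# assignment x, move one variable at a time to the state that most
# improves the total utility, until no single move helps.
function hillclimb!(x::Vector{Int}, utility, nstates::Vector{Int})
    improved = true
    while improved
        improved = false
        for i in eachindex(x)
            best_v, best_u = x[i], utility(x)
            for v in 1:nstates[i]
                old = x[i]
                x[i] = v               # tentatively try state v
                u = utility(x)
                x[i] = old             # undo the move
                u > best_u && ((best_v, best_u) = (v, u))
            end
            if best_v != x[i]
                x[i] = best_v          # commit the best improving move
                improved = true
            end
        end
    end
    return x
end
```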

As for why using MaxSum may be a good idea, here’s the reasoning. First, message-passing algorithms scale nicely with the size of the graph. A round of message updates is O(N k v^2), where N is the number of nodes (packages), k is the mean number of neighbors of a node (how many other packages are involved in dependency constraints for any given package), and v is the maximum number of versions of a package (this last one is pessimistic, but it boils down to the size of the tables T_{ij} mentioned above, and we can trim it down by version pruning). The number of message updates before decimation and the fraction of variables to decimate (fix) at each step can be set manually, so the number of steps is also manageable. The backtracking unfortunately may make the algorithm exponential; however, this hasn’t been a problem in practice so far (as far as I’m aware), and if it becomes one it’s not difficult to put a hard limit on that. So one strong motivation for this type of algorithm is its scaling.

Another motivation is that it is pretty natural to: 1) work on a straightforward representation of the optimization problem, without converting it into a boolean problem of some kind; 2) implement in the same scheme both hard constraints (required packages, dependency relations) and soft preferences (higher versions are better, required packages have precedence, etc.).

Finally, heuristic experience shows that these message-passing schemes can indeed work very well on a variety of NP-hard optimization problems, and examples abound. A famous one is Affinity Propagation, which is just MaxSum applied to the k-medoids clustering problem. Other examples are MaxSum applied to the Steiner tree problem, and an application to neural networks with discrete weights. For the SAT problem, a related algorithm works very well on random instances, and can also be augmented with backtracking. Etc.
