According to its source code, Pkg.jl uses a dependency resolving algorithm called MaxSum to resolve the package dependency graph. However, I was not able to find any documentation about this algorithm, or any concrete reference to some scientific paper describing it. Can someone add or provide some documentation on this algorithm and maybe some discussion what are the rationales and design decisions behind it? What are the advantages of MaxSum compared to other dependency resolvers, such as PubGrub or SAT solvers? How does MaxSumâ€™s performance relate to the performance of e.g. PubGrub or libsolv?

Yes, Iâ€™ve always wanted to write down some detailed notes about this but never got around to doing it. Iâ€™ll try to sketch some basics here. Please forgive the unpolished resultâ€¦

MaxSum is a limiting case of a more well-known message-passing algorithm for statistical inference called Belief Propagation, also known as SumProduct. In order to derive this limiting case, consider writing a probability p(x) as \exp(\beta \phi(x)) and sending \beta \to \infty. The new quantities \phi can be interpreted as â€śnegative energiesâ€ť in a physical analogy, or â€śnegative costsâ€ť/â€śutilitiesâ€ť in an optimization context. After the limit the Belief Propagation (BP) equations for the probabilities become the MaxSum equations for the log-probabilities-divided-by-\beta, i.e. for the costs. The BP equations (that you can look up on Wikipedia or many other places) change in 3 ways:

- sums are replaced by maximums
- products are replaced by sums
- normalization is performed by shifting (so, by an additive term) rather than by scaling (a multiplicative term)

You can see then that SumProduct becomes MaxSum.

The MaxSum algorithm is not used for statistical inference, but rather for optimization. The BP equations are called iteratively until a fixed point is reached, and from the fixed point one can obtain a parameterization of a probability distribution. In MaxSum, the equations are also called iteratively, but if convergence is achieved the resulting quantities describe the mode of the probability distribution (or the modes if there is more than one, but one normally tries to ensure that a single mode is present, if necessary by breaking ties arbitrarily by adding some tiny random noise to the scores).

Just like BP, MaxSum is only exact when the graph topology on which it runs is a tree. In that case, MaxSum is basically a form of Dynamic Programming. However, BP and MaxSum can also be run over graphs with loops. In that case they are approximate, but if the loops are not â€śtoo structuredâ€ť the approximation may still be excellent, and the algorithm fast (see below). In that case, however, neither BP nor MaxSum are guaranteed to reach convergence. There are heuristic techniques to work around this and drive the algorithm to produce a configuration. The one used in Juliaâ€™s resolver is called â€śdecimationâ€ť, and it works like this. Perform a number of MaxSum iteration, then pick a few variables of the graph, those which are most â€śpolarizedâ€ť (those that exhibit the largest difference between their favorite state and their second-most-favorite state) and fix their values. This effectively simplifies the graph (and generally breaks some loops). Then start over on the simplified graph, until all the variables are fixed or a contradiction is detected. A contradiction is manifested in the algorithm when a hard constraint is being violated.

In the Juliaâ€™s resolver, this is augmented by a back-tracking mechanism: if a contradiction is detected, we try to revert the last decimation step and pick a different choice for the variables to decimate, and their values.

The representation of the problem in Juliaâ€™s solver is as follows: each package is a node of the graph, and it has as many possible states as there are package versions, plus there is one special state which is â€śuninstalledâ€ť. Each pair of packages may be related by dependency constraints: this is just a table T_{ij} with n_i \times n_j entries (where n_i and n_j are the number of states of nodes i and j), and T_{ij}[v_i,v_j]=1 if version v_i of package i is compatible with version v_j of package j, and 0 otherwise. Furthermore, constraints about required packages are implemented as â€śexternal fieldsâ€ť. These are a priori â€śutilitiesâ€ť, and for example we may say that the utility of the â€śuninstalledâ€ť state of a required package is -\infty.

On top of this, there are preferences for some versions of a package compared to others. In particular, we prefer the uninstalled state, if available, and higher versions over lower ones. This is also implemented via the same â€śexternal fieldâ€ť mechanism. In order to look at major versions first, then minor versions, then patch, we define a new numeric type. Another level of preference (e.g. giving precedence to maximizing the versions of requires packages instead of non-required ones) is implemented with yet another numeric type.

Finally, the result of MaxSum may be suboptimal. We thus push the solution towards a local optimum via a local greedy search (basically, a hill-climbing algorithm).

As for why using MaxSum may be a good idea, hereâ€™s the reasoning. First, message-passing algorithms scale nicely with the size of the graph. A message update is O(n k v^2) where N is the number of nodes (packages), k is the mean number of nodes neighbors (how many other packages are involved in dependency constraints for any given package), and v is the maximum number of versions of a package (this last one is pessimistic, but it boils down to the size of the tables T_{ij} mentioned above, and we can trim it down by version pruning). The number of message updates before decimation and the fraction of variables to decimate (fix) at each step can be set manually and so the number of steps is also manageable. The backtracking unfortunately may make the algorithm exponential, however this hasnâ€™t been a problem in practice so far (as far as Iâ€™m aware) and if it becomes one itâ€™s not difficult to put a hard constraint on that. So one strong motivation for this type of algorithm is its scaling.

Another motivation is that is pretty natural to: 1) work on a straightforward representation of the optimization problem, without converting it into a boolean problem of some kind; 2) implement in the same scheme both hard constraints (required packages, dependency relations) and soft preferences (higher versions are better, required packages have precedence, etc).

Finally, heuristic experience shows that indeed these message-passing schemes can work very well in a variety of NP-hard optimization problems, and examples abound there. A famous one is Affinity Propagation, which is just MaxSum applied to the k-medoids clustering problem. Also, this one is MaxSum applied to the Steiner Tree problem. This one is an application to neural networks with discrete weights. For SAT problem, a related algorithm works very well on random instances, and can be augmented with backtracking. Etc.

to answer the question about why we donâ€™t use full SAT solvers, the answer is that the number of packages installed is much smaller than the number of total packages (100 vs 6000 for typical users), so algorithms that scale super-linearly with number of total packages would likely be unacceptably slow.

Iâ€™ve actually written a new resolver algorithm thatâ€™s implemented here. Unlike the current algorithm (or SAT solvers), which only finds one solution, it finds all Pareto-optimal solutions. If no complete solution is possible it will try to find incomplete solutions that satisfy some of the requirements (but collectively cover all of them); if thatâ€™s not possible either, it will find solutions with some incompatibilities. It has very thorough tests for small test cases already and Iâ€™ve started to test it out on real registry data, but have gotten a little stuck on that due to lack of time. Amusingly, loading the registry data efficiently is currently the bottleneck and solving seems to be quite fastâ€”the algorithm is potentially exponential (as any NP-hard problem is believed to be), but seems to be fast in practice.

Why is it useful to find all Pareto-optimal solutions or near-solutions? If thereâ€™s a unique optimal solution then it doesnâ€™t matterâ€”you just use that one. In the case where multiple solutions exist, then for any pair of solutions there must be conflicts between some of their package versions. Letâ€™s call the conflicted packages `A`

and `B`

. We can ask users whether they want a solution that prioritizes a newer version of `A`

at the expense of an older version of `B`

or vice versa. They might not care but they might have a reason to prefer one or the other. When there are no solutions, we can present the problem more clearly to users by giving them options for how to solve the problem: they can either remove requirements or fix conflicts between some of the packages their project depends on. I believe this â€śaction orientedâ€ť approach to presenting resolver conflicts will be easier for users to understand.

Cool! There could be exponentially many optimal solutions - does that just not happen in practice?

There are exponentially many non-optimal solutions but the number of optimal solutions is linear in the number of conflicts on the optimality boundary, which is not that big in practice.