[Ann] Sparips.jl: Practical sparsification of Rips complexes (topological data analysis)

Ok, let’s collect what we want interfaces for.

What sparips wants interfaces for:

  1. distances: Using Distances.Metric seems fine.
  2. Sparse filtered graphs: Each vertex and each edge has a filtration value. Maybe use something from the LightGraphs universe?
  3. Tolerances: For approximations, we need error tolerances. A tolerance is an interleaving; basically a nondecreasing function psi with psi(r) <= r. We probably want to require the inverse of psi as well, and potentially some other transforms. Potentially, we would permit two functions (in sparips, the reverse direction of psi is always the identity). The interface would specify some functions and their assumed properties (see the sketch after this list).
  4. Persistence diagrams / barcodes: Basically an array of arrays of tuples (dimension, list of classes, birth, death). Also, optional metadata on each class (solver statistics, representative cycles, cocycles) and metadata on the entire diagram (e.g. the field in which it was computed).
  5. Approximate persistence diagrams: A persistence diagram plus a tolerance. This is important for plotting and for filtering artifacts coming from an approximation.
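For item 3, here is a minimal sketch of what such a tolerance interface could look like (all names are hypothetical; AffinePrecision mirrors the example in the workflow further down):

abstract type AbstractPrecision end

# psi(p, r): lower interleaving map, nondecreasing with psi(p, r) <= r
# psi_inv(p, r): the map in the reverse direction, with psi_inv(p, r) >= r
function psi end
function psi_inv end

# affine tolerance: additive error a and relative error c, so the
# interleaving is identity / r -> c * (r + a)
struct AffinePrecision{T} <: AbstractPrecision
    a::T
    c::T
end
psi(p::AffinePrecision, r) = r                    # sparips: identity downward
psi_inv(p::AffinePrecision, r) = p.c * (r + p.a)  # upward direction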

How would I restructure my code:

  1. excise ripser bindings: in goes filtered graph, out goes persistence diagram.
  2. excise plotting: in goes persistence diagram and optional tolerance, out goes plot.
  3. remaining in sparips.jl: in goes metric, (opaque) data and tolerance, out goes sparse filtered graph.
  4. new (trivial): in goes metric and (opaque) data; out goes dense filtered graph (effectively a distance matrix; see the sketch after this list).
  5. new: some bindings for gudhi. Gudhi provides an alternative sparsification, with relative error tolerances. Gudhi also provides solvers.
  6. new: Some statistics on filtered graphs (to measure how sparse they are, and how bad their rips complex will be).
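Item 4 in code, as a sketch (DenseFilteredGraph is a hypothetical wrapper type; metrics from Distances.jl are callable, so a comprehension suffices):

using Distances

struct DenseFilteredGraph{T}
    dists::Matrix{T}  # dists[i, j] is the filtration value of edge (i, j)
end

function densefiltration(metric::Distances.Metric, data)
    # the dense filtered graph is effectively the full distance matrix
    DenseFilteredGraph([metric(x, y) for x in data, y in data])
end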

Something that is very questionable: input is a filtered graph, output is a filtered simplicial complex (clique complex / Rips complex / induced flag complex) up to some dimension. The reason: too much memory consumption. If you know that your complex is flag, then it should be an internal decision of the solver whether to ever commit the complex / its boundary matrix to memory (imo a bad decision) or recompute it on the fly (as ripser mostly does).

Extra things that are relevant: Many sparsification approaches can also produce a smaller zig-zag filtration that is homotopy equivalent to the original one. We should leave room for that in the filtered graph interface, in order to accommodate future solvers that can make use of this structure. As far as I know, all these approaches work by just reducing the number of points. So each vertex gets not just a filtration time filt_time(g, v) (such that v is not present at smaller times), but also an optional zig_zag_time(g, v), which asserts that v can be deformed away at larger times. Hence, v is only present in the zig-zag if filt_time(g, v) < t < zig_zag_time(g, v). Also, an optional zig_zag_parent(g, v, t), for t > zig_zag_time(g, v), which asserts that mapping v -> parent is a deformation at time t. (Obviously this needs some bike-shedding; a sketch follows below.)
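To make this concrete, here is a sketch of these hooks with defaults that make an ordinary filtered graph a degenerate zig-zag (names as above, everything else an assumption):

abstract type AbstractFilteredGraph end

# mandatory: v is not present at times t < filt_time(g, v)
function filt_time end

# optional: v can be deformed away at times t > zig_zag_time(g, v);
# the default Inf means "never", i.e. no zig-zag structure
zig_zag_time(g::AbstractFilteredGraph, v) = Inf

# optional: for t > zig_zag_time(g, v), mapping v -> parent is a
# deformation at time t; nothing means no such parent is known
zig_zag_parent(g::AbstractFilteredGraph, v, t) = nothing

Solvers that cannot exploit the zig-zag structure can simply ignore the two optional functions.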

What are the interfaces you would like to have?

Do you want to support non-flag simplicial complexes?

I have programmed a generic interface for complexes along with a simplicial complex implementation. Implementing a CW complex shouldn’t be a problem.

For any complex, it is possible to construct a sparse boundary matrix as soon as the complex implements some interface functions.

From my experience, I advise creating proper structures for cells (simplices) and persistence intervals. Using tuples initially looks appealing, but in the future, when additional information needs to be added to these structures, it will be a lot of work to rewrite the existing code.

I imagine a basic workflow like this:

data = #... some points, stored as Vector{SVector} and a Metric or a (sparse) distance matrix or a Graph, etc...
filtration = RipsFiltration(data)
result = persistent_[co]homology(filtration, dim_max = 2)
# do things with result

Here, the RipsFiltration can be just a thin wrapper over the data, or it can be a full-blown complex. The algorithm is then chosen based on dispatch, and the algorithm itself decides whether it wants to commit the whole thing to memory or not.
We would have one type of Filtration (or whatever the best name for this is) per solver, e.g. RipserFiltration, one for each method in Gudhi, etc.
The idea is to get something similar to the DiffEq ecosystem where you can swap solvers around pretty painlessly.

I think flag complexes are enough for now, but we should make sure that the interfaces are flexible enough to make supporting other kinds of complexes easy.

Edit: I’ll whip up a quick diagram package prototype. I already have some old code for that lying around.

Ok, so here is how I would imagine variations of the workflow (open to immense bikeshedding):

using TopologyBase, Plots
import Sparips
import SomeSolver
import Distances
import PersistencePlots
import PersistenceExtra
data = #... some points, stored as Vector{SVector} and a Metric or a (sparse) distance matrix or a Graph, etc...
precision = AffinePrecision(0.03, 1.25) #implies an interleaving with identity / x-> 1.25*(x+0.03)
filtration = ripsfiltration(data; distance = Distances.Euclidean(), precision = precision, method = Sparips.APosteriori())
result = persistent_homology(filtration; method = SomeSolver.Cohomology_Twist(), representatives = false, p = 5, dims = 0:3) #compute in GF(5), persistence pairs only, in dimensions 0:3
plot(result; approx_diag = true, logscale = true, filter = :default, style = :bars, dim = 2) # `import PersistencePlots` adds the relevant plot methods; we want a visualization of the approximation, barcodes instead of a persistence diagram, logscale, the default filter to hide very short bars, and only dimension 2
some_other_result = #...
bottleneck(result, some_other_result; dist = :auto, dim = 1) # compute the bottleneck distance of 1-dimensional bars; use the precisions to guess which variant the user means (a mix of additive bottleneck and log-bottleneck in the case of affine precisions; if the precision is e.g. `ExactPrecision()`, use additive). Function declared in TopologyBase and implemented in TopologyExtra.

Then, there are alternatives: instead of ripsfiltration, one could use alphafiltration, or weighted_rips, or witness, etc., provided that somebody implements it and solvers can work with the result.

Since we can’t really dispatch on keyword arguments, TopologyBase would define some trivial functions to make the relevant keywords participate in dispatch. For example, method is a mandatory argument, so the keyword version calls a variant with it in a positional slot, which should bring us into the right module, which can then unpack all the other keywords it understands. The point of making it a keyword at all is that interactive users don’t have to remember a sequence of 5 positional arguments, if they are OK with inference possibly failing on their outer call.
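A hypothetical sketch of that forwarding pattern:

# in TopologyBase: the keyword entry point moves `method` into a
# positional slot, so that solver packages can dispatch on its type
function persistent_homology(filtration; method, kwargs...)
    persistent_homology(method, filtration; kwargs...)
end

# in SomeSolver: owns the positional variant and unpacks the keywords
# it understands
function persistent_homology(method::Cohomology_Twist, filtration;
                             p = 2, dims = 0:1, representatives = false)
    # ... actual solver ...
end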

Wow, this is making really nice progress!

In order:

Agreed!

There is a weighted graph type in SimpleWeightedGraphs.jl (JuliaGraphs; edge-weighted graphs compatible with Graphs.jl), but it’s just implemented as a sparse matrix. I would suggest just using AbstractSparseMatrix here.

Instead of an array of tuples (sorry for the self-promotion), one could use a StructArray. This makes it possible to store a vector of structures as a bunch of vectors, which in turn allows for easier editing / data sharing. Here it’s probably not performance-critical, but elsewhere it may be relevant.
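For example (a minimal sketch; PersistenceInterval is a hypothetical struct, not an existing type):

using StructArrays

struct PersistenceInterval{T}
    birth::T
    death::T
end

ints = StructArray(PersistenceInterval(b, b + 1.0) for b in rand(10))
ints.birth  # all births as one contiguous Vector{Float64}
ints.death  # likewise for the deaths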

:100:
In PersistentCohomology, the strategy is to store only the simplices (in sorted order) and then look up the information with searchsortedfirst(simplices, simplex). This makes it possible to compute the value of a cochain on a boundary very efficiently without storing the boundary.

The only real requirement that I would like to add here is that the length of the simplex be encoded in its type (so that face and boundary functions can be optimized generated functions) and that making a simplex from a tuple does not allocate.

The approach used in PersistentCohomology is to use tuples for simplices and store “metadata” about a simplex in a Cochain object, see here. Then cochain[simplex] calls cochain.values[searchsortedfirst(cochain.simplices, simplex)] (cochain.values will often be sparse). Would this satisfy your requirements?
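In code, the lookup is roughly this (illustrative names, not the actual PersistentCohomology types):

struct Cochain{S, T}
    simplices::Vector{S}  # kept sorted
    values::Vector{T}
end

function Base.getindex(c::Cochain, simplex)
    i = searchsortedfirst(c.simplices, simplex)
    # assumes the simplex is present; a real implementation would handle misses
    c.values[i]
end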

Alternatively, one can view the simplex itself as containing both the information on the vertices (which can be used as a key to identify it) and some metadata (the value). So as long as Tuple(s) is an efficient and type-stable way to get the tuple of vertices, things should be fine.

Regarding the boundary matrix: We should definitely avoid sparse matrices and rather rely on something like boundary_operator(complex, simplex), coboundary(complex, simplex; filtration_val = r), faces(complex, simplex; filtration_val = r). Likewise, I think we cannot use the admittedly nice syntax some_cochain[some_simplex], and should rather take evaluate(complex, some_cochain, some_simplex).
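One possible shape of that interface, assuming boundary_operator yields (face, coefficient) pairs (all signatures are up for bikeshedding):

# declared in the base package, implemented per complex type
function boundary_operator end
function coboundary end
function faces end
function evaluate end

# generic fallback: evaluate a cochain on the boundary of a simplex using
# only the interface functions, without materializing any matrix
evaluate_on_boundary(complex, cochain, simplex) =
    sum(coeff * evaluate(complex, cochain, face)
        for (face, coeff) in boundary_operator(complex, simplex))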

We also need to distinguish between simplices and cells: Homology-CW-complexes are pretty nice, computationally speaking.

Side question: Suppose I am super greedy and want the full cohomology ring (possibly even Massey products) of a CW complex. What data do I need?

Clearly, homology of the boundary maps is not enough (e.g.: if I have a 2-cell with zero homological boundary, I cannot determine the connected component it lies in). Clearly, boundary maps modulo homotopy are too much: we don’t need data that fine (and representing boundary maps modulo homotopy is possible, via an abstract triangulation of the boundary sphere, but is super ugly, and simplifying/deforming such representatives is undecidable). So, what data would I need on the boundary maps?

I fully agree on using these functions instead of sparse matrices, but I’m not sure I understand the need to add the complex object to the function call. Why not coboundary(simplex), cochain[simplex], etc.? The (co)boundary does not depend on the filtration value or on the complex structure, and neither does the value of the cochain on a simplex. The list of tuples returned by boundary(simplex) can then be used to find the indices of those simplices in the complex (which will be very efficient if they are sorted).

I’m very interested in adding structure to the persistent cocycles, but unfortunately don’t have very good ideas on how to do that.

Here is what I hacked together today.
It’s a package that defines the basic abstract and concrete types for intervals and diagrams along with two basic plotting functions. It’s not meant to be remotely final, but just a demonstration of an option. Please bikeshed to death and back. Issues/PRs welcome as well.
I have never used StructArrays, but I think that’s a pretty good idea, and it shouldn’t be too hard to add. I tried to document everything with docstrings, but let me know if you find anything you don’t understand.

@foobar_lv2 the workflow you wrote down looks like something I imagined. Obviously the design has to be tested and iterated, but it looks OK to me so far.

About the simplices: I would define the types as:

using StaticArrays

# N-simplex with vertices of type U, behaving as a static vector of length N
# (StaticVector{N, U} == StaticArray{Tuple{N}, U, 1}; the last parameter is
# the number of array dimensions, so it must be 1, not N)
abstract type AbstractSimplex{N, U} <: StaticVector{N, U} end

# Vanilla simplex. Add more simplex types for simplices with attached data.
struct Simplex{N, U} <: AbstractSimplex{N, U}
    vertices::NTuple{N, U}
    # maybe the constructor should sort the vertices?
end
#... methods for getindex, faces etc.

I think there are some nice benefits to having simplices be (static) arrays (essentially, you get some nice methods for free), but I’m not 100% sure this is the right way. I’m also not sure how cells fit into this.

By the way, should we split this discussion into a new thread?

In terms of intervals, couldn’t we just use IntervalSets.jl (JuliaMath)? It seems fully featured and has the option Interval{:closed, :open}, which I think is the one we need.
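For instance (the interpretation as a persistence bar is ours; the API is IntervalSets.jl as published):

using IntervalSets

bar = Interval{:closed, :open}(0.2, 1.3)  # a bar born at 0.2, dying at 1.3
0.2 in bar  # true: the class is alive at its birth time
1.3 in bar  # false: it is already gone at its death time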

# Vanilla simplex. Add more simplex types for simplices with attached data.
struct Simplex{N, U} <: AbstractSimplex{N, U}
    vertices::NTuple{N, U}
    # maybe the constructor should sort the vertices?
end

Nice! I’d skip sorting the vertices on construction. The reason is somewhat convoluted: if you implement the “inductive construction” of Vietoris-Rips from a filtered graph (which seems like a reasonably efficient one according to this paper), you get simplices that are already sorted in lexicographic order if you list the vertices from biggest to smallest (so an example simplex would look like (31, 25, 1)). This turns out to be very convenient (for looking up a specific simplex in the complex), and I’d rather not give it up. An alternative would be to sort the vertices from small to big and then implement an efficient antilexicographic comparison for simplices, but that seems more complicated.

This is true; there is no need to sort simplices for the “inductive construction” of Vietoris-Rips, because the total order on the vertices of the low-dimensional simplices is imposed on the high-dimensional ones during the construction. However, if you try to do the “incremental construction”, which uses a simplex search, then you would need to keep the simplex vertices ordered; otherwise, during the search you’d need to look at all permutations of the vertices. Similarly, enumerating the faces of a simplex would require a total order on all vertices in the complex.

I think hashing of the vertices can be used instead of sorting for a fast search: the 0-simplex hashes are used to construct the hashes of any n-simplex (n > 0), and these hashes induce a particular total order that can be searched with O(log n) complexity (i.e. searchsorted).
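A minimal sketch of that idea (the commutative xor combination is an assumption, and real code would still need an equality check to guard against hash collisions):

# permutation-invariant key: xor of the vertex hashes, so any ordering
# of the same vertices produces the same key
simplex_key(vertices) = reduce(xor, map(hash, vertices))

simplices = [(3, 1, 2), (5, 4, 1), (2, 6, 3)]
sort!(simplices, by = simplex_key)

# O(log n) lookup regardless of vertex order; `by` is applied to the
# stored simplices, while the query key is passed directly
searchsorted(simplices, simplex_key((1, 2, 3)); by = simplex_key)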

That is a good idea. Make sure that all methods use the TypedEndpointsInterval{L,R,T} types, because intervals might be annotated with corresponding generators. I looked more closely at this package; unfortunately, it provides only operations on intervals, not a collection of them, which is essentially what a persistence diagram or barcode is.

If you need normed metric space geometric point / line / volume data in higher dimensions: my package is meant to handle conformal geometric objects, including points, lines, and volumes in higher dimensions, using explicit SVector types.

As part of my upcoming commits, I am planning to implement a fully generalized interior + exterior product space, which can be used for metric and Hilbert spaces as well.

Don’t know if this would be specifically useful for you, but I’ve also been thinking about a TopologyBase package recently.

I see; this is probably good evidence that you were right in recommending an AbstractSimplex interface, where different algorithms may require a sorted simplex, a simplex sorted backwards, a hashed simplex with no sorting, etc…

At least in my mind, this is actually fine: we should choose a container (I think in your package it is a Dict of vectors; I use a Tuple of vectors; we should probably standardize) that would be used for cocycles, cycles, and intervals. I don’t think this container type should be part of the intervals package. Concerning the typing, I think that if a user passes distances of a given type T, then we should use Interval{:closed, :open, T}.