[Ann] Sparips.jl: Practical sparsification of Rips complexes (topological data analysis)

package
announcement

#1

New package:

This is not yet registered, and a very early version anyway. So you will need to install it by

(v1.0) pkg> add https://github.com/bbrehm/Sparips.jl
(v1.0) pkg> build Sparips

The use and rationale is explained at length in the tutorial and the algorithm is linked in the readme. The gist of it is that we can compute persistent homology for Rips complexes, up to any given precision, in a better complexity class by using a sparsified approximate complex instead (formally: asymptotically exponential -> loglinear; more realistically: bad polynomial -> linear / subquadratic, depending on tradeoffs; this practically speeds up the computation enormously, but we still won’t reach the asymptotic regime for large dimensions or very tight desired precisions).

The package wraps Ulrich Bauer’s ripser for the linear algebra; Sparips.jl does the geometry and plotting only.

Second feature is a sensible visualization of approximate persistence diagrams (because bottleneck distance is not the right concept: an interleaving is characterized by a monotone function, not a number).

People doing manifold learning might be interested in the underlying metric tree (modified cover tree; too slow to be competitive for nearest neighbor search, but much tighter estimates when we need to do expensive operations).

Afaik there is no reasonable alternative software doing this (unsurprising, since the algorithm is new).

Feel free to ask any questions here or on github. I think I will register this in a couple of weeks, once I got some more feedback and cleaned up some internal code.

Related threads:

PS. Any eirene devs hanging out here? Eirene is a pure julia solver for the homology computation, as a possible replacement for the c++ ripser. But I understand neither code nor algorithm, and ripser is at least relatively simple (and fast).


#2

I think that no one (besides the author, and maybe not even him, right now;-) understands eirene code. It’s a convoluted, unstructured mess (10kloc??) with no data structures and functions with tens of arguments. Which frankly speaking doesn’t build my confidence in its correctness. Neither does the fact that its documentation is a single, 5-screens high PNG(!!!) file as if the author didn’t want the software to be found.

Understanding ripser turned out a much easier task for me.


#3

Yep, that was also my impression or eirene. Adding to this, eirene implements a completely different matroid-theoretical algorithm, that I just don’t understand. Ripser is not just short, but the algorithm is straight forward (and there is only so much mess you can make in <1k loc). Gudhi is a well-structured comparatively large C++ template library. As far as I understand, these are the only three public competitive solvers, and it depends on the data which one wins out.

I’ll probably wrap gudhi as well. It is unfortunate that there is neither C nor command-line interface; I’d much prefer to just read what pointers I need to pass where to understanding the mountain of abstraction in gudhi. But it appears that gudhi does implement the Sheehy sparsification scheme, so it will be interesting to compare.


#4

Interesting paper!
I wanted to point out that there also is https://github.com/wildart/ComputationalHomology.jl (as well as https://github.com/wildart/TDA.jl to compute barcodes) to compute the homology of a simplicial complex though I don’t know how it compares in terms of features / performance with either eirene or ripser.

Concerning the wrapping of the ripser library, I was wondering whether it would be possible to “widen” the signature of the runrips function that’s used to call ripser: right now it only accepts a specific type SparseEdges whereas in theory (as far as I understand) one could allow any Symmetric matrix as input.


#5

Interesting paper!

Thanks!

whereas in theory (as far as I understand) one could allow any Symmetric matrix as input.

In theory yes. But the entire point is to compute a sparse matrix with approximation guarantees, plug that into ripser, and track the approximation guarantees of the resulting barcodes (have error bars on the bars).

Sparsity is key here (brings you into a more sensible complexity class). Back then, ripser did not accept sparse inputs, which is why sparips.jl ships a modified version.

Also, back then I was more optimistic about the capabilities of existing solvers. Realistically, starting from a distance matrix (already N^2 entries) and feeding a sparsified version into ripser would have been acceptable, simpler and maybe even faster (staying in loglinear complexity was very important to me, until I experienced just how memory-hungry existing solvers are, even with sparsification; hence, N cannot be large, anyway).

Uli Bauer plans (or already pushed?) a variant of ripser that accepts sparse inputs; then I’ll scrap my patches, and use the upstream version. Maybe I’ll use that as an opportunity to write a better wrapper (using ccall and binary data, instead of ascii via command-line; but upstream does not maintain a C api; same issue applies to gudhi, which only supports python or C++ template hell).

But sure, I can push a version that accepts a Vector{Tuple{<:Integer, <:Integer, <:AbstractFloat}} of entries as input (then without tracking of interleaving bounds).


#6

I now understand a bit better. Either that or a SimpleWeightedGraph would be a nice way to send distances to ripser (with the intuition that if there’s no edge between a and b then the distance is infinite). My use case is more pedagogical, i.e. starting to play with persistent homology without having to worry about sparsification and then learn sparsification when scaling to real datasets.


#7

For very small datasets, you can use 1.0 or 1.0f0 as precision; then you get no sparsification and hence the full thing (use the same float-type as your data/metric).

I take the distance as a subtype of <: Distances.Metric from Distances.jl. You can pass a closure:

using Distances
pointT = Int64
struct DMat{fT} <: Distances.Metric
    vals::Matrix{fT}
end
Distances.evaluate(m::DMat, i, j) = m.vals[i,j]
distance_matrix::Matrix{<:AbstractFloat}
@assert size(distance_matrix, 1)== size(distance_matrix, 2)
N = size(distance_matrix,1)
data = collect(1:N)
dist = DMat(distance_matrix)

While I do not particularly like the Distances.jl interface, it is definitely OK and it is the standard in the ecosystem (I would have preferred callable type <:Function, for pedagogical reasons).

As I said, the focus in sparips is not so much the wrapper around ripser as the computation of the sparsification: In goes the number of points plus an oracle for computing distances between points, out goes a simple weighted graph that approximates the topology of your original point cloud, while making only log-linear many oracle-queries. Then that gets fed into ripser or you favorite solver (I am currently writing one, because ripser eats too much memory).


#8

I can’t belive I missed this thread since this is exactly the kind of stuff I’m interested in!

I’ve been slowly working on my own ripser wrapper (found here, binaries here), but it has no features and only handles calling the C++ code.
I haven’t announced about it yet because I have some problems I want to debug. I’m getting segfaults when I build it with clang and ReadOnlyMemoryErrors when running it on 32-bit Windows systems.

I think you might find it interesting because it uses the ripser C interface that was introduced in this python wrapper. It also supports sparse inputs and returning representative cocycles.

I’d love to help you with your project. Let me know if you need some help. In the meantime, I’ll try to familiarize myself with your code.


#9

By the way, are you aware of flagser? It’s also an extension of ripser that does some approximations, but written in C++.
I don’t know exactly how it works but AFAICT it serves a similar purpose as your project.


#10

I haven’t announced about it yet because I have some problems I want to debug. I’m getting segfaults when I build it with clang and ReadOnlyMemoryError s when running it on 32-bit Windows systems.

This is always painful to debug, but for issues with the binaries created with BinaryBuilder the best place to ask may be the BinDeps2 slack channel.

Concerning the ripser package, I wanted to ask: is it easy, given the representative cocycles, to find representative cycles? EDIT: it seems most papers on persistent homology focus on the case where coefficients are in a field (say \mathbb F_2), in which case it should be trivial to go from homology to cohomology.


#11

Concerning the representative cocycles, there may be a “reshaping” problem: if I understand correctly, each cocycle is given by the Julia wrapper as a vector of integers, which, if the cocycle is on edges for example, is to be interpreted as follows:

[src, target, value, src, target, value, ...]

(where src and target are the extrema of the various edges on which the cocycle is non-zero). The python wrapper reshapes it to have dimension Nx3 in this case, I’m not sure what’s the ideal solution.

EDIT: I’m also not 100% sure why no 0-cocycles are given: I understand that they are simply one per point of the data but still it could be useful to have them to know which points “live longer” in the barcode.


#12

Actually, I’m not so sure about using the cocycles because I’m still learning this. I opted for just returning them in a long vector and leaving cutting them up to the user.

And yeah, ripser calculates the persistent homology in \mathbb Z_p where p is the value of the modulus parameter, so extracting cycles should be possible.

Thanks about the slack tip, I think I’ll do that, even though I also get segfaults with clang on my (Linux) machine.


#13

I was not so aware of the scikit project. Sounds nice, maybe sparips should use their C bindings.

By the way, are you aware of flagser?

Yes. But if I understood it correctly, it does nothing for the problem of bad complexity class (however, if there are flagser features that people need, then it should be easy to also wrap that).

For Betti numbers, switching is trivial, yes. For cycle/cocycle representatives, switching is nontrivial. Integral persistent homology is much harder, because we lose the classification theorem for indecomposable persistence modules. I’d love to see a solver for that, though.

Cool! First helpful thing would be if you shoot away with questions. That can help me see where I messed up API decisions.


#14

From what I understand it lets you skip some of the computation and is supposed to be suitable for using on large inputs. It also apparently parallelizes complex construction. I heard the author used it to analyze the data from the Blue Brain Project.
Not sure if it really changes the complexity class, but I thought it might be interesting, especially since you can probably read C++ better than me. However, I personally don’t really need it.

If you want to use the ripser.py bindings, you can use my code, but I/we have to figure out the build problems first.

I’ll go through your code and try it out and I’ll leave you an issue if I run into anything unexpected. :grin:


#15

I see. I’m actually interested in also having access to the cocycles, it’d be nice to figure out what’s the best format to return them. One other thing that may be worth figuring out is whether the Julia wrapper should fix the “0-based indexing”, in that in the long vector representing a cocycle, 0, 10 for example refers to the edge between something that intuitively is points 1 and 11 (their distance is M[1, 11] in the matrix passed as argument to ripser).


#16

I think ComputationalHomology works for PID with reasonable generality (see here for a discussion of what methods are needed for it to work on a given type).

I’ve implemented persistent cohomology (assuming coefficients in a field) following Persistent Cohomology and Circular Coordinates - Vin de Silva, Dmitriy Morozov, Mikael Vejdemo-Johansson as a pure Julia package: https://github.com/piever/PersistentCohomology.jl.

I’m not sure if it can be a replacement for ripser in terms of performance (I have not benchmarked but I tried to be somewhat careful about performance) but on the plus side it supports sparse matrices out of the box (it actually only supports sparse matrices at the moment, somehow I don’t think the full matrix case is computationally feasible for a reasonably sized dataset).


#17

Oh, I totally missed the 0-based indexing! I think one approach would be to add 1 to all indices and return cocycles as vectors of tuples of Vector{Int}, Int. Another one would be to create a struct, similar to how you did it in your package.

I can add this to my project, but I think in the long term, it’s better to merge it into foobar’s project, which is way more polished.


#18

Cool! Is @wildart, the author of ComputationalHomology.jl hanging out here? Does this support CW-inputs? (input: list of cells with dimension, sparse homology boundary matrix; complex is really CW and not simplicial / cubical / etc)

So, is it time for a julia computational topology org? That one could help consolidate data structures / formats, plotting, wrappers around external libraries like ripser / gudhi and generally make everything more modular. For example, receiving input metric spaces, generating sparse matrices (in case of sparips, with user-supplied error tolerances), computing homology/cohomology, visualizing output, and analyzing output (bottleneck distances, statistical analysis, whatever) want to be separate packages with clean interfaces that can be swapped out for each other.

Thanks for that! But it is not really super polished, and more of a proof-of-concept for the new algorithms.

There are so many warts that want to be fixed, e.g.:

What moron decided to stay with the initial point permutation, instead of choosing one of the many cache-oblivious / metric-locality aware memory layouts?

What moron decided to support error tolerances that are supplied by evaluation-oracle, without users supplying oracles for inverse and some more transforms? Especially the latter bites me in so many unforseen ways that it’s not even funny; yes, it is possible and very convenient for users; no, it was not a good idea in retrospect.


#19

Definitely: the first step is probably to set a common interface for things to play nice with each other.

In PersistentCohomology I reimplemented Vietoris Rips both as a learning exercise and to be sure it was exactly the way I wanted, but I’d be happy to out-source that part (I would also like to outsource plotting and working with barcodes / persistent diagrams in general).

In terms of interface, so far I think that a simplex is something that supports the following interface: a face(s::S, i) function that returns the “face” simplex and a nv(::Type{S}) function that can tell the number of vertices of the simplex from the type (this allows for optimizations with generated functions). In my package I have implemented the interface for tuples, but it’s quite easy to implement for a custom type if we want to. I also think it’s important to have the concept of a Cochain (see docstring for explanation) in a way that the value on each simplex can be accessed efficiently with getindex, say cochain[simplex].

For the overall structure of a simplicial complex I was thinking of simply a tuple where the first term correspond to the 0-simplices and so on. In PersistentCohomology I use a Float64-valued (or any other float type) Cochain to encode both the simplices and the diameter. It’s maybe not conceptually very clean but quite handy.


#20

I’m very much in favor of that. I was actually thinking of proposing it myself, but I held back because I’m currently in the middle of an exam period… :grin:

@piever I think that’s a good start. IMO, we should keep everything as abstract as possible let people attach data to their simplices, barcodes etc.

Should we keep everything in some kind of TopologyBase or is it better to split things into smaller units?