SimilaritySearch.jl: A package for searching approximate nearest neighbors efficiently

Hello everybody,

We are glad to announce the SimilaritySearch.jl package. It uses a graph index and search heuristics (beam search) to solve queries fast and accurately; the current version auto-tunes itself for a minimum target quality or for a tradeoff between speed, accuracy, memory, and construction time (all of these are competing objectives in this kind of algorithm). It has some level of multithreading support (searching and construction), and it achieves performance very close to state-of-the-art methods (or even better). Moreover, it can use any SemiMetric defined in Distances.jl, and it runs at full speed with your own distance definitions.

I also want to mention that the package comes with several distance function definitions, such as the Levenshtein distance, distances between sets represented as ordered lists, a bit-based Hamming distance, the Hausdorff distance, and some dense-vector distance functions (a few of these overlap with Distances.jl, so some refactoring is needed).
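To give a flavor of the kind of distance functions mentioned above, here is a classic dynamic-programming Levenshtein distance in plain Julia. This is an illustrative sketch only, not the package's (optimized) implementation:

```julia
# Two-row dynamic-programming Levenshtein (edit) distance.
# Illustrative only; SimilaritySearch.jl ships its own implementation.
function levenshtein(a::AbstractString, b::AbstractString)
    m, n = length(a), length(b)
    m == 0 && return n
    n == 0 && return m
    prev = collect(0:n)          # row for the empty prefix of `a`
    curr = similar(prev)
    for (i, ca) in enumerate(a)
        curr[1] = i
        for (j, cb) in enumerate(b)
            cost = ca == cb ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,  # deletion
                              curr[j] + 1,      # insertion
                              prev[j] + cost)   # substitution
        end
        prev, curr = curr, prev
    end
    return prev[n + 1]
end

levenshtein("kitten", "sitting")  # → 3
```

Any function with this shape (two objects in, a nonnegative number out) can act as a SemiMetric-style distance for searching.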

The documentation is still in progress, but we have an arXiv paper with benchmarks and a repository with several demonstrations if you are interested in examples of how to use it. Please note that these demonstrations also use a modified version of the UMAP.jl package. You can find the modified UMAP.jl here.

We will be happy if this package helps you in any way. Contributions are also welcome.

Regards,
Eric


Dear all,

We are glad to report that we have improved the SimilaritySearch package’s documentation. We have also built a site with examples and demos.

https://sadit.github.io/SimilaritySearchDemos/

You can find some basic examples as well as more elaborate UMAP visualizations. We have also added some examples with the ManifoldLearning package.

In particular, we added several Pluto notebooks and static HTML previews that can run on mybinder.
Some examples may be too costly to run on single-threaded systems, but they will run anyway. Please note that some are better run locally, since they download large datasets that may take a long time. In any case, please be patient.

If you like the package and find it helpful, please be kind and star the GitHub repository.

If you find any issue or have a request, please file an issue or comment on GitHub.

Kind regards,
Eric


Looks nice! Could you (or @kristoffer.carlsson ) contrast this package with NearestNeighbors.jl?

Hi @TheLateKronos

The SearchGraph index structure in SimilaritySearch is an approximate similarity search method, in the same niche as HNSW (HNSW.jl in Julia), pynndescent (NearestNeighborDescent.jl in Julia), or ScaNN. It focuses on medium-sized datasets (10^5 to 10^8 objects) with high dimensionality. Like these alternatives, it ideally works with any distance function, and it improves on them in aspects like memory usage, construction time, and its easy online auto-tuning feature.

I haven’t benchmarked against NearestNeighbors yet. However, the quick answer is that KDTree and BallTree will perform better for low-dimensional datasets. In particular, KDTrees exploit the structure of the data (the vectors’ coordinates), and therefore they work better for certain distance functions (Minkowski ones). The BallTree works for general metric spaces, but also mainly in low dimensions. Both indeed have faster construction and lower memory requirements.
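A quick way to see why tree indexes degrade in high dimensions is the concentration of distances: as dimensionality grows, pairwise distances become nearly indistinguishable (their relative spread shrinks), so pruning rules stop working. A small self-contained sketch:

```julia
using Random, Statistics, LinearAlgebra

# Relative spread (std/mean) of pairwise Euclidean distances among n random
# points in [0,1]^dim. As dim grows this ratio shrinks toward zero, which is
# what hurts KD-trees and ball trees and motivates approximate graph indexes.
function concentration(dim; n=300, rng=MersenneTwister(0))
    X = rand(rng, dim, n)
    d = [norm(X[:, i] - X[:, j]) for i in 1:n for j in i+1:n]
    return std(d) / mean(d)
end

concentration(2)    # large relative spread: distances are informative
concentration(512)  # small relative spread: distances concentrate
```

The exact crossover point is data-dependent, which is why the rules of thumb below are stated with a "to be proved" caveat.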

The rules of thumb (to be proved…) should be:

  • If you have low dimensional data, use NearestNeighbors.
  • If you have high dimensional data and can afford the approximation issue, use SimilaritySearch.
  • For unusual distance functions, you may want SimilaritySearch, since it can indeed work faster when the intrinsic dimensionality is high.

Some nice features to take into account:

  • SearchGraph supports multithreading; it works pretty fast for large datasets.
  • In contrast with the original (Python) NNDescent and HNSW, SearchGraph is written in Julia, and therefore user-defined distance functions can run quite fast.
  • It has an online auto-tuning feature: you ask for some quality, and it will try to adjust to it; of course, there is always a cost in some performance metric.
  • Computing the all-KNN graph is quite fast: it supports multithreading and is optimized for doing it quickly. This feature is designed for UMAP, ManifoldLearning techniques, etc.
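To make the last point concrete, this is what an all-KNN graph computation produces, shown here as a brute-force baseline in plain Julia (the index computes an approximation of this much faster and in parallel; the function name below is mine, not the package's API):

```julia
using LinearAlgebra

# Brute-force all-k-nearest-neighbors: for every column of X, the indices of
# its k closest *other* columns. This O(n^2) baseline is exactly the output
# that UMAP/ManifoldLearning pipelines need and that a graph index
# approximates quickly.
function allknn_bruteforce(X::AbstractMatrix, k::Int)
    n = size(X, 2)
    ids = Matrix{Int}(undef, k, n)
    for i in 1:n
        d = [j == i ? Inf : norm(X[:, i] - X[:, j]) for j in 1:n]
        ids[:, i] = partialsortperm(d, 1:k)
    end
    return ids
end
```

For n points, the baseline costs n(n-1)/2 distance evaluations; the approximate version trades a little recall for a large constant-factor and asymptotic speedup.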

Regards,
Eric


This looks really useful for my work! Thank you so much for providing it (and documentation!)

Is it possible to update the graph at all, or is it constructed once per use?


SearchGraph supports insertions via push! (one item) and append! (many objects at once, taking advantage of multithreading); the structure is built incrementally.
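The general idea behind this kind of incremental construction can be sketched in a few lines: inserting a point means finding its nearest already-indexed points and linking them both ways. This toy version uses brute-force neighbor lookup (a real index like SearchGraph finds the neighbors approximately by navigating the graph itself); the type and method names are mine, not the package's:

```julia
using LinearAlgebra

# Toy incremental kNN graph: each inserted point is connected to its k
# nearest already-indexed points, with back-links so the graph stays
# navigable. Illustrative only; not SearchGraph's actual algorithm.
struct KnnGraph
    points::Vector{Vector{Float64}}
    neighbors::Vector{Vector{Int}}
    k::Int
end
KnnGraph(k::Int) = KnnGraph(Vector{Float64}[], Vector{Int}[], k)

function Base.push!(g::KnnGraph, p::Vector{Float64})
    d = Float64[norm(p - q) for q in g.points]
    nn = sortperm(d)[1:min(g.k, length(d))]   # brute force here
    push!(g.points, p)
    push!(g.neighbors, nn)
    for j in nn                                # link back to the new node
        push!(g.neighbors[j], length(g.points))
    end
    return g
end
```

Deletion is harder precisely because removing a node breaks these links and can disconnect search paths, which matches the caveat below.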

Currently, there is no support for deletions or updates. Nonetheless, update operations can be implemented directly if the modification has little impact on the neighborhood of the modified object; clearly, this is an application- and domain-specific treatment. General updates and deletions are tricky since they may degrade result quality if applied frequently.

Very interesting package; thanks a lot for sharing it! I wonder how much faster it could go if, instead of computing exact distances as the fitness, it leveraged approximate distances with ADC, as in a PQ linear scan, along the lines of [2203.02505] ARM 4-BIT PQ: SIMD-based Acceleration for Approximate Nearest Neighbor Search on ARM. Julia seems an excellent language for building an approximate nearest neighbor search toolbox.
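For readers unfamiliar with ADC (asymmetric distance computation): database vectors are split into subvectors and each subvector is encoded as the id of its nearest codebook centroid; a query then precomputes a lookup table of squared distances to every centroid, so comparing against an encoded vector reduces to table lookups and additions. A minimal sketch (hypothetical names, codebooks assumed already trained; real implementations learn them with k-means and vectorize the scan):

```julia
using LinearAlgebra

# Product quantization with asymmetric distance computation (ADC).
struct PQ
    codebooks::Vector{Matrix{Float64}}  # one (subdim × ncentroids) codebook per subspace
    subdim::Int
end

# Encode a vector as one centroid id per subspace.
encode(pq::PQ, x) =
    [argmin([norm(x[(s-1)*pq.subdim+1:s*pq.subdim] - C[:, c]) for c in 1:size(C, 2)])
     for (s, C) in enumerate(pq.codebooks)]

# Per-subspace lookup table: squared distance from the query to each centroid.
lut(pq::PQ, q) =
    [[sum(abs2, q[(s-1)*pq.subdim+1:s*pq.subdim] - C[:, c]) for c in 1:size(C, 2)]
     for (s, C) in enumerate(pq.codebooks)]

# Approximate squared distance: just k table lookups and additions per vector.
adc(L, code) = sum(L[s][code[s]] for s in eachindex(code))
```

The speedup comes from replacing a full d-dimensional distance computation per database vector with a handful of lookups, at the cost of quantization error.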
