SimilaritySearch.jl: A package for searching approximate nearest neighbors efficiently

Eric_Sadit_Tellez · February 9, 2022, 1:42am

Hello everybody,

We are glad to announce the package SimilaritySearch.jl package. The SimilaritySearch.jl package uses a graph index and search heuristics (beam search) to solve queries fast and accurately; the current version is auto-tuned for some minimum quality or for achieving a tradeoff between speed, accuracy, memory, and construction time (all of these are opposite objectives in this kind of algorithms). It has some level of multithreading support (searching and construction). Also, it achieves performances very close to state-of-the-art methods (or even be. Yet, it can use any SemiMetric defined in Distances.jl and the full speed with your distance definitions.

I also want to mention that the package comes with some distance functions definitions like Levenshtein distances, distances between sets defined as ordered lists, bit-based hamming distance, Hausdorff distance, and some dense-vector based distance functions (some of them are repeated with Distances.jl, some refactoring is needed).

The documentation is still in progress, but we have an arxiv paper with benchmarks and a repository with several demonstrations if you are interested in some examples of how to use it. Please consider that these demonstrations also uses a modified version of the UMAP.jl package. You can find the modified UMAP.jl here.

We will be happy if this package helps you in any way. Contributions are also welcome.

Regards,
Eric

Eric_Sadit_Tellez · March 11, 2022, 11:15pm

Dear all,

We are glad to comment that we have improved the SimilaritySearch package’s documentation. We also worked on a site for examples and demos.

https://sadit.github.io/SimilaritySearchDemos/

You can find some basic examples and other more elaborate UMAP visualizations. We had also added some examples with the ManifoldLearning package.

In particular, we added several Pluto notebooks and static HTML previews that can run in mybinder.
Some examples are perhaps too costly for running in single-threaded systems, but they will run anyway. Please note that some are better run in a local system since they will download large datasets that may take too long. In any case, please be patient.

If you like the package and find it helpful, please be kind and put a star on Github’s repository.

If you find any issue or request, please file an issue or comment on Github.

Kind regards,
Eric

TheLateKronos · March 12, 2022, 11:14pm

Looks nice! Could you (or @kristoffer.carlsson ) contrast this package with NearestNeighbors.jl?

Eric_Sadit_Tellez · March 13, 2022, 4:49pm

Hi @TheLateKronos

The SearchGraph index structure in SimilaritySearch is an approximate similarity search method, in the same niche that the HNSW/HNSW.jl and the (pyNNDescent)[PyNNDescent API Guide — pynndescent 0.5.0 documentation][NNDescent.jl]/(https://github.com/dillondaudert/NearestNeighborDescent.jl) or SCANN. Focused on medium sizes (10^5-10^8) datasets with high dimensionality. As these alternatives, ideally, it works with any distance function and improves these in aspects like memory and construction time and the easy inline auto-tuning feature.

I didn’t benchmark yet against NearestNeighbors. However, the rapid answer is that KDTree and Balltree will perform better for lower dimensional datasets. Mainly, KDTrees uses the structure of the data (vector’s coordinates), and therefore it works better for certain distance functions (Minkowski ones). The Balltree works for general metrics spaces but also low dimensions. Both indeed have faster constructions and lower memory requirements.

The rules of thumb (to be proved…) should be:

If you have low dimensional data, use NearestNeighbors.
If you have high dimensional data and can afford the approximation issue, use SimilaritySearch.
For weird distance functions, perhaps you want to use SimilaritySearch since it indeed can work faster if the intrinsic dimensionality is high

Some nice features to take into account.

SearchGraph supports multithreading; it works pretty fast for large datasets
In contrast with the original (python) NNDescent and HNSW, SearchGraph is made in Julia, and therefore user-defined distance functions can run quite fast
It has an online auto-tuning feature so that you ask for some quality, and it will try to adjust to it; of course, there is always a cost of some performance marker.
Computing the all knn graph is quite fast, it supports multithreading, and also it is optimized for doing it quickly. This feature is designed for UMAP and ManifoldLearning techniques etc.

Regards,
Eric

JAgho · March 13, 2022, 5:46pm

This looks really useful for my work! Thank you so much for providing it (and documentation!)

Is it possible to update the graph at all, or is it constructed once per use?

Eric_Sadit_Tellez · March 13, 2022, 6:45pm

SearchGraph supports insertions via push! (one item) and append! (many objects at once, taking advantage of multithreading). The structure has an incremental construction.

Currently, there is no support for deletion or updates. Nonetheless, update operations can be implemented directly if the modification has little impact on the neighborhood of the modified object; clearly, the latter is a kind of application and domain-specific treat. General updates and deletions are tricky since they may impact results quality if applied frequently.

davidbp · March 13, 2022, 8:59pm

Very interesting package. Thanks a lot for sharing this! I wonder how much faster it could go if instead of computing the exact distances as fitness it could leverage approximated distances with ADC like in PQLinearScann, along the lines of [2203.02505] ARM 4-BIT PQ: SIMD-based Acceleration for Approximate Nearest Neighbor Search on ARM . Julia seems an excellent language to build an Approximate Nearest Neighbour Search toolbox.

Eric_Sadit_Tellez · July 5, 2022, 4:53pm

Happy to share that our JOSS paper about the SimilaritySearch.jl package has been accepted for publication. It covers the last version (v0.9.3) with some speedups compared with v0.8 due to several minor improvements and the use of the Polyester.jl package instead of plain Threads. We also updated documentation and examples https://sadit.github.io/SimilaritySearchDemos/.

https://twitter.com/JOSS_TheOJ/status/1544336092798668800?s=20&t=vPnEnaTafp0hwWB4B-dNLQ

jbytecode · July 5, 2022, 6:55pm

congrats!

Topic		Replies	Views
[ANN] Faiss.jl, similarity search General Usage package , announcement	18	1769	March 27, 2022
[ANN] SimSearchManifoldLearning.jl: Non-linear dimensional reduction using SimilaritySearch (ManifoldLearning and UMAP) Package Announcements	0	346	March 24, 2022
Alternative packages for nearest neighbor search General Usage package , alternatives	2	109	July 9, 2025
Neighborhood.jl: Unified API for finding nearest neighbors in Julia Package Announcements package	15	2400	February 13, 2021
[ANN] VPTrees.jl - nearest neighbor search in metric spaces Package Announcements	20	2688	January 30, 2020

SimilaritySearch.jl: A package for searching approximate nearest neighbors efficiently

Related topics