We are glad to announce the package SimilaritySearch.jl package. The SimilaritySearch.jl package uses a graph index and search heuristics (beam search) to solve queries fast and accurately; the current version is auto-tuned for some minimum quality or for achieving a tradeoff between speed, accuracy, memory, and construction time (all of these are opposite objectives in this kind of algorithms). It has some level of multithreading support (searching and construction). Also, it achieves performances very close to state-of-the-art methods (or even be. Yet, it can use any SemiMetric defined in Distances.jl and the full speed with your distance definitions.
I also want to mention that the package comes with some distance functions definitions like Levenshtein distances, distances between sets defined as ordered lists, bit-based hamming distance, Hausdorff distance, and some dense-vector based distance functions (some of them are repeated with Distances.jl, some refactoring is needed).
The documentation is still in progress, but we have an arxiv paper with benchmarks and a repository with several demonstrations if you are interested in some examples of how to use it. Please consider that these demonstrations also uses a modified version of the UMAP.jl package. You can find the modified UMAP.jl here.
We will be happy if this package helps you in any way. Contributions are also welcome.
We are glad to comment that we have improved the SimilaritySearch package’s documentation. We also worked on a site for examples and demos.
You can find some basic examples and other more elaborate UMAP visualizations. We had also added some examples with the
In particular, we added several Pluto notebooks and static HTML previews that can run in mybinder.
Some examples are perhaps too costly for running in single-threaded systems, but they will run anyway. Please note that some are better run in a local system since they will download large datasets that may take too long. In any case, please be patient.
If you like the package and find it helpful, please be kind and put a star on Github’s repository.
If you find any issue or request, please file an issue or comment on Github.
Looks nice! Could you (or @kristoffer.carlsson ) contrast this package with NearestNeighbors.jl?
SearchGraph index structure in
SimilaritySearch is an approximate similarity search method, in the same niche that the HNSW/HNSW.jl and the (pyNNDescent)[PyNNDescent API Guide — pynndescent 0.5.0 documentation][NNDescent.jl]/(https://github.com/dillondaudert/NearestNeighborDescent.jl) or SCANN. Focused on medium sizes (10^5-10^8) datasets with high dimensionality. As these alternatives, ideally, it works with any distance function and improves these in aspects like memory and construction time and the easy inline auto-tuning feature.
I didn’t benchmark yet against
NearestNeighbors. However, the rapid answer is that
Balltree will perform better for lower dimensional datasets. Mainly, KDTrees uses the structure of the data (vector’s coordinates), and therefore it works better for certain distance functions (Minkowski ones). The Balltree works for general metrics spaces but also low dimensions. Both indeed have faster constructions and lower memory requirements.
The rules of thumb (to be proved…) should be:
- If you have low dimensional data, use
- If you have high dimensional data and can afford the approximation issue, use
- For weird distance functions, perhaps you want to use
SimilaritySearch since it indeed can work faster if the intrinsic dimensionality is high
Some nice features to take into account.
SearchGraph supports multithreading; it works pretty fast for large datasets
- In contrast with the original (python)
SearchGraph is made in Julia, and therefore user-defined distance functions can run quite fast
- It has an online auto-tuning feature so that you ask for some quality, and it will try to adjust to it; of course, there is always a cost of some performance marker.
- Computing the all knn graph is quite fast, it supports multithreading, and also it is optimized for doing it quickly. This feature is designed for UMAP and ManifoldLearning techniques etc.
This looks really useful for my work! Thank you so much for providing it (and documentation!)
Is it possible to update the graph at all, or is it constructed once per use?
SearchGraph supports insertions via
push! (one item) and
append! (many objects at once, taking advantage of multithreading). The structure has an incremental construction.
Currently, there is no support for deletion or updates. Nonetheless, update operations can be implemented directly if the modification has little impact on the neighborhood of the modified object; clearly, the latter is a kind of application and domain-specific treat. General updates and deletions are tricky since they may impact results quality if applied frequently.
Very interesting package. Thanks a lot for sharing this! I wonder how much faster it could go if instead of computing the exact distances as fitness it could leverage approximated distances with ADC like in PQLinearScann, along the lines of [2203.02505] ARM 4-BIT PQ: SIMD-based Acceleration for Approximate Nearest Neighbor Search on ARM . Julia seems an excellent language to build an Approximate Nearest Neighbour Search toolbox.
Happy to share that our JOSS paper about the
SimilaritySearch.jl package has been accepted for publication. It covers the last version (v0.9.3) with some speedups compared with
v0.8 due to several minor improvements and the use of the
Polyester.jl package instead of plain
Threads. We also updated documentation and examples https://sadit.github.io/SimilaritySearchDemos/.