[ANN] Faiss.jl, similarity search

zsz00/Faiss.jl: Julia wrapper around the Faiss library for similarity search with PythonCall.jl (github.com)

A simple Julia wrapper around the Faiss library for similarity search with PythonCall.jl .

While functional and faster then NearestNeighbors.jl .

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed primarily at Facebook AI Research.


Sounds cool! Do you have any benchmarks showing the performance difference between Faiss and NearestNeighbors?

1 Like

A simple test comparison :

ENV["JULIA_PYTHONCALL_EXE"] = "/home/xx/miniconda3/bin/python"
using Faiss
using NearestNeighbors

function faiss_test_1()
    feats = rand(10^6, 128);
    top_k = 100
    vs_gallery = feats;
    vs_query = feats[1:10^4, :];

    D, I = local_rank(vs_query, vs_gallery, k=top_k, metric="IP", gpus="")


function nn_test()
    data = rand(128, 10^6)
    k = 100
    query = rand(128, 10^4)

    kdtree = KDTree(data)
    idxs, dists = knn(kdtree, query, k, true)

@time faiss_test_1()
@time nn_test()

used time:
28.934969 seconds (4.76 M allocations: 1.699 GiB, 0.14% gc time, 8.35% compilation time)
1197.160527 seconds (1.26 M allocations: 2.981 GiB, 0.02% gc time, 0.12% compilation time)

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)
  JULIA_PYTHONCALL_EXE = /home/xx/miniconda3/bin/python

ANN-Benchmarks is a Comprehensive benchmarking, but there is no NearestNeighbors.jl

A simple analogy:

NearestNeighbors.jl vs Faiss such as Flux.jl vs PyTorch.

I am not sure that I entirely agree here. Your benchmark above suggests that Faiss is about 40x faster than NearestNeighbors, while some benchmarks here: Flux – Accelerating Flux.jl with PyTorch kernels suggest that Torch is about 2-3x faster than Flux.

Also, I am not entirely convinced that your benchmark is fair. My understanding is that NearestNeighbors doesn’t use an approximate algorithm and that threading is not implemented yet. Wouldn’t both of those things be turned on by default in Faiss?

Putting aside that small quibble, I definitely think that what you have done here is extremely valuable and could be a massive boost for the Julia ML community. I can definitely see a future where NearestNeighbors can be pointed to a Faiss backend, or where Julians directly use your Faiss wrapper for similarity search. Plus, having Faiss so easily accessible in Julia would help Julia developers benchmark against the state of the art. That way people could make informed decisions on whether it is better to continue developing nearest neighbor algorithms in Julia, or whether we should just go with a wrapper and redirect Julia developer time elsewhere.


Ideally, you’d skip Python and call Faiss’s underlying C API directly (building a Faiss binary with Yggdrasil), no?


That would be best, but I don’t know how to do it.
Currently this way of wrapping Python is the simplest and compatible with the Python interface

Yes, NN can only use single thread, faiss automatically uses multiple threads.

One of the reasons for not making a strict comparison is that NN has too few features, does not support multithreading, does not support GPU, and does not support various indexes .
Dynamic index and add data are not supported.

My tests were rough, but from experience, the above tests didn’t have much of a problem.

Plus, you can run your own comparison test.

One could test Faiss against IVFADC.jl which uses ANN search (hierarchical indexing) and product quantization for vector residuals.

Having a vector search algorithm single-threaded and CPU-only can also be an extremely powerful feature: simpler implementation, easier to reason about, no GPU dependency (good for low-resource devices), parallelization at a higher abstraction level (i.e. process), which may be desirable in environments where inbound requests trigger the search process.

Another thing to consider is, the scaling of search algorithms with data dimension. (both nominal and effective)
Here’s another ANN library that I implemented a while back:

See my 5-minute old PR:
Add multithreading support by KronosTheLate · Pull Request #143 · KristofferC/NearestNeighbors.jl (github.com)

Recently there was this package announced as well:

I’ve been using this library for a while … *And then I have to use faiss again… so i do the faiss.jl
I do large-scale high-dimensional data clustering


Hello @zsz00, I worked on standardizing the API, several performance improvements, and an auto-tuning feature. In any case, Faiss is great when you can use GPUs or have a huge database (applying PQ).

In MHO, the SimilaritySearch package should be used on medium-sized datasets (e.g., a few million?) with high dimensionality. Among its nice features: it is written in pure Julia and supports many distance functions; it is multithreaded and has a novel auto-configuration feature.

I made a few changes in the benchmark to make it work with SimilaritySearch (quite long, indeed, there will be some bugs)

using Pkg, Faiss, SimilaritySearch, LinearAlgebra, JLD2

function create_benchmark(dim, n, m)
    X = rand(Float32, dim, n)
    Q = rand(Float32, dim, m)
    for c in eachcol(X) normalize!(c) end
    for c in eachcol(Q) normalize!(c) end
    X, Q

function create_gold(X, Q, k, dist)
    db = MatrixDatabase(X)
    queries = MatrixDatabase(Q)
    ex = ExhaustiveSearch(; db, dist)
    searchbatch(ex, queries, k; parallel=true)

function simsearch_test(X, Q, k, dist)
     db = MatrixDatabase(X)
     k = 100
     queries = MatrixDatabase(Q)
     G = SearchGraph(; db, dist)
     index!(G; parallel_block=512)
     searchbatch(G, queries, k; parallel=true)

function faiss_index(X, Q, k)
	dim, n = size(X)
	str = "HNSW,Flat"
	# str = "Flat"
	metric = "IP"
	gpus = ""
	idx = Faiss.Index(dim; str, metric, gpus)  # init Faiss Index
	show(idx)   # show idx info
	add(idx, permutedims(X))
	D, I = Faiss.search(idx, permutedims(Q), k)
	I, D = permutedims(I), permutedims(D)
	I .+= 1
	I, D

function faiss_test(X, Q, k)
     vs_gallery = permutedims(X)
     vs_query = permutedims(Q)
     D, I = local_rank(vs_query, vs_gallery, k=k, metric="IP", gpus="")
     I, D = permutedims(I), permutedims(D)
	 I .+= 1
	 I, D

function main()
	dim = 128
	k = 100
	dbfile = "benchmark-$dim.jld2"
	goldfile = "gold-$dim.jld2"
	faissfile = "faiss-$dim.jld2"
	faissindexfile = "faissindex-$dim.jld2"
	simfile = "sim-$dim.jld2"
	dist = NormalizedCosineDistance()

	X, Q = if isfile(dbfile)
		load(dbfile, "X", "Q")
		X, Q = create_benchmark(dim, 10^6, 10^3)
		jldsave(dbfile, X=X, Q=Q)
		X, Q

	@info "== gold"
	gI, gD, exhaustivetime = if isfile(goldfile)
		load(goldfile, "gI", "gD", "exhaustivetime")
		exhaustivetime = @elapsed gI, gD = create_gold(X, Q, k, dist)
		jldsave(goldfile; gI, gD, exhaustivetime)
		gI, gD, exhaustivetime
	@show exhaustivetime
	@info "== faiss"
	fI, fD, faisstime = if isfile(faissfile)
		load(faissfile, "fI", "fD", "faisstime")
		faisstime = @elapsed fI, fD = faiss_test(X, Q, k)
		jldsave(faissfile; fI, fD, faisstime)
		fI, fD, faisstime
	@show faisstime

	@info "== faiss index"
	fiI, fiD, faissindextime = if isfile(faissindexfile)
		load(faissindexfile, "fiI", "fiD", "faissindextime")
		faissindextime = @elapsed fiI, fiD = faiss_index(X, Q, k)
		jldsave(faissindexfile; fiI, fiD, faissindextime)
		fiI, fiD, faissindextime
	@show faissindextime
	@info "== SearchGraph"
	sI, sD, simtime = if isfile(simfile)
		load(simfile, "sI", "sD", "simtime")
		simtime = @elapsed sI, sD = simsearch_test(X, Q, k, dist)
		jldsave(simfile; sI, sD, simtime)
		sI, sD, simtime
	@show simtime
	@show faisstime => macrorecall(gI, fI)
	@show faissindextime => macrorecall(gI, fiI)
	@show simtime => macrorecall(gI, sI)


For instance, I ran it with julia -t64 in a 32 core workstation with hyperthreading (without GPUs). The functions also need to specify that you want multithreading, as seen in the code.

The results are:

faisstime => macrorecall(gI, fI) = 39.756926338 => 0.99999
faissindextime => macrorecall(gI, fiI) = 89.104171865 => 0.061250000000000054
simtime => macrorecall(gI, sI) = 114.653610227 => 0.4511400000000003

However, the ExhaustiveSearch computes the entire batch in 3.06s. So, the best approach for this setup and benchmark is brute force; larger query sets will perform differently. Note that SimilaritySearch achieves a pretty bad 0.45 recall (the recall goes from 0 to 1, higher is better), nonetheless, the HSNW of FAISS (with default setup) archives ~0.06 of recall. These bad performances are because the dimension of the dataset is pretty high. The SearchGraph index performs better than HNSW because even when it uses the default setup, it has an auto-tuning among its features (by default, it tries to get a good tradeoff between search speed and quality).

It is possible to improve the recall adding the following line after the index! line

optimize!(G, OptimizeParameters(kind=MinRecall(0.9)))

This will try to adjust its hyperparameters to reach the specified recall, and the result is:

simtime => macrorecall(gI, sI) = 93.36208313 => 0.8652799999999999

The small speed up is also a nice unexpected improvement due to the optimized hyperparameters.

I used SimilaritySearch v0.8.12 and Faiss v0.3.1 for this comparison.

Best regards,

I test the code, the results are pretty much what you reported. default HNSW index recall is really bad.
I don’t know much about HNSW, I usually use Flat, IVF type index more.
But I see that the Faiss-HNSW in * ann-benchmarks works quite well.

I tried “HNSW64,Flat” and “HNSW32,Flat”, recall < 0.1, really is not high.

I found a bug in Faiss HNSW that does not support IP metric, I wonder if this will affect the results.

You are using random vectors in a 128-dimensional unit hypercube, which is pretty hard for most metric indexes so they need to be tuned. The ann-benchmarks use a grid of configurations, and they use pretty large numbers in particular in efC and the number of neighbors (M I think). These parameters increase computational requirements, mainly, construction time and memory usage. The search time is also impacted.

Regarding the use of IP, I think it does, but you need to normalize vectors (this process converts the metric to computing the cosine/the angle). If you use it without normalizing vectors, the algorithm may not take the correct information for routing/searching. The same issue happens with the SearchGraph since they use the same principle, but with a different algorithm.

I don’t know the target application, but I think that If you need to use an index with the inner product without normalization perhaps you need to use a Minkowski metric instead.