[ANN] VPTrees.jl - nearest neighbor search in metric spaces

VPTrees.jl

A Vantage Point Tree is a simple data structure effective for use in fast nearest neighbor search and range search for any metric developed by Peter Yianilos: Data structures and algorithms for nearest neighbor search in general metric spaces.
It’s very simple to use, for example with Levenshtein distance using StringDistances.jl:

using VPTrees
using StringDistances

data = ["bla", "blub", "asdf", ":assd", "ast", "baube"]
metric = (a, b) -> evaluate(Levenshtein(),a,b)
vptree = VPTree(data, metric)
query = "blau"
radius = 2
data[find(vptree, query, radius)]
# 2-element Array{String,1}:
#  "bla" 
#  "blub"
n_neighbors = 3
data[find_nearest(vptree, query, n_neighbors)]
# 3-element Array{String,1}:
#  "baube"
#  "blub" 
#  "bla"
12 Likes

Given the proliferation of metric trees, would it make sense to merge them under one API and package (and along the way document which are appropriate for what kind of data)? Maybe nearest neighbors? I have a cover tree and some self-made construction lying around, ready to donate.

Many trivial algorithms (branch & bound search, approx branch & bound search) could work with most trees, as could some less trivial ones, like dual tree search.

1 Like

@altre Can you please do some benchmark comparisons with NearestNeighbors.jl?

@foobar_lv2 there is a Julia org: https://github.com/JuliaNeighbors maybe it is time to put them all there as a first step, and as a second step consider a unifying API?

4 Likes

There is also https://github.com/KristofferC/NearestNeighbors.jl by Kristoffer Carlsson.

It would be really nice all of them having the same api if possible and unify to the same package. It is a lot of work though.

3 Likes

Here are a few benchmarks for VPTrees, BKTrees, NearestNeighbors and linear search. Be aware though that random data is good for VPTree performance. Most interesting: For fast metrics, linear search is often faster. For expensive metrics, such as Levenshtein Distance, VPTrees is fastest. NearestNeighbors cannot accept Strings, which was my original use-case (with Levenshtein Distance).

using BenchmarkTools, Random, VPTrees, Distances, NearestNeighbors, BKTrees, DataStructures, StringDistances
Random.seed!(1);

rands = [rand(64) for _ in 1:10_000];
# Construction time
@btime VPTree(rands, (x,y) -> Distances.Minkowski(3.5)(x,y));
# 177.578 ms (116172 allocations: 7.15 MiB)
data = hcat(rands...);
@btime BallTree(data, Distances.Minkowski(3.5));
# 152.731 ms (32 allocations: 10.91 MiB)
# BKTrees are fastest in construction!
@btime BKTree((x,y) -> Distances.Minkowski(3.5)(x,y), rands);
# 21.077 ms (178625 allocations: 11.19 MiB)

const query = rand(64);
rands = [rand(64) for _ in 1:50_000];
vptree = VPTree(rands, (x,y) -> Distances.Minkowski(3.5)(x,y));
data = hcat(rands...);
btree = BallTree(data, Distances.Minkowski(3.5));
bktree = BKTree((x,y) -> Distances.Minkowski(3.5)(x,y), rands);
@btime inrange(btree, query, 1.4);
# 84.242 ms (10 allocations: 16.39 KiB)
@btime BKTrees.find(bktree, query, 2);
# 124.779 ms (250034 allocations: 8.87 MiB)
@btime VPTrees.find(vptree, query, 1.4);
# 77.486 ms (17 allocations: 1.00 MiB)
# VPTrees seems fastest, but for cheap metrics, it's hard to beat findall! (tested up to 1_000_000 random points)
@btime findall(x -> Distances.Minkowski(3.5)(x, query) < 1.4, rands);
# 65.777 ms (20 allocations: 1.00 MiB)

@btime knn(btree, query, 20);
# 80.167 ms (3 allocations: 512 bytes)
@btime BKTrees.find(bktree, query, 10; k=20);
# 126.905 ms (250034 allocations: 8.87 MiB)
@btime find_nearest(vptree, query, 20);
# 76.402 ms (8 allocations: 1.36 KiB)
# Once again, don't use trees for fast metrics!
@btime nsmallest(20, [(Distances.Minkowski(3.5)(query,r), i) for (i,r) in enumerate(rands)]);
# 65.691 ms (50011 allocations: 2.29 MiB)

metric(a::AbstractString, b::AbstractString) = evaluate(Levenshtein(), a,b);
randstrings = [randstring("01234456789") for _ in 1:50_000];
vptree = VPTree(randstrings, metric);
bktree = BKTree(metric, randstrings);
const stringquery = randstrings[rand(1:length(randstrings))];
@btime BKTrees.find(bktree, stringquery, 2);
# 39.493 ms (148066 allocations: 6.94 MiB)
@btime VPTrees.find(vptree, stringquery, 2);
# 9.668 ms (40066 allocations: 5.50 MiB)
# With expensive metrics, VPTrees are finally actually faster than linear search
@btime findall(a -> metric(stringquery, a) <= 2, randstrings);
# 12.637 ms (50005 allocations: 6.86 MiB)
# Unfortunately NearestNeighbors does not accept strings, so this does not work:
btree = BallTree(randstrings, Levenshtein())
5 Likes

Thanks for sharing these benchmarks. At JuliaDynamics we are constantly doing a lot of nearest neighbor searching and it will be interesting to compare these three packages also for the datasets that we use there.

They are fundamentally different in the sense that they are composed of low-dimensional static arrays. I know that NearestNeighbors have been optimized a lot for these kind of datasets.

In JuliaDynamics, specifically the “Base” package for datasets, I have defined a neighborhood interface that links to NearestNeighbors.jl . It would be trivial to extend this to accommodate all packages. I only have to define dispatch for the second argument (currently only “tree”) of the neighborhood function

maybe @altre , @kristoffer.carlsson and @JonasIsensee and @zgornel you wouldbe interested into bringing everything together in JuliaNeighbors ? I’d be happy to contribute the common interface via neighborhood or something similar.

3 Likes

Trivial is probably a bit too optimistic.
For example Approximate NN libraries like https://github.com/JuliaNeighbors/HNSW.jl
will need additional arguments.

But I agree that it would be worthwhile to work on some kind of shared API.

Makes sense to me. I am a bit confused about what is the approach to transfer a registered package from one organization/account (specifically for IVFADC.jl and BKTrees.jl )

Some sort of “informal” API proposal for NN/ANN methods should also exist somewhere.

approach to transfer a registered package

You just transfer it by going to the Settings page. Nothing breaks because github will autolink your repo to the moved one. But one typically also does a PR at the GeneralRegistry that changes a single url.

Here is my proposal for the universal API, composed of two functions

datastructure = whatever_structure_each_package_uses
neighborhood(query, datastructure, ntype; kwargs...)

It uses the “datastruture” to find nearest neighbors of the query. As there are (almost) always two types of neighbhorhood (we use FixedSize(ε::Real) and FixedMass(k::Int)) that find either all neighbhors in range ε or the k nearest neighbors.

Each package just has to implement two versions of neighborhood.

Can you be more explicit here? Im using the registrator and tag bots. Im reluctant to break things because I use some of this stuff in production and experimenting is not an option. Hopefully, there is a procedure somewhere to transfer a registered package and update the general registry; im not aware of any as this is a pretty rare case.

We obviously have different needs and views. I for example am interested in interfaces to work with indexed points, mostly pusing, poping, insertion and delition (with index update). AFAIK, only IVFADC supports push!/pop!/deleteat! etc. Also, range searches are optional…

We obviously have different needs and views

I see. Thanks for showing this side of the coin.

You are right, and I also cannot imagine a common API that satisfy both the needs from my field, as well as yours.

I have done this transfer of packages many times already, across several different organizations and nothing broke. In fact, everything work just like before without any change. But I can say for sure that I am not willing to promise anything. In the end, a transfer is not really necessary. It helps the community, but that’s about it.

The biggest concern is that the common API does not seem possible at all given the difference in needs.

No need to give up just yet.
Given the various features supported by different algorithms it does not make sense
to limit ourselves to a common API. That does not mean that we can’t add one that would make benchmarks and comparisons easier.
Im thinking of something along the lines of

ds = produce_structure(MyAlgType, data; kwargs...) # Initialize structure
do_preparatory_work!(ds; kwargs....) # Build your tree, graph or whatever
search(ds, query; kwargs...) # Do your searches 

# Implement these, if your structure allows that.
insert!(ds, new_point) # is new_point an actual vector or just an index to somewhere?
delete!(ds, point) 

This or something similar would hopefully provide an interface to switch out different algorithms for comparison purposes without having to change much code.

What are your thoughts?

1 Like

I like a lot the piece of code you paste. I would still argue that search should have 3 arguments, because searching inrange and searching knn is at a fundamental level different enough of an operation that I wouldn’t use a keyword for it.

But other than that what you suggest seems fine for me, and easy ti implement as well.

Other things to consider would be return values.
Some structures might return points, or indices, and maybe not the distances?

Seems there is interest. Let’s continue the conversation over here: https://github.com/JuliaNeighbors/Neighborhood.jl/issues/1 to keep things localized.

@zgornel @altre and @kristoffer.carlsson if you are interested you are welcome to join the conversation there.

Hi, I am not currently working on the nearest neighbors problem, but I am interested in the metric space interoperability in general. For example, my package DirectSum.jl is a repository for organizing metric spaces and vector bundle manifolds (abstractly). These vector bundles with a metric can be used in AbstractTensors.jl and Grassmann.jl to specify the metric of the statically allocated tensor algebra elements I work with (similar to what @Datseris is probably in need of). Not sure if this is entirely relevant to the discussion here, but my general intention with these packages is to provide a universal interface for computing in manifold spaces having a metric specification (and direct sum interoperability). If this kind of interoperability interests you, please let me know, but maybe my needs are completely different from the needs talked about here.

Also, I happen to have my own unrelated binary tree package called Dendriform.jl, which I plan to revise and update soon, sorry if this is getting off topic, but the whole tree and metric space thing resonates with me a bit, so please let me know if the metric spaces I am talking about might be useful for perhaps @Datseris needs for metrics.

That’s good enough for me. Could you make an idiot’s guide to the steps (no pressure really)? Thanks :wink:

Well, any recommendation would be a plus for anyone that is thinking of contributing with new packages. For example, preferred returned format i.e. Tuple{Vector{<:Integer}, Vector{<:AbstractFloat}}, methods names i.e. knn/knn_search etc.

Will do.

It’s simpler than what you’d expect actually :wink:

  1. Transfer repo by going to its settings.
  2. Add JuliaRegistrator / tagbot in the new host organization. But I think @JonasIsensee has already done that for JuliaNeighbors
  3. Do A PR in the General Registry changing e.g. this line (or the corresponding one for the transferred package) to the new host url.

Alright, thanks. Will try in the following days.
EDIT: All done :wink:

2 Likes