Here are a few benchmarks for VPTrees, BKTrees, NearestNeighbors and linear search. Be aware though that random data is good for VPTree performance. Most interesting: For fast metrics, linear search is often faster. For expensive metrics, such as Levenshtein Distance, VPTrees is fastest. NearestNeighbors cannot accept Strings, which was my original use-case (with Levenshtein Distance).
using BenchmarkTools, Random, VPTrees, Distances, NearestNeighbors, BKTrees, DataStructures, StringDistances
Random.seed!(1);
rands = [rand(64) for _ in 1:10_000];
# Construction time
@btime VPTree(rands, (x,y) -> Distances.Minkowski(3.5)(x,y));
# 177.578 ms (116172 allocations: 7.15 MiB)
data = hcat(rands...);
@btime BallTree(data, Distances.Minkowski(3.5));
# 152.731 ms (32 allocations: 10.91 MiB)
# BKTrees are fastest in construction!
@btime BKTree((x,y) -> Distances.Minkowski(3.5)(x,y), rands);
# 21.077 ms (178625 allocations: 11.19 MiB)
const query = rand(64);
rands = [rand(64) for _ in 1:50_000];
vptree = VPTree(rands, (x,y) -> Distances.Minkowski(3.5)(x,y));
data = hcat(rands...);
btree = BallTree(data, Distances.Minkowski(3.5));
bktree = BKTree((x,y) -> Distances.Minkowski(3.5)(x,y), rands);
@btime inrange(btree, query, 1.4);
# 84.242 ms (10 allocations: 16.39 KiB)
@btime BKTrees.find(bktree, query, 2);
# 124.779 ms (250034 allocations: 8.87 MiB)
@btime VPTrees.find(vptree, query, 1.4);
# 77.486 ms (17 allocations: 1.00 MiB)
# VPTrees seems fastest, but for cheap metrics, it's hard to beat findall! (tested up to 1_000_000 random points)
@btime findall(x -> Distances.Minkowski(3.5)(x, query) < 1.4, rands);
# 65.777 ms (20 allocations: 1.00 MiB)
@btime knn(btree, query, 20);
# 80.167 ms (3 allocations: 512 bytes)
@btime BKTrees.find(bktree, query, 10; k=20);
# 126.905 ms (250034 allocations: 8.87 MiB)
@btime find_nearest(vptree, query, 20);
# 76.402 ms (8 allocations: 1.36 KiB)
# Once again, don't use trees for fast metrics!
@btime nsmallest(20, [(Distances.Minkowski(3.5)(query,r), i) for (i,r) in enumerate(rands)]);
# 65.691 ms (50011 allocations: 2.29 MiB)
metric(a::AbstractString, b::AbstractString) = evaluate(Levenshtein(), a,b);
randstrings = [randstring("01234456789") for _ in 1:50_000];
vptree = VPTree(randstrings, metric);
bktree = BKTree(metric, randstrings);
const stringquery = randstrings[rand(1:length(randstrings))];
@btime BKTrees.find(bktree, stringquery, 2);
# 39.493 ms (148066 allocations: 6.94 MiB)
@btime VPTrees.find(vptree, stringquery, 2);
# 9.668 ms (40066 allocations: 5.50 MiB)
# With expensive metrics, VPTrees are finally actually faster than linear search
@btime findall(a -> metric(stringquery, a) <= 2, randstrings);
# 12.637 ms (50005 allocations: 6.86 MiB)
# Unfortunately NearestNeighbors does not accept strings, so this does not work:
btree = BallTree(randstrings, Levenshtein())