Do I need a vector database to solve this problem?

Suppose I have a list of known vectors x_i \in \mathbb{R}^m for i = 1 \dots n.

Now my boss comes to me and gives me a vector y \in \mathbb{R}^m and says, find the i that minimizes

\| y - x_i \|.

Clearly, the best you can do is compute the norm for each i and find the smallest one, at computational cost O(nm).
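For concreteness, the brute-force baseline in Julia is only a couple of lines (a sketch with made-up data; `argmin(f, itr)` assumes Julia ≥ 1.7):

```julia
using LinearAlgebra

# Hypothetical data: columns of X are the x_i, with m = 5 and n = 100.
X = randn(5, 100)
y = randn(5)

# Brute force: one norm per column, O(nm) total.
i_best = argmin(i -> norm(y - X[:, i]), 1:size(X, 2))
```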

Now, what changes if I am allowed to preprocess the x_i prior to my boss timing my code? In the m = 1 case, I could get clever and sort the (scalar) x_i values. Then I can find the optimal i using binary search, so (not counting the time to sort the list) the search takes only O(m \log n) time.
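In code, the m = 1 trick is just a sort plus `searchsortedfirst` (a sketch with made-up data; the nearest value is either at the insertion point or just to its left):

```julia
xs = sort(randn(1000))                            # preprocessing: sort once, O(n log n)
y  = 0.3
pos = searchsortedfirst(xs, y)                    # binary search, O(log n)
candidates = max(pos - 1, 1):min(pos, length(xs)) # check insertion point and its left neighbor
i_best = argmin(i -> abs(y - xs[i]), candidates)
```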

Is there a similar approach for the m-dimensional case?

I have heard whispers of something called a “vector database” that is optimized for these kinds of operations, but I suspect that anything with the word “database” in it is probably overkill for this simple (?) problem, and I’m wondering if there’s a straightforward data structure one could use.

(Note that you don’t get to know y at the preprocessing stage, only the x_i.)

Miss this community—haven’t been as active since changing jobs and doing almost everything in Python.


This is a problem where you get to pick two of: fast, easy, and exact. The fast, easy, approximate solution is to pick k < m (think 3 or so) random vectors v_1, \dots, v_k of length m and precompute x_i \cdot v_j for each i, j. If \| y - x_i \| is small, then y \cdot v_j \approx x_i \cdot v_j. So you binary-search the sorted values x_i \cdot v_j with the key y \cdot v_j, and keep the indices i that come out close for all k projections. Making this algorithm robust and good takes some effort.
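A rough sketch of that idea in Julia (the function names and the `width` cutoff are illustrative choices, not a robust implementation, as noted above):

```julia
using LinearAlgebra

# Preprocess: project every x_i onto k random directions and sort each set of projections.
function build_projections(X::Matrix{Float64}, k::Int)    # columns of X are the x_i
    m, _ = size(X)
    V = randn(m, k)                                       # k random directions v_j
    P = X' * V                                            # n × k projections x_i ⋅ v_j
    order  = [sortperm(P[:, j]) for j in 1:k]             # sorted order per direction
    sorted = [P[order[j], j] for j in 1:k]
    return V, order, sorted
end

# Query: binary-search each projection of y, intersect the nearby indices, check those exactly.
function approx_nearest(X, V, order, sorted, y; width::Int = 20)
    candidates = nothing
    for j in axes(V, 2)
        pos = searchsortedfirst(sorted[j], dot(view(V, :, j), y))
        lo, hi = max(pos - width, 1), min(pos + width, length(order[j]))
        near = Set(order[j][lo:hi])
        candidates = candidates === nothing ? near : intersect(candidates, near)
    end
    pool = isempty(candidates) ? (1:size(X, 2)) : collect(candidates)  # fall back if empty
    return argmin(i -> norm(y - view(X, :, i)), pool)
end
```

Each query then costs roughly O(km) for the projections plus O(k \log n) for the binary searches, and exact distances are only computed for the surviving candidates.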


If I’m following right, this is exactly the problem of nearest neighbor search (in m-dimensional Euclidean space). There’s a bunch of approximate and exact techniques to solve it (many at the Wikipedia link there).


I think you may want to look into space-partitioning data structures. Database technologies ALWAYS generate absurd hype. Whether they solve your problem or not I don’t know, but proceed with caution. Always benchmark; you will be surprised.


If n and m are known in advance, I guess that O(n m) = O(1), though?

Eric pointed to the Wikipedia page which lists (presumably) all the known algorithms; however, I’d like to point out that there’s a way of improving the naive (linear search) solution in a “Julian” manner: move the data known in advance into the type domain.

Furthermore, to maximally exploit CPU architecture parallelism, the “linear” search could be restructured as a binary tree (because n is known in advance).

Once all the data is in the type domain in an appropriate manner, and assuming the arithmetic (whatever represents \mathbb{R}) is some machine-native type like Float64, I think Julia could generate blazing fast code, especially if everything fits into the instruction cache.

But the move into the type domain necessitates a run time dispatch, so this can’t be efficient when n and m are small.
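A minimal sketch of what that could look like, assuming StaticArrays (my choice of tool, not mentioned above) to put m in the type domain and an NTuple to put n there:

```julia
using StaticArrays

# Hypothetical data: n = 8 points in R^3, with both sizes visible to the compiler.
const XS = ntuple(_ -> SVector{3}(randn(3)), 8)

function nearest(xs::NTuple{N,SVector{M,Float64}}, y::SVector{M,Float64}) where {N,M}
    best_i = 1
    best_d = sum(abs2, y - xs[1])
    for i in 2:N                  # N is a compile-time constant, so the loop can be unrolled
        d = sum(abs2, y - xs[i])
        if d < best_d
            best_i, best_d = i, d
        end
    end
    return best_i
end

nearest(XS, SVector{3}(randn(3)))
```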

You might also preprocess your data.

The task is \arg\min_x \lVert y - x \rVert, which is equivalent to \arg\min_x \lVert y - x \rVert^2. The squared norm expands to \lVert y \rVert^2 + \lVert x \rVert^2 - 2x^Ty. The optimization is done w.r.t. x, so \lVert y \rVert^2 is just an additive constant and the problem reduces to \arg\min_x \lVert x \rVert^2 - 2x^Ty.

If the list of vectors x_i is fixed, you can precompute the squared norms \lVert x_i \rVert^2 for each i=1,\ldots, n. When your boss comes :), you only have to compute the dot products. The complexity remains \Theta(n m), but it is less computation and it can be parallelized easily.

Edit: One more tweak: \arg\min_x \lVert x \rVert^2 - 2x^Ty = \arg\min_x \frac{1}{2} \lVert x \rVert^2 - x^Ty. So you can precompute \frac{1}{2} \lVert x_i \rVert^2 and save n multiplications when the boss comes!
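In code, the whole trick is a couple of lines (a sketch with the x_i stored as columns of a matrix and made-up function names):

```julia
using LinearAlgebra

# Preprocess once: ½‖x_i‖² for each column of X.
half_sqnorms(X::Matrix{Float64}) = vec(sum(abs2, X; dims = 1)) ./ 2

# When y arrives: argmin_i (½‖x_i‖² − x_iᵀy), i.e. one matrix–vector product plus an argmin.
nearest_index(X::Matrix{Float64}, h::Vector{Float64}, y::Vector{Float64}) = argmin(h .- X' * y)
```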


Sounds like exactly the problem for GitHub - KristofferC/NearestNeighbors.jl: High performance nearest neighbor data structures and algorithms for Julia.
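For reference, a minimal usage sketch (points stored as columns; `knn` with k = 1 should return the index of, and distance to, the closest x_i, if I’m remembering the API right):

```julia
using NearestNeighbors

X = randn(5, 10_000)              # columns are the x_i
tree = KDTree(X)                  # preprocessing: build the k-d tree once

y = randn(5)
idxs, dists = knn(tree, y, 1)     # nearest neighbor: index and distance
```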


If I’m understanding this correctly, you have to compute n dot products of m-vectors, so this is still O(nm) complexity, right?


Yes, the complexity is the same.