Suppose I have a list of known vectors x_i \in \mathbb{R}^m for i = 1 \dots n.
Now my boss comes to me and gives me a vector y \in \mathbb{R}^m and says, find the i that minimizes
\| y - x_i \|.
Clearly, the best you can do is compute the norm for each i and find the smallest one, at computational cost O(nm).
Now, what changes if I am allowed to preprocess the x_i prior to my boss timing my code? In the m = 1 case, I could get clever and sort the (scalar) x_i values. Then I can find the optimal i using binary search, so (not counting the time to sort the list) the search takes only O(m \log n) time.
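To make that concrete, here is roughly what I mean in Julia (function names are just placeholders, and the 1-D version returns a position in the *sorted* array, so you'd keep a `sortperm` around to map back to the original indices):

```julia
# Brute force: O(n*m) per query, no preprocessing.
function nearest_index(xs::Vector{Vector{Float64}}, y::Vector{Float64})
    best_i, best_d = 0, Inf
    for (i, x) in enumerate(xs)
        d = sum(abs2, y .- x)   # squared distance; same argmin as the norm
        if d < best_d
            best_i, best_d = i, d
        end
    end
    return best_i
end

# m = 1: sort once up front, then each query is an O(log n) binary search.
function nearest_index_1d(xs_sorted::Vector{Float64}, y::Float64)
    j = searchsortedfirst(xs_sorted, y)   # first element >= y
    candidates = filter(k -> 1 <= k <= length(xs_sorted), (j - 1, j))
    return argmin(k -> abs(xs_sorted[k] - y), candidates)
end
```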
Is there a similar approach for the m-dimensional case?
I have heard whispers of something called a "vector database" that is optimized for these kinds of operations, but I suspect that anything with the word "database" in it is probably overkill for this simple (?) problem, and I'm wondering if there's a straightforward data structure one could use.
(Note that you donāt get to know y at the preprocessing stage, only the x_i.)
Miss this community; haven't been as active since changing jobs and doing almost everything in Python.
This is a problem where you get to pick 2 of fast, easy, and exact. The fast, easy, approximate solution is to pick k < m (think 3 or so) random vectors vₖ of length m and compute xᵢ ⋅ vₖ for each i, k. If ‖y − xᵢ‖ is small, then y ⋅ vₖ ≈ xᵢ ⋅ vₖ. You do the binary searches on the xᵢ ⋅ vₖ with the key y ⋅ vₖ, and find the ones that are close for all three. Making this algorithm robust and good takes some effort.
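A rough sketch of that idea; the window width `w`, the `Set` intersection, and all the names here are just one way to do it, and tuning them is where the "effort" part comes in:

```julia
using LinearAlgebra, Random

# Preprocess: k random unit directions vₖ and, per direction, the projections
# xᵢ ⋅ vₖ sorted together with the original index i.
struct ProjIndex
    vs::Vector{Vector{Float64}}
    proj::Vector{Vector{Tuple{Float64,Int}}}   # sorted (xᵢ⋅vₖ, i) per direction
end

function build_proj_index(xs::Vector{Vector{Float64}}; k::Int = 3)
    m = length(first(xs))
    vs = [normalize!(randn(m)) for _ in 1:k]
    proj = [sort!([(dot(x, v), i) for (i, x) in enumerate(xs)]) for v in vs]
    return ProjIndex(vs, proj)
end

# Query: binary-search a window of width 2w around y⋅vₖ on each direction,
# intersect the candidate index sets, then check the survivors exactly.
function approx_nearest(idx::ProjIndex, xs, y; w = 0.1)
    cand = nothing
    for (v, s) in zip(idx.vs, idx.proj)
        p = dot(y, v)
        lo = searchsortedfirst(s, (p - w, typemin(Int)))
        hi = searchsortedlast(s, (p + w, typemax(Int)))
        ids = Set(i for (_, i) in view(s, lo:hi))
        cand = cand === nothing ? ids : intersect(cand, ids)
    end
    isempty(cand) && return nothing   # window too narrow: widen w and retry
    return argmin(i -> sum(abs2, y .- xs[i]), cand)
end
```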
If I'm following right, this is exactly the problem of nearest neighbor search (in m-dimensional Euclidean space). There's a bunch of approximate and exact techniques to solve it (many at the Wikipedia link there).
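In Julia, NearestNeighbors.jl implements KD-trees and ball trees for exactly this; a minimal usage sketch (written from memory, so double-check the API before relying on it):

```julia
using NearestNeighbors

X = rand(3, 10_000)        # column i is xᵢ (an m × n matrix)
tree = KDTree(X)           # the preprocessing step: build the tree once

y = rand(3)
i, d = nn(tree, y)                    # index and distance of the exact nearest neighbor
idxs, dists = knn(tree, y, 5, true)   # 5 nearest, sorted by distance
```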
I think you may want to look into space partitioning data structures. Database technologies ALWAYS generate absurd hype. Whether they solve your problem or not I don't know, but proceed with caution. Always benchmark; you will be surprised.
If n and m are known in advance, I guess that O(n m) = O(1), though?
Eric pointed to the Wikipedia page which lists (presumably) all the known algorithms; however, I'd like to point out that there's a way of improving the naive (linear search) solution in a "Julian" manner: move the data known in advance into the type domain.
Furthermore, to maximally exploit CPU architecture parallelism, the "linear" search could be restructured as a binary tree (because n is known in advance).
Once all the data is in the type domain in an appropriate manner, and assuming the arithmetic (whatever represents \mathbb{R}) is some machine-native type like Float64, I think Julia could generate blazing fast code, especially if everything fits into the instruction cache.
But the move into the type domain necessitates a run-time dispatch, so this can't be efficient when n and m are small.
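A toy sketch of one reading of "move the data into the type domain", using StaticArrays to make m part of the type and an NTuple to fix n; this is not benchmarked, and the binary-tree restructuring is left out:

```julia
using StaticArrays

# m lives in the type (SVector{M}), so the inner loop over coordinates is
# unrolled; n is carried by the NTuple length.
function nearest_static(xs::NTuple{N,SVector{M,Float64}}, y::SVector{M,Float64}) where {N,M}
    best_i, best_d = 1, Inf
    for i in 1:N
        d = sum(abs2, y - xs[i])
        if d < best_d
            best_i, best_d = i, d
        end
    end
    return best_i
end

# Building the NTuple from runtime data is where the dynamic dispatch
# mentioned above happens; the search itself is then fully specialized.
xs = ntuple(_ -> @SVector(rand(3)), 8)
y  = @SVector rand(3)
nearest_static(xs, y)
```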
The task is \arg\min_x \lVert y - x \rVert, which is equivalent to \arg\min_x \lVert y - x \rVert^2. The squared norm expands to \lVert y \rVert^2 + \lVert x \rVert^2 - 2x^Ty. The optimization is done w.r.t. x, so \lVert y \rVert^2 is just an additive constant and the problem reduces to \arg\min_x \lVert x \rVert^2 - 2x^Ty.
If the list of vectors x_i is fixed, you can precompute the squared norms \lVert x_i \rVert^2 for each i = 1, \ldots, n. When your boss comes :), you only have to compute the dot products. The complexity remains \Theta(n m), but there is less computation per query and it parallelizes easily.
Edit: One more tweak: \arg\min_x \lVert x \rVert^2 - 2x^Ty = \arg\min_x \frac{1}{2} \lVert x \rVert^2 - x^Ty. So you can precompute \frac{1}{2} \lVert x_i \rVert^2 and save n multiplications when the boss comes!
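A direct translation of this into Julia (names are mine), with the x_i stored as the columns of a matrix so that the per-query work is a single matrix-vector product plus an O(n) scan:

```julia
using LinearAlgebra

# X is m × n with column i equal to xᵢ; precompute ½‖xᵢ‖² once.
struct HalfNormIndex
    X::Matrix{Float64}
    half_sq::Vector{Float64}
end

HalfNormIndex(X::Matrix{Float64}) = HalfNormIndex(X, vec(0.5 .* sum(abs2, X; dims = 1)))

# Query: argminᵢ ½‖xᵢ‖² − xᵢᵀy.
nearest(idx::HalfNormIndex, y) = argmin(idx.half_sq .- idx.X' * y)

X = rand(4, 1_000); y = rand(4)
idx = HalfNormIndex(X)
nearest(idx, y)   # same i as argminᵢ ‖y − xᵢ‖ (up to ties)
```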