Neighborhood.jl: Unified API for finding nearest neighbors in Julia

I’d like to announce Neighborhood.jl, a new Julia package that has the following purpose: Provide a unified API for finding nearest (or approximately nearest) neighbors in Julia. The API is exceptionally simple (documentation):

  1. Initialize the search structure by calling ss = searchstructure(TheTypeToUse, data, metric).
  2. Use it for making searches with search(ss, query, searchtype, ...; ...)
  3. searchtype is one of two types: WithinRange(::Real) and NeighborNumber(::k). Shorthands to search with inrange and knn are provided.
  4. Other methods that may (or may not) offer performance benefits depending on the type of ss are also provided.

The idea for this started in a previews post about VPTrees.jl, see here if interested. Me, @altre , @JonasIsensee and @zgornel were talking about making a unified interface.

Progress on this has been extremely slow, but I grew tired of constantly re-inventing the wheel at JuliaDynamics and constantly re-writing interfaces for NearestNeighbors.jl so Neighborhood.jl was published.

I documented the package as well as I could, and there is an example interfacing of KDTrees, with every single method being (to the best of my knowledge) the most performant possible. It is only my hope now that the community will feel compelled to interface more packages into Neighborhood.jl :slight_smile:

Having many packages in Neighborhood.jl is exceptionally useful for the scientific community. @JonasIsensee has done a small project where he has measured different packages and how they perform, and depending on the dimensionality of your data, how they are structured in their state space, how many searches you want, and how many searches you need to do in bulk, and whether you need filtering, it is not obvious which packages should be the fastest for your application. Neighborhood.jl allows you to immediately and trivially swap a nearest neighbor finding package in your algorithm by changing a single argument: TheTypeToUse (which could be KDTree, VPTree, etc.).

If you would like to contribute, that would be amazing. This issue lists packages that offer nearest (or approximately nearest) neighbor searching in Julia, and would be great to interface any of them. I’ve added a “Dev Docs” page in the documentation to make this easy.

21 Likes

This looks useful! But reading the dev docs I’m confused about the design:

Here is what you have to do to bring your package into Neighborhood.jl.

  • Add your package as a dependency to Neighborhood.jl.
  • Add the search type into the constant SSS in src/Neighborhood.jl and export it.
  • Then, proceed through this page and add as many methods as you can.

Wouldn’t it make more sense for each package to depend on Neighborhood.jl and extend its methods within the package code? That way Neighborhood.jl stays lightweight but all the packages can still share a common API.

Or am I misunderstanding the docs?

5 Likes

@ericphanson @dpsanders Initially that was the plan, to have a “FileIO” kind of feel. Unfortunately, I quickly realized this is impossible in this case, because you would have to convince the package authors to (1) actually answer you, (2) spent time merging your PR and (3) accept an extra dependency to their package. From my experience making Neighborhood.jl, even though this sounds something not so bad, it is not at all guaranteed that it will happen.

I can guarantee however that if you make a PR to Neighborhood.jl I will answer :slight_smile:

2 Likes

well, the idea outlined by @ericphanson sounds definitely more julian: a small package defines the interface that other packages extend. Array interface wouldn’t be so successful if to use it I would need something to be merged against Base, would it?

5 Likes

I think you’ve missed the point…

I absolutely agree with what you say. Feel free to try and convince the package authors on the points (1), (2) and (3) that I’ve outlined above!

1 Like

I think that you will eventually be undone by your own success here. If the architecture is such that every package that takes advantage of your API is a dependency to your API, then you will quickly approach Cuddle.jl in terms of dependencies, guaranteeing that this package won’t be practical for use by any other library developer, and eventually, for any user who needs this for a single use case.

3 Likes

@anon94023334 I feel like we have completely miss-communicated. I absolutely agree that the best scenario is by far that all packages that extend the Neighborhood.jl interface depend on it, instead of the other way around. This has been clear for me well before even starting Neighborhood.jl, but thank you for clarifying it further.

This is not practically possible at the moment because at least I couldn’t communicate this interface to anyone else outside myself and I do not own any package that does nearest neighbor searches. Of course, this situation is perfectly okay in the open source world, as not everyone might have time, or will, or just care, for some unified API at this very point in time.

Yet, I know that there are several Julia packages (at least some belonging in JuliaDynamics) that have big gains in using Neighborhood.jl API, and I’ve been re-inventing the wheel all this time for many of them, so this it was useful for me to do this now. The interface by itself is useless if you don’t have at least one package participating in this interface… My choices were: do nothing or do what I just did, and instead make the packages part of it.

It is of course my hope that this can change in the future, if more people become interested in this interface and we can go to the original plan of having a FileIO-like interface. But until then, to my understanding, this is the best working alternative. If I am wrong about this, please demonstrate a better alternative on how to go about it! This is the first time I am publishing unified interface for something I don’t actually own, so its really difficult to decide the best course of action. If I actually owned a package about nearest neighbors, this interface would already be there and we wouldn’t be having this discussion…

2 Likes

Why not, instead of adding new dependencies to Neighborhood.jl, just write a new package with the implementation you want? NeighborhoodMyPackage.jl. Then the user can load MyPacakge and NeighborhoodMyPackage?

6 Likes

I don’t know that you’re wrong about it, but we had a similar issue when we started up LightGraphs 5 years ago. Granted, the ecosystem was much smaller then, but perhaps some lessons learned from that might be instructive.

What I would recommend is to create a Neighborhood.jl package that defines an interface for the library you want. I’d then create a reference implementation that implements that interface, and show (via benchmarks, for example) why it’s better than what exists out there now. I’d then reach out to a couple of the heavy hitters in the domain space to show them, and offer to create a PR that replaces their existing code with code that depends on your library. It sounds like you’ve already started this, so… perhaps it’s just a matter of waiting until the code that uses Neighborhood.jl is merged?

If the performance/memory advantages to your library are as significant as you claim, this should be a no-brainer, especially if your own dependencies are kept to a minimum. This is important: as an example, LightGraphs used to have 44 transitive dependencies. This was way too many for us to ask folks to replace what they were doing with our code. When we cut that back to 7 or 8, however, it became much more feasible and our adoption rate started going through the roof relative to where it had been.

Similarly, now, LightGraphs has a policy that gives strong preference to contributions that add minimal dependencies.

Good luck with your package!

5 Likes

Here is me feeling guilty for not finding time to continue discussing the API. :frowning:

I had simply decided for myself to accept your suggestion and make VPTrees compatible as soon as you were finished. Looks like I missed the change of direction (or maybe didn’t get it in the first place?)

Perhaps you could just copy the API to a related package, then we could have both worlds?

Hi All, This looks like a great contribution, and even (unlike many contributed packages) has documentation! There is one thing holding me back from using this, however. Virtually every single invocation refers to an argument called 'query’ yet this is not defined anywhere or referred to anywhere in the documentation or the (?) help, nor are any potential choices listed, and there appear to be no usage examples anywhere. If I were an expert in this area, I am sure I would know what is being referred to, but I am not (I merely would like to use it). Can anyone please give me a hint as to what a ‘query’ here is, and what my choices are?
Thanks for any help.

It’s ultimately ‘what you’re searching for’.

There’s one example in the tests that generates an array of vectors, each with three floats inside. The query is a vector of three floats as well, this is the target of your search.

data = [[800, 700, 200], [1, 3, 9]]
query = [600, 600, 600]

You would find data[1] as the nearest neighbor using this query.

1 Like

Thanks so much for getting back to me, but I’m sorry I don’t understand at all. I would have thought, for instance, that a nearest negihbor search would just give the nearst 5 (say) points to each observation. I’m surprised that there doesn’t seem to be an easy way to ask just that.
You have given me an example of the variable ‘query’, but I have no idea what it represents conceptually, nor why it is necessary given that what one wants (the nearest neigbors) seems pretty clear to me. [Sorry if I am being thick]
Thanks again, and thanks also to yourself or anyone that might be able to shed a little more light on this.

That depends on what you want. Neighbors are relative.

nearest points to each observation

What is observation in your mind? If I’m standing in a crowd and I want to figure out who is closest to me, I’m the observation point.

If we distribute space out on a grid, perhaps I’m standing at position (2, 7), and there are three people nearby at positions (9, 15), (1, 6), (17, 7). It’s clear that if we want to find my nearest neighbour (using euclidean distance), we expect to return the person at (1, 6).

From my example above,

data = [[9, 15], [1, 6], [17, 7]]
query = [2, 7]

There are a number of ways of doing this search. Some simple to understand, some with large performance gains when the set of data to search is large. Neighborhood.jl is attempting to be a meta-package for many different ways of searching.

So to complete this search, we define a searchstructure

ss = searchstructure(BruteForce, data, Euclidean())

BruteForce is the algorithm we use to search. This one is the easy to understand but slow one. We want to search by Euclidean distance, so that’s our metric. What’s hidden here is how the data is arranged so that we can do an efficient search.

Then, we run the search using knn where k is the number of nearest neighbors you want to find.

closest = knn(ss, query, 1)  # find 1 neighbour closest to my location (query)

This is just shorthand for the search function. You can also do:

closest = search(ss, query, NeighborNumber(1)) 

The value of closest = ([2], [1.4142135623730951]), the first result is the index of the neighbor (data[2] = [1, 6]: correct!) and the second value is the calculated distance.

To generalise this to something closer to your assertion, nearest X points to each observation, lets extend data a bit:

data = [[9, 15], [1, 6], [17, 7], [5, 9], [11, 12], [1, 15], [12, 3]]
queries = [[9, 15], [1, 6]]
ss = searchstructure(BruteForce, data, Euclidean())
results = bulksearch(ss, query, NeighborNumber(3))

This looks for 3 nearest positions to the first two locations in the data.

Does that make sense?

3 Likes

Thanks so much. I see, so ‘query’ means reference point (the point at which you wish to evaluate, like the original data or a list of new examples). If I am right, then it was just the term ‘query’ that threw me. I think a brief mention of this in the documentation would go a long way! [e.g. query = points at which to evaluate].

To me the term ‘query’ I associate with database queries and I thought this was some
sort of structure of that sort.

Thanks again!

Sure thing. The maintainers are probably tracking this thread, but you could also put an issue in the tracker and link here.