Closest distance between two Euclidian grids

cadojo · May 15, 2021, 11:46pm

I have two large data sets: A (size 1_500_000, 3), and B (size 1_500, 3). I’d like to find the closest point between A and B. I could iterate through each possible combination, but that is O(n^2) and is not possible to complete in a reasonable time for datasets this size.

I asked a similar question for one dimension here. I understand sorting the data would be helpful, but I’m not sure how to apply this to a 2-dimensional case.

Is there a package that can compute the closest point between two Euclidian grids?

rafael.guerra · May 16, 2021, 12:01am

Did you check NearestNeighbors.jl?

Including herein a MWE, that hopefully is correct:

using NearestNeighbors, Random, Printf

# create input data:
Random.seed!(1);    A = rand(3, 1_500_000)
Random.seed!(1234); B = rand(3, 1_500)

kdtree = KDTree(A)
idxs, dists = nn(kdtree, B)  # convenience function for k=1
ixb = argmin(dists)
ixa = idxs[ixb]
@printf("Minimum distance = %.5f, at:", dists[ixb])
@printf("A[:,%i] = (%.4f, %.4f, %.4f)\n", ixa, A[:,ixa]...)   # splatted values supported
@printf("B[:,%i] = (%.4f, %.4f, %.4f)\n", ixb, B[:,ixb]...)   #

# Printed results:
Minimum distance = 0.00045, at:
A[:,1080399] = (0.3572, 0.1117, 0.4786)
B[:,1136] = (0.3571, 0.1119, 0.4790)

# QC solution:
using LinearAlgebra
julia> dists[ixb] ≈ norm(A[:,ixa] -  B[:,ixb])
true

lmiq · May 16, 2021, 1:32am

In the field of molecular simulations we always rely on Cell lists to compute short range distances. That could be used to compute the nearest neighbor in roughly O(n) time.*

I am curious about the algorithms implemented in the package above. Anyone can point a reference of the algorithms implemented there?

*It has been done: “Near-neighbor calculations using a modified cell-linked list method - ScienceDirect” Near-neighbor calculations using a modified cell-linked list method - ScienceDirect

Edit: I understand this is the type of thing NearestNeighbors.jl is doing: k-d tree - Wikipedia

rafael.guerra · May 16, 2021, 1:47pm

Do not know if the nearest neighbor algorithm is among the most efficient to answer OP’s question.

In the MWE example above, it seems that one may need two steps: (i) find in A the 1500 nearest neighbors of B, and (ii) find the nearest among the nearest.

There may be a better solution while using NearestNeighbors.jl.

cadojo · May 16, 2021, 4:16pm

Love the Julia community, this is perfect. Nearest neighbor is found in just a few seconds, where my previous O(n^2) implementation never returned. Thanks for pointing this out!

PetrKryslUCSD · May 16, 2021, 7:24pm

Isn’t this two one-dimensional searches really?

cadojo · May 16, 2021, 7:34pm

Might be missing something, but I don’t see how.

I have a collection of departure points that I want to travel from: A. And I have a collection of arrival points that I want to travel to: B.

I want to find the pair of points (one from A, one from B) that are the closest together. The naive solution for this is…

minimum = Inf
optimal = ([NaN, NaN, NaN], [NaN, NaN, NaN])
for departure in A
    for arrival in B
        difference = norm(departure .- arrival)
        if difference < minimum
            minimum = difference
            optimal = (departure, arrival)
        end
    end
end

rafael.guerra · May 16, 2021, 7:39pm

Please check this article out, that points to a Julia solution here.

NB: unfortunately, the code posted there is labelled “Brute-force algorithm”

PetrKryslUCSD · May 16, 2021, 7:40pm

Sort in X, Y, Z using the Manhattan norm. Then sequential search.

lmiq · May 16, 2021, 7:54pm

If I understand what the Manhattan norm is, computing it alone would already be a O(n*m) computation.

(The algorithms must avoid looping over all pairs of points, and for that some sort of spacial classification is needed)

@cadojo one curiosity: since your points are three dimensional, I thought you were dealing with atoms. What kind of travel that is though? Between galaxies?

PetrKryslUCSD · May 16, 2021, 8:33pm

I misunderstood the OP.

lmiq · May 16, 2021, 9:52pm

Just curiosity, I implemented a (non-general, not very optimized) form of the cell lists above. It resulted to be be ~8 times slower than the method implemented in NearestNeighbors.jl. I will have to check if the method there applied to my actual use case . I hope it does, having it 8 times faster won’t be bad at all.

NN(A, B) = (0.00044736197216615207, 1080399, 1136)
  0.918342 seconds (7.54 k allocations: 96.740 MiB)
MD(A, B) = (0.000447361972166152, 1080399, 1136)
  7.259169 seconds (4 allocations: 11.452 MiB, 0.11% gc time)
(0.000447361972166152, 1080399, 1136)

import Pkg
Pkg.activate("nn")

using BenchmarkTools
using NearestNeighbors, Random, Printf

N1 = 15 * 10^5
N2 = 15 * 10^2

# create input data:
Random.seed!(1);    A = rand(3, N1)
Random.seed!(1234); B = rand(3, N2)

# NearestNeighbors algorithm:
function NN(A,B)
  kdtree = KDTree(A)
  idxs, dists = nn(kdtree, B)  # convenience function for k=1
  ixb = argmin(dists)
  ixa = idxs[ixb]
  return dists[ixb], ixa, ixb
end

@show NN(A,B)
@time NN(A,B)

function icell3D(x, n)
  i = trunc(Int,(x[1]/n))+1
  j = trunc(Int,(x[2]/n))+1
  k = trunc(Int,(x[3]/n))+1
  return i, j, k
end

function MD(A,B)

  d = 0.1 # assume that we know that the shortest distance is smaller than this
  n = floor(Int,1/d)

  # cell lists
  fatomA = zeros(Int,n,n,n) # first atom of A in the cell
  natomA = zeros(Int,size(A,2)) # next atom of A in the cell
  # fill cells
  for i in axes(A,2)  
    ic, jc, kc = icell3D(@view(A[:,i]),n)
    natomA[i] = fatomA[ic,jc,kc]
    fatomA[ic,jc,kc] = i
  end

  iamin = 0
  ibmin = 0
  dmin = +Inf
  # loop over atoms of B
  @inbounds for ib in axes(B,2)
    xb = @view(B[:,ib])
    i, j, k = icell3D(xb,n)
    # Loop over vicinal cells to compute distances to solvent atoms, and
    # add data to dc structure (includes current cell)
    for ic in i-1:i+1
      (ic == 0 || ic > n) && continue
      for jc in j-1:j+1
        (jc == 0 || jc > n) && continue
        for kc in k-1:k+1
          (kc == 0 || kc > n) && continue
          ia = fatomA[ic,jc,kc]
          while ia > 0 
            dij = (A[1,ia]-B[1,ib])^2 + 
                  (A[2,ia]-B[2,ib])^2 + 
                  (A[3,ia]-B[3,ib])^2
            if dij < dmin
              dmin = dij
              iamin = ia
              ibmin = ib
            end
            ia = natomA[ia]
          end
        end
      end
    end
  end
  
  return sqrt(dmin), iamin, ibmin
end

@show MD(A,B)
@time MD(A,B)

cadojo · May 17, 2021, 7:58pm

What kind of travel that is though? Between galaxies?

Travel in our solar system! I’m using invariant manifolds within the Circular Restricted Three-body Problem to find low-cost transfers from Earth to Jupiter. Repo here!

lmiq · May 17, 2021, 9:04pm

Cool! And why there are so many objects?

cadojo · May 17, 2021, 9:31pm

I’m trying to find the intersection of a manifold that departs the Sun-Earth system, and a manifold that converges to a Halo orbit in the Sun-Jupiter system. I’d ideally like to find a machine-level precise intersection between the manifold positions. Once I find two positions (one in each manifold) that are nearly identical, I can treat the difference in velocity between the two overlapping points in the manifolds as an impulsive maneuver.

I don’t have any analytical bounds for the manifolds, I just have the raw manifold positions and velocities that I can compute. So, to find a precise intersection, I’m collection a lot of data points, with the hope that there’s a better chance there is an intersection.

Unfortunately, it’s seeming like the closest approach between the two manifolds I can find (using a numerical optimizer with Sun-Jupiter Halo orbit amplitude as the single parameter) is around 3000 km. This is not machine level precision! I could use interpolation between data points, but I’ve received guidance that interpolation isn’t ideal here.

So my current idea is to build a better optimization. I’m thinking of collection a lot of orbital states within the Sun-Earth Halo, and the Sun-Jupiter Halo. I’ll then use a box optimization with the following parameters:

Phase angle of the Sun-Earth Halo departure (onto the departure manifold)
Phase angle of the Sun-Jupiter Halo departure (onto the arrival manifold)
Sun-Earth departure perturbation amplitude
Sun-Jupiter arrival perturbation amplitude

So rather than compute several hundred trajectories off of each Halo orbit, and use nearest neighbors to find the closest approach between the two manifolds, I’ll have the optimizer…

Pick parameters
Propagate the departure and arrival trajectories for some pre-set duration
Find the closest approach between the two trajectories
Return the distance between the two closest orbital states as the cost for the optimizer to minimize

If anyone has a better idea for how to do this, it would be much appreciated!

As a brief self-plug, all of this functionality is available in GeneralAstrodynamics on GitHub and Julia’s general registry. I’m planning on polishing this up and doing a package announcement for a 1.0 release this summer!

cortner · May 17, 2021, 10:08pm

FWIW cell lists are implemented in

lmiq · May 17, 2021, 11:23pm

I implemented cell lists in the ComplexMixtures package, with support to periodic boundaries (with straight angles only). Do you have some easy benchmark so I can compare?

I should have made it more shearable, but when I started that I didn’t understand the modularity that Julia offers. Now I have to do some rewriting to make that useful to others if that is the case.

cortner · May 17, 2021, 11:32pm

I used to have some benchmarks to compare against matscipy. I might be able to revive them. Would be great to combine codes (or get rid of one) since mine is still missing some functionality

cortner · May 17, 2021, 11:33pm

PS - it’s fairly well debugged and should Desk with arbitrary cell shapes. Not sure whether it is robust when the cell collapses though

lmiq · June 5, 2021, 3:20am

Just if you are interested. I have put up a package to compute function mappings dependent on short-ranged pairwise particle distances, and it can be used to compute the minimum distance between two sets of points. Depending on what you know from your point distribution, it can be faster than the nearest neighbour search above. It is here: https://github.com/m3g/CellListMap.jl

I added how to compute the same thing above in the example, and got (with 4 threads):

NN(A, B) = (0.00044736197216615207, 1080399, 1136)
  987.730 ms (7541 allocations: 96.74 MiB)
getmind(x, y) = (0.000447361972166152, 1136, 1080399)
  71.849 ms (141 allocations: 22.90 MiB)

single threaded I get 266.265 ms for getmind.

The “box side” and the “cutoff” must be tunned with the knowledge of your system. In particular, if you do not have periodic boundary conditions, as was the case here, I had to set the box side to be larger than the maximum dimensions plus the cutoff. Of course the time is very much dependent on the cutoff used, and there is some optimal value.

Code

import Pkg
Pkg.activate("nn")

#Pkg.add(url="https://github.com/m3g/CellListMap.jl")
using CellListMap, StaticArrays
using BenchmarkTools, Test
using NearestNeighbors, Random, Printf

N1 = 15 * 10^5
N2 = 15 * 10^2

# create input data:
Random.seed!(1);    A = rand(3, N1)
Random.seed!(1234); B = rand(3, N2)

function NN(A,B)
  kdtree = KDTree(A)
  idxs, dists = nn(kdtree, B)  # convenience function for k=1
  ixb = argmin(dists)
  ixa = idxs[ixb]
  return dists[ixb], ixa, ixb
end

x = [ SVector{3,Float64}(B[:,i]) for i in 1:N2 ] 
y = [ SVector{3,Float64}(A[:,i]) for i in 1:N1 ] 

function getmind(x,y)
  
  # Number of particles, sides and cutoff
  sides = [1.2,1.2,1.2]
  cutoff = 0.03
  box = Box(sides,cutoff)

  # Initialize auxiliary linked lists (largest set!)
  lc = LinkedLists(length(y))

  # Initializing linked cells with these positions (largest set!)
  initcells!(y,box,lc)

  # Function that keeps the minimum distance
  f(i,j,d2,mind) = d2 < mind[1] ? (d2,i,j) : mind

  # We have to define our own reduce function here for the tuple
  function reduce_mind(output_threaded)
    mind = output_threaded[1]
    for i in 2:Threads.nthreads()
      if output_threaded[i][1] < mind[1]
        mind = output_threaded[i]
      end
    end
    return mind
  end

  # Result
  mind = ( +Inf, 0, 0 )

  # Run pairwise computation
  mind = map_pairwise!(
    (x,y,i,j,d2,out) -> f(i,j,d2,out),
    mind,x,y,box,lc;reduce=reduce_mind
  )
  return (sqrt(mind[1]),mind[2],mind[3])

end

@show NN(A,B)
@show getmind(x,y)

@btime NN(A,B)
@btime getmind(x,y)

nothing

Topic		Replies	Views
Get distances of many points within a max range, any performance improvements? Performance	8	402	November 3, 2021
[ANN] CellListMap.jl: short-ranged particle pairwise computations (lennard-jones, gravitational, etc) Package Announcements	38	2999	June 24, 2022
Any suggestions for starting doing this kind of neighbour list? Numerics	12	1227	April 16, 2020
Cell list algorithm is slower than double for loop Performance	18	2273	April 2, 2020
Find nearest neighbours among boundary sets General Usage question , geo	33	335	November 19, 2024

Closest distance between two Euclidian grids

Related topics