# Accelerating the calculation of the minimum distance between points, while tracking which point attains it

I am writing a package (ComplexMixtures) that needs to compute the minimum distance between sets of atomic coordinates in space. I am already using cell lists, but at the very end one is left with the problem of computing the minimum distance between one atom and a set of other atoms. The inner function is simple:

```julia
function minimumdistance(x, y)
    jatom = 0
    dmin = +Inf
    @inbounds for j in axes(y, 2)
        d = dsquare(x, @view(y[1:3, j]))
        if d < dmin
            jatom = j
            dmin = d
        end
    end
    return sqrt(dmin), jatom
end
```

where `x` is a vector of dimension 3 (the coordinates of one atom) and `y` is an array of dimension `(3,N)`, the set of coordinates of the other atoms.

Note that I need not only the minimum distance between `x` and the set of atoms in `y`, but also which atom of `y` corresponds to that minimum distance.

I have implemented a version of this function using StaticArrays, and that is nice: the code becomes about 30% faster.

However, if I try to use LoopVectorization, it seems that I could get a much greater speedup. I say "it seems" because I am not able to get it, due to the `if d < dmin ...` statement. Without that conditional, the loop-vectorized code becomes 6 times faster than the fastest alternative. With the conditional, however, I get nothing from loop vectorization; I simply get:

```julia
if d < dmin
    #= /home/leandro/Drive/Work/JuliaPlay/mind.jl:40 =#
    jatom = j
    #= /home/leandro/Drive/Work/JuliaPlay/mind.jl:41 =#
    dmin = d
end
```

If anyone has any idea on how to accelerate this code, using LoopVectorization or anything else, please share.

The code below is a complete running implementation of the options I have been exploring so far. I have taken care to generate the set `y` non-contiguous in memory, because this is more realistic relative to my actual case (copying the data to contiguous vectors is not an option, as far as I have tested).

In the code below I have commented out the conditional that impairs the use of loop vectorization, and I get this promising benchmark:

```
  17.279 ns (0 allocations: 0 bytes)  # common arrays
  12.787 ns (0 allocations: 0 bytes)  # static vectors
   2.027 ns (0 allocations: 0 bytes)  # vectorized, but without the conditional
```

If I comment out the conditional in the second function I get the same result as in the third, very likely because the loop then gets vectorized automatically by the compiler (it is certainly not because two assignments and one comparison of numbers take 6 times as long as computing the squared distance of two vectors).

Thus, it seems that if that loop could take advantage of vectorization with the conditional in place, a significant speedup would be possible.

```julia
using BenchmarkTools
using StaticArrays
using LoopVectorization

@inline dsquare(x, y) = (x[1]-y[1])^2 + (x[2]-y[2])^2 + (x[3]-y[3])^2

function minimumdistance(x, y)
    jatom = 0
    dmin = +Inf
    @inbounds for j in axes(y, 2)
        d = dsquare(x, @view(y[1:3, j]))
        if d < dmin
            jatom = j
            dmin = d
        end
    end
    return sqrt(dmin), jatom
end

function minimumdistance_static(x, y)
    jatom = 0
    dmin = +Inf
    @inbounds for j in eachindex(y)
        d = dsquare(x, y[j])
        if d < dmin
            jatom = j
            dmin = d
        end
    end
    return sqrt(dmin), jatom
end

function minimumdistance_loopvec(x, y)
    jatom = 0
    dmin = +Inf
    @avx for j in eachindex(y)
        d = dsquare(x, y[j])
        #if d < dmin
        #    jatom = j
        #    dmin = d
        #end
    end
    return sqrt(dmin), jatom
end

# "All" 1000 atomic coordinates
x = rand(3); yall = rand(3, 1000);

# Random subsets (such that the coordinates are not contiguous)
N = 5
ylist = rand(1:1000, N)
y = @view(yall[1:3, ylist])

# Alternatively, we could have arrays of SVectors:
sx = SVector(x...)
syall = [SVector(yall[1:3, i]...) for i in 1:size(yall, 2)]

# Subset of the same N atoms of y
sy = @view(syall[ylist])

@btime minimumdistance($x, $y)
@btime minimumdistance_static($sx, $sy)
@btime minimumdistance_loopvec($sx, $sy)
```
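As an aside, the same computation can be written as a compact correctness baseline with `findmin(f, itr)` (available in Julia ≥ 1.7). It will not win any benchmark, but it is handy for checking the optimized variants. This is just a sketch; `minimumdistance_ref` and `dsq` are names of my own (`dsq` is the same squared distance as `dsquare` above):

```julia
# Hypothetical reference implementation via findmin(f, itr) (Julia >= 1.7).
dsq(x, y) = (x[1]-y[1])^2 + (x[2]-y[2])^2 + (x[3]-y[3])^2

function minimumdistance_ref(x, y)
    # findmin over the column indices returns (minimum f value, minimizing j)
    d2, jatom = findmin(j -> dsq(x, @view(y[:, j])), axes(y, 2))
    return sqrt(d2), jatom
end
```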

You are getting tricked by the compiler. Try benchmarking `minimumdistance_loopvec` with `syall`. The loop is getting elided, because it doesn't affect the outcome of the computation. It is basically the same as writing

```julia
function minimumdistance_loopvec(x, y)
    jatom = 0
    dmin = +Inf
    return sqrt(dmin), jatom
end
```

You can see that

```julia
julia> @btime sqrt(Inf)
  2.310 ns (0 allocations: 0 bytes)
Inf
```

You need to get the loop to calculate something to get a real benchmark.


If you change your function to this, you get a better idea about the possible performance:

```julia
function minimumdistance_loopvec(x, y)
    jatom = 0
    dmin = +Inf
    @avx for j in eachindex(y)
        d = dsquare(x, y[j])
        dmin = min(dmin, d)
    end
    return sqrt(dmin), jatom
end
```

This will give you the wrong index, but actually runs the inner loop.

Benchmark:

```julia
julia> @btime minimumdistance_loopvec($sx, $sy)
  12.926 ns (0 allocations: 0 bytes)
```

Very close to the other timings with StaticArrays.
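One way to also recover the index while keeping the hot loop branch-free is a two-pass scheme: a pure `min` reduction first (which is what SIMD handles well), then a second cheap pass that finds which `j` attained the minimum. A sketch with plain arrays, using hypothetical names of my own (`dsq` is the same squared distance as `dsquare`):

```julia
dsq(x, y, j) = (x[1]-y[1,j])^2 + (x[2]-y[2,j])^2 + (x[3]-y[3,j])^2

function minimumdistance_twopass(x, y)
    # Pass 1: branch-free min reduction, amenable to vectorization.
    dmin = Inf
    @inbounds @simd for j in axes(y, 2)
        dmin = min(dmin, dsq(x, y, j))
    end
    # Pass 2: scalar scan to recover which j attained the minimum.
    jatom = 0
    @inbounds for j in axes(y, 2)
        if dsq(x, y, j) == dmin
            jatom = j
            break
        end
    end
    return sqrt(dmin), jatom
end
```

Whether the second pass pays for itself depends on `N`; for very small subsets the single branchy loop may still win.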


I think taking a view is slowing things down:

```julia
julia> @btime minimumdistance($x, $y)
  22.236 ns (0 allocations: 0 bytes)
(0.26799890077104815, 1)

julia> @btime minimumdistance_static($sx, $sy)
  14.224 ns (0 allocations: 0 bytes)
(0.26799890077104815, 1)

julia> sy_get = syall[ylist];

julia> @btime minimumdistance_static($sx, $sy_get)
  9.488 ns (0 allocations: 0 bytes)
(0.3771984138347722, 4)

julia> y_get = yall[1:3, ylist];

julia> @btime minimumdistance($x, $y_get)
  13.722 ns (0 allocations: 0 bytes)
(0.3771984138347722, 4)
```

Edit: I thought I had an `@avx` version, but was mistaken.

`LoopVectorization.check_args(sy) === false`, so any use of it with that input is going to give the fallback which IIRC is just `@inbounds @fastmath`.

To get more out of it, I think you may want `permutedims(y)` so that it might vectorise along the `j` index.
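The layout change might look like the sketch below. With `y` stored as `(N, 3)`, consecutive `j` iterations touch contiguous memory in each column, which is the access pattern LoopVectorization prefers. `minimumdistance_t` is a hypothetical name of mine, and the loop is written with plain `@inbounds` so the sketch runs without the package; the body is what `@avx` would consume:

```julia
function minimumdistance_t(x, yt)   # yt has shape (N, 3)
    jatom = 0
    dmin = Inf
    @inbounds for j in axes(yt, 1)
        # Unrolled squared distance; each yt[j, k] access is stride-1 in j.
        d = (yt[j, 1] - x[1])^2 + (yt[j, 2] - x[2])^2 + (yt[j, 3] - x[3])^2
        if d < dmin
            jatom = j
            dmin = d
        end
    end
    return sqrt(dmin), jatom
end

y = rand(3, 1000)
yt = permutedims(y)   # one (N, 3) copy; pays off if reused for many atoms x
```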

BTW, this is more complicated than necessary. Try this:

```julia
syall = reinterpret(SVector{3, eltype(yall)}, yall);
```

It's 150,000 times faster, and requires zero allocations. (Well, I guess it basically takes zero time, more or less.)
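A quick check that the reinterpreted array really aliases the original data and involves no copy (a sketch assuming `Float64` coordinates; here I reinterpret `vec(yall)` to get a plain vector of `SVector`s rather than the 1×N matrix the matrix form produces):

```julia
using StaticArrays

yall = rand(3, 4)
syall = reinterpret(SVector{3, Float64}, vec(yall))  # length-4 vector, no copy
yall[1, 2] = -1.0
@assert syall[2][1] == -1.0   # mutation through yall is visible in syall
```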

```julia
jatom = ifelse(d < dmin, j, jatom)
dmin = min(dmin, d)
```

This could be replaced with

```julia
(dmin, jatom) = ifelse(d < dmin, (d, j), (dmin, jatom))
```

No performance difference, though.
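For completeness, the full loop with the tuple `ifelse` might look like this. A sketch with hypothetical names (`minimumdistance_ifelse`, `dsq`); `y` is assumed to be a vector of 3-element points such as `SVector`s or tuples:

```julia
dsq(p, q) = (p[1]-q[1])^2 + (p[2]-q[2])^2 + (p[3]-q[3])^2

function minimumdistance_ifelse(x, y)
    jatom = 0
    dmin = Inf
    @inbounds for j in eachindex(y)
        d = dsq(x, y[j])
        # single branch-free update of both the minimum and its index
        (dmin, jatom) = ifelse(d < dmin, (d, j), (dmin, jatom))
    end
    return sqrt(dmin), jatom
end
```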


There is a trick which can remove the conditional, but it makes everything slower:

```julia
function minimumdistance5(x, y)
    jatom = 0
    dmin = 1000000000.0
    @avx for j in eachindex(y)
        d = dsquare(x, y[j])
        b = Int(d < dmin)
        jatom = b * j + (1 - b) * jatom
        dmin = b * d + (1 - b) * dmin
    end
    return sqrt(dmin), jatom
end
```

There were issues with the multiplication of `Inf` and `Bool`, so I made some ugly changes. It can be done better, but since it runs slower it doesn't matter anyway.

But it looks like changing `@inbounds` to `@inbounds @simd` in `minimumdistance_static` gives a small increase; in my measurements I see a speedup from 13.463 ns to 12.672 ns.
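The change described is just the following (a sketch with hypothetical names; `@simd` only licenses reassociating the reduction, the compiler still decides whether the branch can be vectorized):

```julia
dsq(p, q) = (p[1]-q[1])^2 + (p[2]-q[2])^2 + (p[3]-q[3])^2

function minimumdistance_static_simd(x, y)
    jatom = 0
    dmin = Inf
    @inbounds @simd for j in eachindex(y)   # @simd added here
        d = dsq(x, y[j])
        if d < dmin
            jatom = j
            dmin = d
        end
    end
    return sqrt(dmin), jatom
end
```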

This should work:

```julia
b = d < dmin
jatom = b * j + !b * jatom
dmin = b * d + !b * dmin
```
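The reason the `Bool` version tolerates `dmin = Inf` while the `Int` version needed a finite sentinel: in Julia, `false` acts as a "strong zero" under multiplication, whereas an integer `0` first promotes to `0.0`:

```julia
@assert isnan(0 * Inf)        # Int 0 promotes to 0.0, and 0.0 * Inf is NaN
@assert false * Inf === 0.0   # Bool false is a strong zero: exactly 0.0
@assert true * Inf === Inf
```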

But if you just want a really big float that isnâ€™t `Inf`, you can use

```julia
julia> prevfloat(Inf)
1.7976931348623157e308
```

`prevfloat` is nice! It's good to know, thank you very much.

But anyway, this approach is too slow.

Of course… I hate myself…

Certainly, but that is on purpose. It reflects the kind of data that reaches this function in the package. I cannot avoid feeding it a non-contiguous array there (unless I were willing to copy a lot of data before calling it, but my tests show that is not an option).

Good to know about that (but that is only how I am generating the data in this MWE; the actual data is generated otherwise).