Compute pairwise distances (Distances.jl)

While trying to port some of my code from Python (NumPy + Numba) to Julia, I noticed that the pairwise distance evaluation is at times slightly slower when using Distances.jl. The code I am running is pretty trivial. Consider, for instance, the pairwise distance evaluation for a set of 10000 points in 3D:

using BenchmarkTools, Distances
arr = rand(3,10000)
@btime pairwise(Euclidean(), $arr, $arr); # 994.005 ms (7 allocations: 763.09 MiB)

vs

import numpy as np
from numba import njit
@njit
def compute_distance(x):
    out = np.zeros((x.shape[0], x.shape[0]))
    M = x.shape[0]
    N = x.shape[1]
    for i in range(M):
        for j in range(M):
            d = 0.
            for k in range(N):
                d += (x[i, k] - x[j, k])**2.
            out[i, j] = d**(0.5)
    return out

%timeit compute_distance(np.random.random((10000,3))) # 897 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Though the difference in timings isn't significant, is Distances.jl the fastest implementation for computing pairwise distances?

There is something tricky coming from the fact that you are computing the distances between an array and itself using Distances:

Using pairwise, self:   914.045 ms (7 allocations: 763.09 MiB)
Using pairwise, two vectors:   430.143 ms (7 allocations: 763.09 MiB)

where I did:

x = rand(3,10_000)
print(" Using pairwise, self: "); @btime pairwise(Euclidean(), $x, $x);

x = rand(3,10_000)
y = rand(3,10_000)
print(" Using pairwise, two vectors: "); @btime pairwise(Euclidean(), $x, $y);

Therefore, it seems to be cheaper to copy the array and then call pairwise with two distinct arguments.
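
For illustration, a minimal sketch of that workaround (my own snippet, not a benchmark from this thread), passing a copy as the second argument; dims=2 is written out so that columns are treated as points:

using BenchmarkTools, Distances

x = rand(3, 10_000)
y = copy(x)                                     # same points, but a distinct array
@btime pairwise(Euclidean(), $x, $x, dims=2);   # same array twice
@btime pairwise(Euclidean(), $x, $y, dims=2);   # array and its copy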

Of course, if the distances between the points of a single array are needed, you do not need to compute n^2 distances, but only n(n-1)/2, and there are better implementations using the appropriate loops for that.

If you are interested in the distances between two sets of points, Distances.jl does a good job, particularly if the arrays are laid out contiguously in memory, because it makes the most of loop vectorization.
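
To illustrate the layout point (a sketch of my own, not a benchmark from this thread): with a 3×N matrix and dims=2 each point's coordinates are contiguous in memory, while with the transposed N×3 layout and dims=1 they are strided:

using BenchmarkTools, Distances

X_cols = rand(3, 10_000)        # points are the columns: coordinates of a point are contiguous
X_rows = permutedims(X_cols)    # points are the rows: coordinates of a point are strided

@btime pairwise(Euclidean(), $X_cols, $X_cols, dims=2);
@btime pairwise(Euclidean(), $X_rows, $X_rows, dims=1);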

Nevertheless, if the distances to be computed are not between all pairs of points, or the points are not laid out contiguously in memory, I think the best approach is a loop computing distances between points stored as StaticArrays, because loop vectorization does not bring the same benefit there.


Here’s something that on my box is about twice as fast and uses just a few lines of code plus StaticArrays, like @lmiq mentioned.

using BenchmarkTools, Distances, StaticArrays, LinearAlgebra  # LinearAlgebra for norm and Symmetric

function mydistmat(ar::Vector{SVector{D, Float64}}) where{D}
  out = zeros(length(ar), length(ar))
  for k in 1:length(ar)
    @inbounds out[k,k] = 0.0
    for j in 1:(k-1) 
      @inbounds out[j,k] = norm(ar[j] - ar[k])
    end
  end
  Symmetric(out)
end

const arr = rand(3,10000)
const arr_cols = map(x->SVector{3, Float64}(x), eachcol(arr))

print("Timing for Distances (two args):") # 1.1 s
@btime pairwise(Euclidean(), $arr, $arr);

print("Timing for Distances (one arg):") # 1.5 s (?)
@btime pairwise(Euclidean(), $arr);

print("Timing for Distances (columns):") # 1.03 s
@btime pairwise(Euclidean(), $arr_cols);

print("Timing with StaticArrays:") # 0.407 s
@btime mydistmat($arr_cols);

And there are obviously a lot of optimizations that mydistmat doesn’t even have, like hand-coded SIMD, threading, etc. It is also less composable in every way than what Distances.jl offers, and I would guess that that generality is part of why Distances.jl is a little slower here.
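
For illustration, a naive hand-threaded variant might look like the sketch below (my own hypothetical code, not benchmarked here); it just puts Threads.@threads on the outer loop, so the triangular work is not evenly balanced across threads:

using LinearAlgebra, StaticArrays, Base.Threads

function mydistmat_threaded(ar::Vector{SVector{D, Float64}}) where {D}
  n = length(ar)
  out = zeros(n, n)
  @threads for k in 1:n
    @inbounds for j in 1:(k-1)   # each task fills whole columns, so there are no write conflicts
      out[j, k] = norm(ar[j] - ar[k])
    end
  end
  Symmetric(out)
end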


Thanks!

Thank you @cgeoga!

Just note that the speedup comes from (correctly) doing the loops only over the non-repeated computations, which applies when the distances are between the elements of a single vector of points; that cannot really be compared to calling pairwise with two arrays (pairwise called with a single matrix computes the distances among its columns, or rows, depending on dims):

Using static arrays two vec:  895.720 ms (2 allocations: 762.94 MiB)
Using static arrays self:  539.950 ms (2 allocations: 762.94 MiB)
Code
using BenchmarkTools, StaticArrays, LinearAlgebra  # LinearAlgebra for Symmetric

d(x,y) = sqrt((x[1]-y[1])^2+ (x[2]-y[2])^2+ (x[3]-y[3])^2)

function mydistmat(x,y)
  n = length(x)
  out = zeros(n,n)
  for i in 1:n
    for j in 1:n
      out[i,j] = d(x[i],y[j])
    end
  end
  out
end

x = rand(SVector{3,Float64},10_000)
y = rand(SVector{3,Float64},10_000)
print("Using static arrays two vec:"); @btime mydistmat($x,$y)

function mydistmat_self(x)
  n = length(x)
  out = zeros(n,n)
  for i in 1:n-1
    for j in i+1:n
      out[i,j] = d(x[i],x[j])
      out[j,i] = out[i,j]
    end
  end
  out
end

x = rand(SVector{3,Float64},10_000)
print("Using static arrays self:"); @btime mydistmat_self($x)

Also, another important part of the speedup comes from (also correctly) filling the output matrix by columns and returning the symmetric view instead of the dense array:

Using static arrays self, cols:  318.673 ms (2 allocations: 762.94 MiB)
Code
function mydistmat_self_symm(x)
  n = length(x)
  out = zeros(n,n)
  for i in 2:n
    for j in 1:(i-1)
      out[j,i] = d(x[i],x[j])
    end
  end
  Symmetric(out)
end

x = rand(SVector{3,Float64},10_000)
print("Using static arrays self, cols:"); @btime mydistmat_self_symm($x)

(Adding @inbounds flags adds a final bit of performance.)
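
For example, a sketch of where the @inbounds would go in mydistmat_self_symm (safe here because j < i ≤ n and out is n×n):

for i in 2:n
  @inbounds for j in 1:(i-1)
    out[j,i] = d(x[i],x[j])
  end
end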


Some attempts… note that I’ve made these all write into an existing array, as the times are otherwise very variable for me; I guess that allocating a whole GB on each run stresses my computer:

julia> out = zeros(10^4, 10^4);

julia> @btime pairwise!($out, Euclidean(), $arr, $arr, dims=2); # original, but in-place
  675.518 ms (4 allocations: 156.41 KiB)

julia> @btime pairwise!($out, Euclidean(), $arr, $arr2, dims=2); # with a copy of array
  311.783 ms (4 allocations: 156.41 KiB)

julia> using Tullio

julia> out ≈ @tullio out2[i,j] := (arr[μ,i] - arr[μ,j])^2 |> sqrt
true

julia> @btime @tullio $out[i,j] = ($arr[μ,i] - $arr[μ,j])^2 |> sqrt;
  171.028 ms (50 allocations: 4.81 KiB)

julia> using LoopVectorization

julia> @btime @tullio $out[i,j] = ($arr[μ,i] - $arr[μ,j])^2 |> sqrt;
  90.907 ms (50 allocations: 4.81 KiB)

julia> @btime @tullio $out[i,j] = sqrt(($arr[1,i] - $arr[1,j])^2 + ($arr[2,i] - $arr[2,j])^2 + ($arr[3,i] - $arr[3,j])^2);
  79.451 ms (44 allocations: 4.86 KiB)

This is multi-threaded (2 cores), which might count as cheating.

Also, note that transposing the problem makes it faster, at least when using LoopVectorization. It likes having a dimension longer than 3 to vectorise along:

julia> arr_t = permutedims(arr);

julia> @btime @tullio $out[i,j] = ($arr_t[i,μ] - $arr_t[j,μ])^2 |> sqrt;
  68.236 ms (50 allocations: 4.81 KiB)

What these versions can’t do is iterate only over half the space. For comparison, in-place versions of @lmiq’s SVector code above, on my computer:

julia> function mydistmat_self_symm!(out, x)
         n = length(x)
         for i in 1:n # faster without @inbounds, weirdly?
           for j in 1:i
             out[j,i] = d3(x[i],x[j])
           end
         end
         Symmetric(out)
       end;

julia> d3(x,y) = sqrt((x[1]-y[1])^2+ (x[2]-y[2])^2+ (x[3]-y[3])^2); # called d above

julia> x = rand(SVector{3,Float64},10_000);

julia> mydistmat_self_symm!(out, x) ≈ pairwise(Euclidean(), reduce(hcat,x), reduce(hcat,x), dims=2)
true

julia> @btime mydistmat_self_symm!($out, $x);
  87.738 ms (0 allocations: 0 bytes)

julia> @btime mydistmat_self_symm_threads!($out, $x); # version with Threads.@threads, but it's not dividing the triangle well
  54.583 ms (24 allocations: 5.22 KiB)

julia> @btime mydistmat_self!($out, $x); # fills the whole array, no Symmetric
  176.911 ms (0 allocations: 0 bytes)

julia> @btime mydistmat_self_threads!($out, $x);  # here closer to a factor of 2
  105.831 ms (21 allocations: 5.12 KiB)

It’s very odd that these get slower for me with @inbounds, for example:

julia> @btime mydistmat_self_symm!($out, $x);
  124.784 ms (0 allocations: 0 bytes)
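
Regarding the note above that the Threads.@threads version doesn’t divide the triangle well: one hypothetical way to balance the work (a sketch of my own, not the version timed above) is to pair a short column with a long one, so each task does roughly the same number of distance evaluations:

using Base.Threads, LinearAlgebra, StaticArrays

function mydistmat_self_symm_balanced!(out, x)
  n = length(x)
  @threads for k in 1:cld(n, 2)
    cols = (k == n - k + 1) ? (k,) : (k, n - k + 1)  # pair column k with column n-k+1
    for i in cols, j in 1:i
      out[j,i] = d3(x[i], x[j])    # d3 as defined above
    end
  end
  Symmetric(out)
end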

Thanks for the thorough reply. I was about to experiment with TensorOperations/Tullio.

Thanks for such a detailed response. Appreciate your help! I will test these on my laptop.

In the case of Python, scipy.spatial.distance_matrix is normally much faster than the code given in the question (SciPy is the scientific-computing extension of NumPy)… although I have never compared it against the Numba version.