# Compute pairwise distances (Distances.jl)

While trying to port some of my code from Python (`NumPy` + `Numba`) to Julia, I noticed that the pairwise distance evaluation is at times slightly slower when using `Distances.jl`. It is a pretty trivial piece of code that I am running. Consider for instance, the pairwise distance evaluation for a set of `10000` points in `3D`

``````using BenchmarkTools, Distances
arr = rand(3,10000)
@btime pairwise(Euclidean(), \$arr, \$arr); # 994.005 ms (7 allocations: 763.09 MiB)
``````

vs

``````import numpy as np
from numba import njit
@njit
def compute_distance(x):
out = np.zeros((x.shape, x.shape))
M = x.shape
N = x.shape
for i in range(M):
for j in range(M):
d = 0.
for k in range(N):
d += (x[i, k] - x[j, k])**2.
out[i, j] = d**(0.5)
return out

%timeit compute_distance(np.random.random((10000,3))) # 897 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

``````

Though, the difference in timings isn’t significant, is `Distances.jl` the fastest implementation to calculate pairwise distances ?

There is something tricky coming from the fact that you are computing the distances between an array an itself using `Distances`:

``````Using pairwise, self:   914.045 ms (7 allocations: 763.09 MiB)
Using pairwise, two vectors:   430.143 ms (7 allocations: 763.09 MiB)
``````

where I did:

``````x = rand(3,10_000)
print(" Using pairwise, self: "); @btime pairwise(Euclidean(), \$x, \$x);

x = rand(3,10_000)
y = rand(3,10_000)
print(" Using pairwise, two vectors: "); @btime pairwise(Euclidean(), \$x, \$y);
``````

Therefore, it seems to be cheaper to copy the vector and then call `pairwise`.

Of course if the the distances between the points of one array are needed, you do not need to compute `n^2`, distances, but only `n(n-1)/2`, and there are better implementations using the appropriate loops for that.

If you are interested in the distances between two sets of points, it `Distances` does a good job, particularly if the arrays will be loaded contiguous in memory, because it uses the most of loop vectorizations.

Nevertheless, if the distances to be computed are not within all the points and are not contiguous in memory, I think the best approach is to use a loop computing distances within points defined as `StaticArrays`, because loop vectorizations do not bring the same benefit.

3 Likes

Here’s something that on my box is about twice as fast that uses just a few lines of code and StaticArrays, like @lmiq mentioned.

``````using BenchmarkTools, Distances, StaticArrays

function mydistmat(ar::Vector{SVector{D, Float64}}) where{D}
out = zeros(length(ar), length(ar))
for k in 1:length(ar)
@inbounds out[k,k] = 0.0
for j in 1:(k-1)
@inbounds out[j,k] = norm(ar[j] - ar[k])
end
end
Symmetric(out)
end

const arr = rand(3,10000)
const arr_cols = map(x->SVector{3, Float64}(x), eachcol(arr))

print("Timing for Distances (two args):") # 1.1 s
@btime pairwise(Euclidean(), \$arr, \$arr);

print("Timing for Distances (one arg):") # 1.5 s (?)
@btime pairwise(Euclidean(), \$arr);

print("Timing for Distances (columns):") # 1.03 s
@btime pairwise(Euclidean(), \$arr_cols);

print("Timing with StaticArrays:") # 0.407 s
@btime mydistmat(\$arr_cols);
``````

And there are obviously a lot of optimizations that `mydistmat` doesn’t even have, like any hand-coded SIMD/threading/etc. This is obviously less composable in every way than what `Distances.jl` offers, and I would guess that that’s part of why it is a little slower.

3 Likes

Thanks!

Thank you @cgeoga !

Just note that the speedup is because you are (correctly) doing the loops only on non-repeated computations in the case that the distances are within the elements of a single vector of points, and that cannot be really compared to calling `pairwise` with two arrays (`pairwise` called with a single argument computes the distances between the rows and columns of the matrix):

``````Using static arrays two vec:  895.720 ms (2 allocations: 762.94 MiB)
Using static arrays self:  539.950 ms (2 allocations: 762.94 MiB)
``````
Code
``````using StaticArrays

d(x,y) = sqrt((x-y)^2+ (x-y)^2+ (x-y)^2)

function mydistmat(x,y)
n = length(x)
out = zeros(n,n)
for i in 1:n
for j in 1:n
out[i,j] = d(x[i],y[j])
end
end
out
end

x = rand(SVector{3,Float64},10_000)
y = rand(SVector{3,Float64},10_000)
print("Using static arrays two vec:"); @btime mydistmat(\$x,\$y)

function mydistmat_self(x)
n = length(x)
out = zeros(n,n)
for i in 1:n-1
for j in i+1:n
out[i,j] = d(x[i],x[j])
out[j,i] = out[i,j]
end
end
out
end

x = rand(SVector{3,Float64},10_000)
print("Using static arrays self:"); @btime mydistmat_self(\$x)

``````

Also there is another important part of the speedup that comes from (also correctly) filling the output matrix by columns and returning the symmetric view instead of the dense array:

``````Using static arrays self, cols:  318.673 ms (2 allocations: 762.94 MiB
``````
Code
``````function mydistmat_self_symm(x)
n = length(x)
out = zeros(n,n)
for i in 2:n
for j in 1:(i-1)
out[j,i] = d(x[i],x[j])
end
end
Symmetric(out)
end

x = rand(SVector{3,Float64},10_000)
print("Using static arrays self, cols:"); @btime mydistmat_self_symm(\$x)

``````

1 Like

Some attempts… note that I’ve made these all write into an existing array, as the times are otherwise very variable, for me, I guess that allocating a whole GB each run stresses my computer:

``````julia> out = zeros(10^4, 10^4);

julia> @btime pairwise!(\$out, Euclidean(), \$arr, \$arr, dims=2); # original, but in-place
675.518 ms (4 allocations: 156.41 KiB)

julia> @btime pairwise!(\$out, Euclidean(), \$arr, \$arr2, dims=2); # with a copy of array
311.783 ms (4 allocations: 156.41 KiB)

julia> using Tullio

julia> out ≈ @tullio out2[i,j] := (arr[μ,i] - arr[μ,j])^2 |> sqrt
true

julia> @btime @tullio \$out[i,j] = (\$arr[μ,i] - \$arr[μ,j])^2 |> sqrt;
171.028 ms (50 allocations: 4.81 KiB)

julia> using LoopVectorization

julia> @btime @tullio \$out[i,j] = (\$arr[μ,i] - \$arr[μ,j])^2 |> sqrt;
90.907 ms (50 allocations: 4.81 KiB)

julia> @btime @tullio \$out[i,j] = sqrt((\$arr[1,i] - \$arr[1,j])^2 + (\$arr[2,i] - \$arr[2,j])^2 + (\$arr[3,i] - \$arr[3,j])^2);
79.451 ms (44 allocations: 4.86 KiB)
``````

This is multi-threaded (2 cores) which might count as cheating.

Also, note that transposing the problem makes it faster, at least when using LoopVectorization. It likes having a dimension longer than 3 to vectorise along:

``````julia> arr_t = permutedims(arr);

julia> @btime @tullio \$out[i,j] = (\$arr_t[i,μ] - \$arr_t[j,μ])^2 |> sqrt;
68.236 ms (50 allocations: 4.81 KiB)
``````

What these versions can’t do is iterate only over half the space. For comparison, in-place versions of @lmiq’s SVector code above, on my computer:

``````julia> function mydistmat_self_symm!(out, x)
n = length(x)
for i in 1:n # faster without @inbounds, weirdly?
for j in 1:i
out[j,i] = d3(x[i],x[j])
end
end
Symmetric(out)
end;

julia> d3(x,y) = sqrt((x-y)^2+ (x-y)^2+ (x-y)^2); # called d above

julia> x = rand(SVector{3,Float64},10_000);

julia> mydistmat_self_symm!(out, x) ≈ pairwise(Euclidean(), reduce(hcat,x), reduce(hcat,x), dims=2)
true

julia> @btime mydistmat_self_symm!(\$out, \$x);
87.738 ms (0 allocations: 0 bytes)

54.583 ms (24 allocations: 5.22 KiB)

julia> @btime mydistmat_self!(\$out, \$x); # fills the whole array, no Symmetric
176.911 ms (0 allocations: 0 bytes)

julia> @btime mydistmat_self_threads!(\$out, \$x);  # here closer to a factor of 2
105.831 ms (21 allocations: 5.12 KiB)
``````

It’s very odd that these get slower for me with `@inbounds`, for example:

``````julia> @btime mydistmat_self_symm!(\$out, \$x);
124.784 ms (0 allocations: 0 bytes)
``````
3 Likes

Thanks for the thorough reply. I was about to experiment with TensorOperations/Tullio.

Thanks for such a detailed response. Appreciate your help! I will test these on my laptop.