Faster squared euclidean distance calculation

Hi all. Trying to speed up a small function that multiplies an array by vector before doing a distance calculation.

A small reprex:

using BenchmarkTools
using Distances
using Tullio

function calc1(r::Array{Float64, 1}, Z1::Array{Float64, 2})
    Z1r = Z1' .* sqrt.(r)
    result = pairwise(sqeuclidean, Z1r, dims=2)
    return(result')
end

ncols = 20
r = rand(ncols)
Z1 = rand(100, ncols)

julia> @btime calc1($r, $Z1);

  45.641 μs (4 allocations: 94.83 KiB)

Taking inspiration another thread (Compute pairwise distances (Distances.jl) - #8 by mcabbott), I made a tullio version:

function calc2(r::Array{Float64, 1}, Z1::Array{Float64, 2})
    @tullio Z1r[j,i] := Z1[i,j] * sqrt(r[j])
    result = zeros(size(Z1))
    @tullio result[j,i] := (Z1r[μ,i] - Z1r[μ,j])^2
    return(result)
end

julia> @btime calc2($r, $Z1);

  52.282 μs (28 allocations: 110.86 KiB)

Alas this is a little bit slower. I also note the results are slightly different , is this simply unavoidable ?

julia> calc1(r, Z1) == calc2(r, Z1)
false

julia> isapprox(calc1(r, Z1), calc2(r, Z1))
true

Does anyone see any way to speed things up apart from the obvious things to use Float32 or LoopVectorisation? I feel as though I’m not quite seeing an opportunity to speed things up somewhere.

I think the tullio version should become much faster if you also import LoopVectorization.
Also, := allocates a new array in tullio, so the second line of your calc2 function is redundant, I think.

1 Like

I think that pairwise from Distances is internally a call to a very efficient BLAS library, which by default runs in parallel already. Thus, for that specific type of computation I think it is hard to beat it.

Unless you explore a little bit further the structure of your problem and try to avoid constructing Z1r at all, for example (not sure if its doable, just guessing here).

1 Like

As @IlianPihlajamaa correctly points out, there’s a redundant allocation in calc2, and LoopVectorization is needed for best results. My times:

julia> function calc1(r::Array{Float64, 1}, Z1::Array{Float64, 2})
           Z1r = Z1' .* sqrt.(r)
           result = pairwise(SqEuclidean(), Z1r, dims=2)  # must be SqEuclidean()
           return result'
       end;

julia> function calc2(r::Array{Float64, 1}, Z1::Array{Float64, 2})
           @tullio Z1r[j,i] := Z1[i,j] * sqrt(r[j])
           @tullio result[j,i] := (Z1r[μ,i] - Z1r[μ,j])^2
       end;

julia> @btime calc1($r, $Z1);
  22.542 μs (4 allocations: 94.80 KiB)

julia> @btime calc2($r, $Z1);
  21.333 μs (28 allocations: 95.23 KiB)

julia> using LoopVectorization

julia> function calc2(r::Array{Float64, 1}, Z1::Array{Float64, 2})
           # the macro needs to run after LoopVectorization is loaded
           @tullio Z1r[j,i] := Z1[i,j] * sqrt(r[j])
           @tullio result[j,i] := (Z1r[μ,i] - Z1r[μ,j])^2
       end;

julia> @btime calc2($r, $Z1);
  13.167 μs (27 allocations: 95.20 KiB)

julia> calc1(r, Z1) ≈ calc2(r, Z1)
true

julia> norm(calc1(r, Z1) .- calc2(r, Z1))
1.1767515652473147e-13

julia> ncols = 5_000;  # try a larger size too
julia> r = rand(ncols);
julia> Z1 = rand(100, ncols);

julia> @btime calc1($r, $Z1);
  2.009 ms (5 allocations: 3.89 MiB)

julia> @btime calc2($r, $Z1);
  2.230 ms (98 allocations: 3.90 MiB)

Small numerical errors are pretty much unavoidable. I see Distances.jl now has a parameter SqEuclidean(1e-12) which controls when it should re-calculate very small distances by a more accurate method.

4 Likes

Great answers all! I had not noticed the redundant calculation, thanks @IlianPihlajamaa and @mcabbott

Alas, this is trying to match an R function so I’m stuck with it. However, I’ve been exploring a one line answer by moving the r multiplication - the reason I’m exploring this is because there will be another version with a Z2 and corresponding Z2r, and so if I could only multiplying by r once that would save time. So I was playing with this:

function calc3(r::Array{Float64, 1}, Z1::Array{Float64, 2})
    @tullio result[i,j] := (sqrt(r[μ]) * (Z1[i,μ] - Z1[j,u]))^2
end

But the answers are coming out very wrong and this is quite slow. I think I’m making an algebra mistake but I can’t see it. But otherwise wondering why this is slower?

shouldn’t it be sqrt(x^2 + y^2)? you have ^2 outermost

No its the squared euclidean distance I want - I updated the title to reflect that just now. The solutions above all work except calc3 which is wrong somewhere but I can’t see where.

I think this is the problem with calc3 – it is independently summing over u, which is wrong, and another nested loop, thus slow. You should be able to skip the sqrt too:

julia> calc4(r, Z1) = @tullio _[j,i] := r[μ] * (Z1[i,μ] - Z1[j,μ])^2;

julia> @btime calc4($r, $Z1);
  13.875 μs (26 allocations: 79.45 KiB)

julia> calc1(r, Z1) ≈ calc4(r, Z1)
true
1 Like

Ah great @mcabbott !

Ok I’m missing something subtle in the syntax here - that code looks the same to me in calc3 and calc4? Apart from the sqrt whats different in calc4 ?

The difference is greek mu μ vs latin u. Hard to spot in some fonts!

It’s a bit of a footgun with the “automatically sum over free indices” rule. It’s possible that requiring you to specify what’s reduced over would be safer, if less concise:

julia> using TensorCast

julia> @reduce result[i,j] := sum(μ) (sqrt(r[μ]) * (Z1[i,μ] - Z1[j,u]))^2
ERROR: LoadError: can't find index u on the left
2 Likes

Oh wow I had no idea I’d used two different letters there :laughing: :man_facepalming: :man_facepalming: Totally my fault there, don’t blame the syntax please!

Thank you!

Let Q an m x d matrix of m query points, and C an n x d matrix of n corpus points. Then

D2 = sum(Q.^2,dims=2) .- 2 * Q * C' .+ sum(C.^2,dims=2)'

then D2 is the m x n matrix of the squared Euclidean distances.

Of course, you may precompute the sum of squares if you re-search the same corpus with different queries and vice-versa.

The above formula is the known identity (a - b)^2 = a^2 - 2 a b + b^2

2 Likes