# Faster squared Euclidean distance calculation

Hi all. I'm trying to speed up a small function that multiplies an array by a vector before doing a distance calculation.

A small reprex:

``````
using BenchmarkTools
using Distances
using Tullio

function calc1(r::Array{Float64, 1}, Z1::Array{Float64, 2})
    Z1r = Z1' .* sqrt.(r)
    result = pairwise(sqeuclidean, Z1r, dims=2)
    return result'
end

ncols = 20
r = rand(ncols)
Z1 = rand(100, ncols)

julia> @btime calc1($r, $Z1);
  45.641 μs (4 allocations: 94.83 KiB)
``````

Taking inspiration from another thread (Compute pairwise distances (Distances.jl) - #8 by mcabbott), I made a Tullio version:

``````
function calc2(r::Array{Float64, 1}, Z1::Array{Float64, 2})
    @tullio Z1r[j,i] := Z1[i,j] * sqrt(r[j])
    result = zeros(size(Z1))
    @tullio result[j,i] := (Z1r[μ,i] - Z1r[μ,j])^2
    return result
end

julia> @btime calc2($r, $Z1);
  52.282 μs (28 allocations: 110.86 KiB)
``````

Alas, this is a little bit slower. I also note the results are slightly different; is this simply unavoidable?

``````
julia> calc1(r, Z1) == calc2(r, Z1)
false

julia> isapprox(calc1(r, Z1), calc2(r, Z1))
true
``````

Does anyone see any way to speed things up, apart from the obvious options of using Float32 or LoopVectorization? I feel as though I'm not quite seeing an opportunity to speed things up somewhere.

I think the Tullio version should become much faster if you also import LoopVectorization.
Also, `:=` allocates a new array in Tullio, so the `result = zeros(size(Z1))` line in your calc2 function is redundant, I think.

1 Like

I think that `pairwise` from `Distances` internally calls a very efficient BLAS routine, which by default already runs in parallel. Thus, for that specific type of computation, I think it is hard to beat.

Unless you explore the structure of your problem a little further and try to avoid constructing `Z1r` at all, for example (not sure if it's doable, just guessing here).

1 Like

As @IlianPihlajamaa correctly points out, there’s a redundant allocation in `calc2`, and LoopVectorization is needed for best results. My times:

``````
julia> function calc1(r::Array{Float64, 1}, Z1::Array{Float64, 2})
           Z1r = Z1' .* sqrt.(r)
           result = pairwise(SqEuclidean(), Z1r, dims=2)  # must be SqEuclidean()
           return result'
       end;

julia> function calc2(r::Array{Float64, 1}, Z1::Array{Float64, 2})
           @tullio Z1r[j,i] := Z1[i,j] * sqrt(r[j])
           @tullio result[j,i] := (Z1r[μ,i] - Z1r[μ,j])^2
       end;

julia> @btime calc1($r, $Z1);
  22.542 μs (4 allocations: 94.80 KiB)

julia> @btime calc2($r, $Z1);
  21.333 μs (28 allocations: 95.23 KiB)

julia> using LoopVectorization

julia> function calc2(r::Array{Float64, 1}, Z1::Array{Float64, 2})
           # the macro needs to run after LoopVectorization is loaded
           @tullio Z1r[j,i] := Z1[i,j] * sqrt(r[j])
           @tullio result[j,i] := (Z1r[μ,i] - Z1r[μ,j])^2
       end;

julia> @btime calc2($r, $Z1);
  13.167 μs (27 allocations: 95.20 KiB)

julia> calc1(r, Z1) ≈ calc2(r, Z1)
true

julia> using LinearAlgebra  # for norm

julia> norm(calc1(r, Z1) .- calc2(r, Z1))
1.1767515652473147e-13

julia> ncols = 5_000;  # try a larger size too

julia> r = rand(ncols);

julia> Z1 = rand(100, ncols);

julia> @btime calc1($r, $Z1);
  2.009 ms (5 allocations: 3.89 MiB)

julia> @btime calc2($r, $Z1);
  2.230 ms (98 allocations: 3.90 MiB)
``````

Small numerical errors are pretty much unavoidable. I see Distances.jl now has a threshold parameter, `SqEuclidean(1e-12)`, which controls when it should re-calculate very small distances by a more accurate method.
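For example, something like this (a sketch, assuming a recent Distances.jl; the threshold value is arbitrary):

``````
using Distances

X = rand(20, 100)

# thresh = 0 (the default) never re-computes; a small positive threshold
# makes pairwise re-calculate entries that are tiny relative to the norms,
# which is where cancellation errors show up
D_fast = pairwise(SqEuclidean(), X, dims=2)
D_safe = pairwise(SqEuclidean(1e-12), X, dims=2)
``````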

4 Likes

Great answers all! I had not noticed the redundant allocation, thanks @IlianPihlajamaa and @mcabbott.

Alas, this is trying to match an R function, so I'm stuck with it. However, I've been exploring a one-line answer by moving the `r` multiplication. The reason I'm exploring this is that there will be another version with a `Z2` and a corresponding `Z2r`, so if I could multiply by `r` only once, that would save time. So I was playing with this:

``````
function calc3(r::Array{Float64, 1}, Z1::Array{Float64, 2})
    @tullio result[i,j] := (sqrt(r[μ]) * (Z1[i,μ] - Z1[j,u]))^2
end
``````

But the answers are coming out very wrong, and this is quite slow. I think I'm making an algebra mistake but I can't see it. Otherwise, I'm wondering why this is slower?

Shouldn't it be `sqrt(x^2 + y^2)`? You have the `^2` outermost.

No, it's the squared Euclidean distance I want; I updated the title to reflect that just now. The solutions above all work, except calc3, which is wrong somewhere, but I can't see where.

I think this is the problem with `calc3`: it is independently summing over `u`, which is wrong, and adds another nested loop, which makes it slow. You should be able to skip the `sqrt` too:

``````
julia> calc4(r, Z1) = @tullio _[j,i] := r[μ] * (Z1[i,μ] - Z1[j,μ])^2;

julia> @btime calc4($r, $Z1);
  13.875 μs (26 allocations: 79.45 KiB)

julia> calc1(r, Z1) ≈ calc4(r, Z1)
true
``````
1 Like

Ah great @mcabbott !

Ok, I'm missing something subtle in the syntax here: that code looks the same to me in calc3 and calc4? Apart from the sqrt, what's different in calc4?

The difference is Greek mu `μ` vs. Latin `u`. Hard to spot in some fonts!

It’s a bit of a footgun with the “automatically sum over free indices” rule. It’s possible that requiring you to specify what’s reduced over would be safer, if less concise:

``````
julia> using TensorCast

julia> @reduce result[i,j] := sum(μ) (sqrt(r[μ]) * (Z1[i,μ] - Z1[j,u]))^2
ERROR: LoadError: can't find index u on the left
``````
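(For completeness, `calc3` with `μ` in both places does agree with the others; this is calc4 with the `sqrt` kept, since `sqrt(r[μ])^2 ≈ r[μ]` for non-negative `r`. A sketch, with a hypothetical name:)

``````
using Tullio

# calc3 with the Latin u replaced by Greek μ
calc3_fixed(r, Z1) = @tullio result[i,j] := (sqrt(r[μ]) * (Z1[i,μ] - Z1[j,μ]))^2
``````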
2 Likes

Oh wow, I had no idea I'd used two different letters there. Totally my fault, don't blame the syntax please!

Thank you!

Let `Q` be an `m x d` matrix of `m` query points, and `C` an `n x d` matrix of `n` corpus points. Then

``````
D2 = sum(Q.^2, dims=2) .- 2 * Q * C' .+ sum(C.^2, dims=2)'
``````

and `D2` is the `m x n` matrix of the squared Euclidean distances.

Of course, you may precompute the sums of squares if you re-search the same corpus with different queries, or vice versa.

The above formula is just the well-known identity `(a - b)^2 = a^2 - 2ab + b^2`, applied element-wise.
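A quick numerical sanity check of the formula against Distances.jl (a sketch; note that `pairwise` wants points as columns, hence the transposes):

``````
using Distances

m, n, d = 5, 7, 3
Q = rand(m, d)   # query points, one per row
C = rand(n, d)   # corpus points, one per row

D2 = sum(Q.^2, dims=2) .- 2 * Q * C' .+ sum(C.^2, dims=2)'

# should match the direct pairwise computation up to rounding
D2 ≈ pairwise(SqEuclidean(), Q', C', dims=2)
``````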

2 Likes