Non-allocating distance function

I have recently realised that isapprox(a, b) is, for the most part, equivalent to norm(a - b) < tol. In particular, when a and b are vectors, this implementation allocates an intermediate vector for the difference.

julia> a, b = rand(10), rand(10)
julia> @benchmark isapprox(a, b)
BenchmarkTools.Trial: 10000 samples with 916 evaluations.
 Range (min … max):  116.983 ns … 661.477 ns  ┊ GC (min … max): 0.00% … 78.26%
 Time  (median):     118.547 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   122.136 ns ±  25.514 ns  ┊ GC (mean ± σ):  1.02% ±  4.08%
 Memory estimate: 144 bytes, allocs estimate: 1.

I was wondering whether it would be relevant to add a non-allocating distance function to LinearAlgebra, probably along the following lines.

function dist(a::AbstractArray{T1}, b::AbstractArray{T2}) where {T1<:Number, T2<:Number}
    size(a) == size(b) || throw(DimensionMismatch("a and b must have the same size"))
    d = zero(promote_type(T1, T2))
    @inbounds @simd for i in eachindex(a)
        d += abs2(a[i] - b[i])
    end
    return sqrt(d)
end

julia> @benchmark dist(a, b)
BenchmarkTools.Trial: 10000 samples with 997 evaluations.
 Range (min … max):  17.107 ns …  1.151 μs  ┊ GC (min … max): 0.00% … 97.25%
 Time  (median):     17.825 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.304 ns ± 15.477 ns  ┊ GC (mean ± σ):  1.02% ±  1.37%
 Memory estimate: 16 bytes, allocs estimate: 1.

Non-allocating versions of norm(a - b) and opnorm(A - B) for exponents other than two could also be added in similar ways.
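For instance, a p-exponent version could look something like the following sketch (distp is a hypothetical name, not an existing LinearAlgebra function, and like the abs2 loop above it does not guard against intermediate overflow):

```julia
# Sketch of a non-allocating p-exponent distance. `distp` is a hypothetical
# name, and this naive accumulation can overflow for large entries,
# unlike LinearAlgebra.norm.
function distp(a::AbstractArray, b::AbstractArray, p::Real)
    size(a) == size(b) || throw(DimensionMismatch("a and b must have the same size"))
    d = float(zero(promote_type(eltype(a), eltype(b))))
    for i in eachindex(a, b)
        @inbounds d += abs(a[i] - b[i])^p
    end
    return d^(1/p)
end

distp([3.0, 0.0], [0.0, 4.0], 2)  # ≈ 5.0, same as norm(a - b)
```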

I’ve debated whether a “norm-of-difference” function would be worthwhile before. Here’s a quick benchmark of a handwritten method versus using generators:

function norm2diff1(x,y)
	s::eltype(x) = 0
	for i in eachindex(x,y)
		@inbounds s += abs2(x[i]-y[i])
	end
	return sqrt(s)
end

function norm2diff1a(x,y)
	s::eltype(x) = 0
	@simd for i in eachindex(x,y)
		@inbounds s += abs2(x[i]-y[i])
	end
	return sqrt(s)
end

norm2diff2(x,y) = sqrt(sum(abs2(xx-yy) for (xx,yy) in zip(x,y)))
norm2diff3(x,y) = norm(xx-yy for (xx,yy) in zip(x,y))

using BenchmarkTools

x = randn(100); y = randn(100);

@btime norm2diff1($x,$y)
@btime norm2diff1a($x,$y)
@btime norm2diff2($x,$y)
@btime norm2diff3($x,$y)


julia> @btime norm2diff1($x,$y)
  81.828 ns (0 allocations: 0 bytes)

julia> @btime norm2diff1a($x,$y)
  18.054 ns (0 allocations: 0 bytes)

julia> @btime norm2diff2($x,$y)
  84.703 ns (0 allocations: 0 bytes)

julia> @btime norm2diff3($x,$y)
  267.925 ns (0 allocations: 0 bytes)

So while norm over a generator is slower than the others, this is entirely due to the internal mechanisms of norm (preventing overflow, etc.), as a generator-based naive implementation performs identically to a hand-written naive version.

The only standout here is the @simd version – although it looks like the current implementation of norm could support @simd as well. Perhaps that PR can be made, but a specialized version doesn’t appear necessary to achieve this benefit.

So the takeaway is that there doesn’t appear to be any real performance benefit to a specialized “norm-of-difference” function for element-wise inputs over generator-based alternatives. As for opnorm, this function already allocates so I don’t think there are any significant savings to be had for a difference version.

Why? When is isapprox performance critical? Or are you thinking of other applications (which?) in which this is a critical operation?

The only applications that I can think of involve lots of tiny arrays, e.g. pairwise distances between lots of 3-component vectors, but in such cases you are better off using StaticArrays anyway (in which norm(a - b) is already non-allocating).
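In that small fixed-size case, the difference and its norm never touch the heap (a minimal sketch, assuming the StaticArrays package is available):

```julia
using StaticArrays, LinearAlgebra

a = SVector(1.0, 2.0, 3.0)
b = SVector(4.0, 5.0, 6.0)

# a - b produces another stack-allocated SVector, so norm(a - b)
# involves no heap allocation for fixed-size vectors like these.
norm(a - b)  # ≈ sqrt(27)
```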

Note that all of these functions differ from LinearAlgebra.norm(x - y) in that they can suffer spurious overflow for large arguments. e.g.

julia> x, y = [1e300], [-1e300];

julia> norm(x - y)
2.0e300

julia> sqrt(sum(abs2(xx-yy) for (xx,yy) in zip(x,y)))
Inf

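One way to keep the loop non-allocating while avoiding this overflow is to rescale by the largest elementwise difference first, in the spirit of what norm does internally (a sketch; safedist is a hypothetical name, and the rescaling costs a second pass over the data):

```julia
# Sketch of an overflow-resistant, non-allocating Euclidean distance.
# `safedist` is a hypothetical name; the rescaling mimics what
# LinearAlgebra.norm does, at the cost of traversing the data twice.
function safedist(x, y)
    m = zero(float(promote_type(eltype(x), eltype(y))))
    for i in eachindex(x, y)              # pass 1: largest |x[i] - y[i]|
        @inbounds m = max(m, abs(x[i] - y[i]))
    end
    (iszero(m) || !isfinite(m)) && return m
    s = zero(m)
    for i in eachindex(x, y)              # pass 2: rescaled sum of squares
        @inbounds s += abs2((x[i] - y[i]) / m)
    end
    return m * sqrt(s)
end

safedist([1e300], [-1e300])  # 2.0e300, no spurious Inf
```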
In fact, not all of my examples suffer from this squaring-induced overflow: I explicitly included a norm-based one and noted that its comparative slowness was due to that exact consideration.

But I think you missed my point. The point of the other benchmarks was to show that a generator producing the elements of x-y was no slower (modulo the apples-to-oranges use of @simd) than a handwritten function doing the same thing with two input arrays.


I was indeed thinking of applications to clustering or removing duplicates, but not necessarily on small vectors.

I realise that I directed my question towards allocations only, whereas I was also (mainly) wondering why norm does not use @simd. I understand that the current implementation is general enough to handle iterators that may not be indexable, but wouldn’t it be relevant to use @simd whenever possible?

PR 43256 was going to rewrite norm to be careful about overflow only when necessary. Used in the style of norm2diff3, this is 5x faster. Not sure about simd on arrays; perhaps it could be further improved.
