# Pairwise distance using matrix or Vector{SVector}: The first is faster!

I am facing a really unexpected result: the pairwise distance computation using a matrix is actually *faster* than using a vector of static vectors, at least for small dimensions. Here is my simple setup:

```julia
using Distances, StaticArrays

D = 10
data = [rand(SVector{D}) for i in 1:1000]
metric = Chebyshev()

# copy the same data into a D×1000 matrix
datam = zeros(D, 1000)
for i in 1:1000
    datam[:, i] .= data[i]
end
```

Let's now use the `pairwise` implementation from `Distances`. For reference, the generic method has amazingly small source code:

```julia
function pairwise!(r::AbstractMatrix, metric::PreMetric, a::AbstractMatrix, b::AbstractMatrix)
    na = size(a, 2)
    nb = size(b, 2)
    size(r) == (na, nb) || throw(DimensionMismatch("Incorrect size of r."))
    @inbounds for j = 1:size(b, 2)
        bj = view(b, :, j)
        for i = 1:size(a, 2)
            r[i, j] = evaluate(metric, view(a, :, i), bj)
        end
    end
    r
end
```

and compare it with the just as simple:

```julia
const VV = Vector{<:SVector}

function pairwiseDS(x::VV, y::VV, metric::Metric)
    d = zeros(eltype(eltype(x)), length(x), length(y))
    for i in 1:length(x)
        for j in 1:length(y)
            @inbounds d[i, j] = evaluate(metric, x[i], y[j])
        end
    end
    return d
end
```

Benchmarks:

```julia
julia> @btime pairwiseDS($data, $data, $metric)
  47.305 ms (2 allocations: 7.63 MiB)

# method from Distances:
julia> @btime pairwise($metric, $datam, $datam)
  36.199 ms (2 allocations: 7.63 MiB)
```

Even though the actual distance evaluation is faster:

```julia
v = rand(D)
sv = SVector{D}(v)

@btime evaluate($metric, $v, $v)
  30.293 ns (0 allocations: 0 bytes)
@btime evaluate($metric, $sv, $sv)
  25.173 ns (0 allocations: 0 bytes)
```

Was it just an unreasonable expectation on my side that the `Vector{SVector}` method should be faster? But the individual distance evaluations *are* faster, so why isn't the entire computation faster?

I know that Distances uses BLAS operations for many of its pairwise distance computations, which is one of the reasons it is so much faster than the naive code you might write yourself. Is it possible that the package doesn't recognize that a vector of `SVector`s is memory-compatible with BLAS, and therefore falls back to a slower generic implementation instead of the clever BLAS-based one?
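(Editor's aside, not from the original thread: since the elements of a `Vector{SVector{D,Float64}}` are stored inline and contiguously, one way to probe this hypothesis is to reinterpret that vector as a `D×N` matrix without copying and hand it to the matrix-based method. A minimal sketch, assuming a concrete element type:)

```julia
using Distances, StaticArrays

D = 10
data = [rand(SVector{D,Float64}) for _ in 1:1000]

# SVectors are stored contiguously, so the vector of SVectors can be
# reinterpreted as a D×1000 matrix with no copy of the data:
datam = reshape(reinterpret(Float64, data), D, :)

# The reinterpreted view can be passed straight to the matrix method,
# using the same call signature as in the benchmarks above:
result = pairwise(Chebyshev(), datam, datam)
```

This only works because `SVector{D,Float64}` is an immutable, fixed-size type; for a vector of ordinary `Vector`s the elements are heap-allocated and no such zero-copy view exists.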

Chebyshev is not amenable to the BLAS optimization, but this is likely just a memory-access-order effect:

https://docs.julialang.org/en/v1/manual/performance-tips/#Access-arrays-in-memory-order,-along-columns-1
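(Editor's sketch, not from the thread: concretely, the tip above suggests swapping the two loops in `pairwiseDS` so that the inner loop writes `d[i, j]` down a column, matching Julia's column-major array layout. `pairwiseDS_colmajor` is a hypothetical name for the swapped variant:)

```julia
using Distances, StaticArrays

# Same algorithm as pairwiseDS, but with the loop order swapped:
# the inner loop now runs over i, so d[i, j] is filled column by
# column, in the order the matrix is laid out in memory.
function pairwiseDS_colmajor(x::Vector{<:SVector}, y::Vector{<:SVector}, metric::Metric)
    d = zeros(eltype(eltype(x)), length(x), length(y))
    @inbounds for j in eachindex(y)
        yj = y[j]
        for i in eachindex(x)
            d[i, j] = evaluate(metric, x[i], yj)
        end
    end
    return d
end

data = [rand(SVector{10,Float64}) for _ in 1:1000]
d = pairwiseDS_colmajor(data, data, Chebyshev())
```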


@kristoffer.carlsson I thought the same and was testing it. Indeed, putting the `j` loop first gives

```
  37.050 ms (2 allocations: 7.63 MiB)
```

versus the `Distances` timing:

```
  35.395 ms (2 allocations: 7.63 MiB)
```

The latter is still faster; I guess the reasons may not even be identifiable?

All reasons are identifiable if you're willing to work hard enough to figure out what's going on. The question is whether it's worth it to someone or not.
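(Editor's sketch, not from the thread: the usual starting point for "working hard enough" is Julia's built-in sampling profiler. The `pairwiseDS` here is a condensed one-line stand-in for the function defined earlier:)

```julia
using Distances, StaticArrays, Profile

D = 10
data = [rand(SVector{D,Float64}) for _ in 1:1000]
metric = Chebyshev()

# Condensed stand-in for the pairwiseDS defined above.
pairwiseDS(x, y, m) = [evaluate(m, xi, yj) for xi in x, yj in y]

# Warm up once so compilation time is excluded from the samples:
pairwiseDS(data, data, metric)

Profile.clear()
@profile for _ in 1:10
    pairwiseDS(data, data, metric)
end
Profile.print()   # text report; ProfileView.jl gives a flame graph
```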