Fastest algorithm for (xi - xj) * (yi - yj) for i in 1:n, j in 1:n

Suppose we are given two “big” vectors x and y of real numbers (e.g. Float64), both with the same length n. What is the fastest algorithm to compute

[(x[i] - x[j]) * (y[i] - y[j]) for i in 1:n, j in 1:n]

?

Here is a baseline:

using BenchmarkTools

n = 10_000
x = rand(n)
y = rand(n)

function baseline(x, y)
  n = length(x)
  D = Matrix{Float64}(undef, n, n)
  @inbounds for j in 1:n
    xj = x[j]
    yj = y[j]
    for i in 1:n
      xi = x[i]
      yi = y[i]
      D[i,j] = (xi - xj) * (yi - yj)
    end
  end
end

@btime baseline(x, y)
  168.321 ms (2 allocations: 762.94 MiB)

Appreciate any help with the following questions:

  1. Can you find a faster algorithm that exploits parallel hardware (e.g. GPU)?
  2. Is it possible to rely on FFT algorithms to compute the output efficiently?
  3. Given the proposed algorithm, what is the largest n assuming 32 GB of RAM?

A very simple observation, although it does not go into the “difficult” parts of the question: since the resulting matrix is symmetric, you can save half of the time and memory by computing only the upper or lower triangle.


Yes, I forgot to add this optimization to the baseline. Doing it now, thanks.

Actually, my attempt to exploit symmetry in row-major orientation led to a slower run time. I'd appreciate it if someone can help.

Loop vectorization?

I added using LinearAlgebra, changed the inner loop to for i in 1:j, and returned Symmetric(D) instead of just D. That reduced the @btime result from 396.495 ms to 201.799 ms on my computer (although it did not decrease memory usage).
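In code, the change might look like this (baseline_sym is just an illustrative name; Symmetric reads the upper triangle by default, which is exactly the part filled by i in 1:j):

using LinearAlgebra

function baseline_sym(x, y)
  n = length(x)
  D = Matrix{Float64}(undef, n, n)
  @inbounds for j in 1:n
    xj = x[j]
    yj = y[j]
    for i in 1:j  # fill only the upper triangle
      D[i,j] = (x[i] - xj) * (y[i] - yj)
    end
  end
  return Symmetric(D)  # lower triangle is implied; no copy is made
end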

Two further notes: (1) your example did not return any value; (2) it should be @btime baseline($x, $y), with interpolated arguments, in order to report more accurate results.

It seems already optimal, if you actually need a materialized n \times n array. There are n^2 different memory accesses and n^2 operations, so this is memory-limited. Basically, your function is already as fast as zeros(n, n).

No, this is not something an FFT can help with — you have n^2 outputs, so you can’t compute it in less than \Theta(n^2) operations.

Assuming you mean 32 GiB = 32 \times 2^{30} bytes, this is basically just solving 8n^2 = 32 \times 2^{30} \implies n = 2^{16} = 65536, since sizeof(Float64) == 8, i.e. finding the biggest matrix that fits in RAM. In practice, you can’t fill up all the RAM with your matrix, since you need some memory for your OS, for Julia, etcetera. But a few tens of thousands.
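A quick REPL check of that arithmetic:

julia> isqrt((32 * 2^30) ÷ sizeof(Float64))
65536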

Depending upon what you want to do with the matrix, the best thing may be to not store the matrix at all, but to compute its entries as needed. For example, if you are doing linear algebra with this matrix A, then you can exploit the fact that you can compute A z (matrix–vector products) with \Theta(n) operations and \Theta(n) storage. This works because (x_i - x_j)(y_i - y_j) expands to x_iy_i - x_iy_j - x_jy_i + x_jy_j, so each term of A z reduces to a sum or a dot product, e.g.:

matvec(x, y, z) = (x .* y) .* sum(z) .+ sum(x .* y .* z) .- y .* (x'z) .- x .* (y'z)

which gives the same results as baseline(x, y) * z (assuming you add a return D to the end of your baseline). So, many linear-algebra calculations can be made much more efficient by using iterative algorithms based on this.
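For instance, a quick check at a small n (not from the original post) that matvec agrees with the materialized matrix:

julia> n = 1_000; x, y, z = rand(n), rand(n), rand(n);

julia> A = [(x[i] - x[j]) * (y[i] - y[j]) for i in 1:n, j in 1:n];

julia> matvec(x, y, z) ≈ A * z
true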


stevengj’s suggestion of not materializing the matrix is great. If it doesn’t work out, one improvement might be to tile the calculation so that it becomes more cache-friendly. Something like:

for j1 in 1:K:n
    for i1 in 1:K:n
        for j in j1:min(j1+K-1, n)
            xj, yj = x[j], y[j]
            for i in i1:min(i1+K-1, n)
                D[i,j] = (x[i] - xj) * (y[i] - yj)
            end
        end
    end
end

For the right K, it could keep the x[i], x[j], y[i], y[j] accesses within cache.


@tturbo from LoopVectorization makes a difference:

julia> @btime baseline($x, $y)
  196.333 ms (2 allocations: 762.94 MiB)

julia> function baseline(x, y)
         n = length(x)
         D = Matrix{Float64}(undef, n, n)
         @tturbo for j in 1:n
           xj = x[j]
           yj = y[j]
           for i in 1:n
             xi = x[i]
             yi = y[i]
             D[i,j] = (xi - xj) * (yi - yj)
           end
         end
       end
baseline (generic function with 1 method)

julia> @btime baseline($x, $y)
  47.542 ms (2 allocations: 762.94 MiB)

julia> Threads.nthreads()
16

Thank you all. What about GPUs? Can we leverage some package to convert the loops into an efficient kernel?
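One possible approach (a sketch assuming CUDA.jl and an NVIDIA GPU; gpu_dissim is an illustrative name, and it is not benchmarked here) is to write the computation as a broadcast, which CUDA.jl fuses into a single kernel:

using CUDA

# x .- x' broadcasts a vector against a row vector, producing the
# n×n matrix of pairwise differences; the fused broadcast compiles
# to one GPU kernel
gpu_dissim(x, y) = (x .- x') .* (y .- y')

xd, yd = CuArray(x), CuArray(y)
D = gpu_dissim(xd, yd)  # a CuMatrix{Float64}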

Tullio?
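For reference, a Tullio version might be as short as this (a sketch; dissim is an illustrative name):

using Tullio, LoopVectorization  # LoopVectorization speeds up Tullio's CPU loops

# := allocates the output; Tullio writes the multithreaded loop nest
dissim(x, y) = @tullio D[i, j] := (x[i] - x[j]) * (y[i] - y[j])

According to the Tullio docs, loading CUDA.jl and KernelAbstractions.jl lets the same expression run on CuArray inputs.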

To give some context, this issue is showing up in a geostatistical clustering algorithm known as GHC. Given geospatial data, this algorithm constructs a dissimilarity matrix using both the geographic coordinates and the features to constrain the clusters spatially:

using GeoStats
import GLMakie

# sinusoidal field
Ω = georef((Z=[10sin(i/10) + j for i in 1:4:100, j in 1:4:100],))

Ω |> viewer

[image: sinusoidal field]

# request 20 clusters
Ω |> GHC(20, 1.0) |> viewer

[image: 20 clusters]

You can see the current implementation here, which uses Kronecker (kron) products of sparse matrices to avoid blowing up memory. Any suggestion for improvement is welcome.

I will take a look at Tullio.jl; it seems great. I never had a chance to try these packages, but now I have a good challenge to tackle 🙂
