Fastest algorithm for (xi - xj) * (yi - yj) for i in 1:n, j in 1:n

Suppose we are given two “big” vectors x and y of real numbers (e.g. Float64), both with the same length n. What is the fastest algorithm to compute

[(x[i] - x[j]) * (y[i] - y[j]) for i in 1:n, j in 1:n]

?

Here is a baseline:

using BenchmarkTools

n = 10_000
x = rand(n)
y = rand(n)

function baseline(x, y)
  n = length(x)
  D = Matrix{Float64}(undef, n, n)
  @inbounds for j in 1:n
    xj = x[j]
    yj = y[j]
    for i in 1:n
      xi = x[i]
      yi = y[i]
      D[i,j] = (xi - xj) * (yi - yj)
    end
  end
end

@btime baseline(x, y)
  168.321 ms (2 allocations: 762.94 MiB)

Appreciate any help with the following questions:

  1. Can you find a faster algorithm that exploits parallel hardware (e.g. GPU)?
  2. Is it possible to rely on FFT algorithms to compute the output efficiently?
  3. Given the proposed algorithm, what is the largest n assuming 32 GB of RAM?

A very simple observation, although it does not go into the “difficult” parts of the question: since the resulting matrix is symmetric, you can save half of the time and memory by computing only the upper or lower triangle.


Yes, I forgot to add this optimization to the baseline. Doing it now, thanks.

Actually, my attempt to exploit symmetry in row-major orientation led to a slower run time. I'd appreciate it if someone can help.

Loop vectorization?

I added using LinearAlgebra, changed the inner loop to for i in 1:j, and returned Symmetric(D) instead of just D. That reduced the @btime result from 396.495 ms to 201.799 ms on my computer (although it did not decrease memory usage).
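In code, the change might look like this (baseline_sym is just an illustrative name; Symmetric reads the upper triangle by default, which is exactly the part filled by i in 1:j):

using LinearAlgebra

function baseline_sym(x, y)
  n = length(x)
  D = Matrix{Float64}(undef, n, n)
  @inbounds for j in 1:n
    xj = x[j]
    yj = y[j]
    for i in 1:j  # fill only the upper triangle
      D[i,j] = (x[i] - xj) * (y[i] - yj)
    end
  end
  return Symmetric(D)  # lower triangle is implied; no copy is made
end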

Two further notes: (1) your example did not return any value; (2) it should be @btime baseline($x, $y), with interpolated arguments, in order to report more accurate results.

It seems already optimal, if you actually need a materialized n \times n array. There are n^2 different memory accesses and n^2 operations, so this is memory-limited. Basically, your function is already as fast as zeros(n, n).

No, this is not something an FFT can help with — you have n^2 outputs, so you can’t compute it in less than \Theta(n^2) operations.

Assuming you mean 32 GiB = 32 \times 2^{30} bytes, this is basically just solving 8n^2 = 32 \times 2^{30} \implies n = 2^{16} = 65536, since sizeof(Float64) == 8, i.e. finding the biggest matrix that fits in RAM. In practice, you can’t fill up all the RAM with your matrix, since you need some memory for your OS, for Julia, etcetera. But a few tens of thousands.
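A quick REPL check of that arithmetic:

julia> isqrt((32 * 2^30) ÷ sizeof(Float64))
65536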

Depending upon what you want to do with the matrix, the best thing may be to not store the matrix at all, but to compute its entries as needed. For example, if you are doing linear algebra with this matrix A, then you can exploit the fact that you can compute A z (matrix–vector products) with \Theta(n) operations and \Theta(n) storage. This works because (x_i - x_j)(y_i - y_j) expands to x_iy_i - x_iy_j - x_jy_i + x_jy_j, so each term of A z reduces to a sum or a dot product, e.g.:

matvec(x, y, z) = (x .* y) .* sum(z) .+ sum(x .* y .* z) .- y .* (x'z) .- x .* (y'z)

which gives the same results as baseline(x, y) * z (assuming you add a return D to the end of your baseline). So, many linear-algebra calculations can be made much more efficient by using iterative algorithms based on this.
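For instance, a quick check at a small n (not from the original post) that matvec agrees with the materialized matrix:

julia> n = 1_000; x, y, z = rand(n), rand(n), rand(n);

julia> A = [(x[i] - x[j]) * (y[i] - y[j]) for i in 1:n, j in 1:n];

julia> matvec(x, y, z) ≈ A * z
true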


stevengj’s suggestion of not materializing the matrix is great. If it doesn’t work out, one improvement might be to tile the calculation so that it becomes more cache-friendly. Something like:

for j1 in 1:K:n
    for i1 in 1:K:n
        for j in j1:min(j1+K-1, n)
            xj, yj = x[j], y[j]
            for i in i1:min(i1+K-1, n)
                D[i,j] = (x[i] - xj) * (y[i] - yj)
            end
        end
    end
end

For the right K, it could keep the x[i], x[j], y[i], y[j] accesses within cache.


@tturbo from LoopVectorization makes a difference:

julia> @btime baseline($x, $y)
  196.333 ms (2 allocations: 762.94 MiB)

julia> function baseline(x, y)
         n = length(x)
         D = Matrix{Float64}(undef, n, n)
         @tturbo for j in 1:n
           xj = x[j]
           yj = y[j]
           for i in 1:n
             xi = x[i]
             yi = y[i]
             D[i,j] = (xi - xj) * (yi - yj)
           end
         end
       end
baseline (generic function with 1 method)

julia> @btime baseline($x, $y)
  47.542 ms (2 allocations: 762.94 MiB)

julia> Threads.nthreads()
16

Thank you all. What about GPUs? Can we leverage some package to convert the loops into an efficient kernel?
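One possible approach (a sketch assuming CUDA.jl and an NVIDIA GPU; gpu_dissim is an illustrative name, and it is not benchmarked here) is to write the computation as a broadcast, which CUDA.jl fuses into a single kernel:

using CUDA

# x .- x' broadcasts a vector against a row vector, producing the
# n×n matrix of pairwise differences; the fused broadcast compiles
# to one GPU kernel
gpu_dissim(x, y) = (x .- x') .* (y .- y')

xd, yd = CuArray(x), CuArray(y)
D = gpu_dissim(xd, yd)  # a CuMatrix{Float64}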

Tullio?
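For reference, a Tullio version might be as short as this (a sketch; dissim is an illustrative name):

using Tullio, LoopVectorization  # LoopVectorization speeds up Tullio's CPU loops

# := allocates the output; Tullio writes the multithreaded loop nest
dissim(x, y) = @tullio D[i, j] := (x[i] - x[j]) * (y[i] - y[j])

According to the Tullio docs, loading CUDA.jl and KernelAbstractions.jl lets the same expression run on CuArray inputs.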

To give some context, this issue is showing up in a geostatistical clustering algorithm known as GHC. Given geospatial data, this algorithm constructs a dissimilarity matrix using both the geographic coordinates and the features to constrain the clusters spatially:

using GeoStats
import GLMakie

# sinusoidal field
Ω = georef((Z=[10sin(i/10) + j for i in 1:4:100, j in 1:4:100],))

Ω |> viewer

[image: sinusoidal field]

# request 20 clusters
Ω |> GHC(20, 1.0) |> viewer

[image: 20 clusters]

You can see the current implementation here, which uses Kronecker (kron) products of sparse matrices to avoid blowing up memory. Any suggestion for improvement is welcome.

I will take a look at Tullio.jl; it seems great. I never had a chance to try these packages, but now I have a good challenge to tackle 🙂
