I wonder the most efficient way to calculate the pairwise correlation between rows of a `Matrix{Float64}`

with a large number of rows and columns. I know “*large*” is subjective. Suppose the *large* is `(5000, 50000)`

(meaning `5000`

rows, and `50000`

columns). I asked GitHub Copilot chat for a suggestion about a high-performance code in this regard, and I got the following snippet as the suggestion:

```
using LinearAlgebra
function pairwise_correlations(A::Matrix{Float64})
n = size(A, 1)
C = similar(A, n, n)
for i in 1:n
BLAS.ger!(1.0, A[i, :], A[i, :], C)
end
BLAS.syrk!('L', 'T', 1.0, C, 0.0, C)
for i in 1:n
C[i, i] = 1.0
end
return C
end
```

And the explanation:

Note that calculating pairwise correlations between a large number of rows can be computationally expensive, so it’s important to consider the performance implications of this operation. The

`BLAS`

functions used in this example code snippet are highly optimized and can provide significant performance improvements over other methods, but they may not be the most efficient solution for all use cases.

Surprisingly, the code is super slow that I couldn’t measure it using `@benchmark`

. Still, the following snippet works better:

```
julia> a = rand(5000, 50000);
julia> using Statistics
julia> @benchmark cor($a, dims=2)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 23.153 s (0.00% GC) to evaluate,
with a memory estimate of 2.05 GiB, over 18 allocations.
```

I wonder how this can be improved.

Another suggested snippet is:

```
using Distributed
using DistributedArrays
# Start a local cluster with 4 worker processes
addprocs(4)
@everywhere using LinearAlgebra
function pairwise_correlations(A::Matrix{Float64})
n = size(A, 1)
C = drandn(n, n)
for i in 1:n
BLAS.ger!(1.0, A[i, :], A[i, :], C)
end
C = convert(Matrix{Float64}, C)
C = DistributedArray(C)
C = C * C'
for i in 1:n
C[i, i] = 1.0
end
return C
end
```

with the following explanation:

This code starts a local cluster with 4 worker processes using the

`addprocs`

function, and then uses the`@everywhere`

macro to load the`LinearAlgebra`

package on all worker processes. The`pairwise_correlations`

function then creates a distributed array`C`

using the`drandn`

function, and uses`BLAS.ger!`

to calculate the outer product of each row. The resulting distributed array`C`

is then converted to a regular`Matrix{Float64}`

and multiplied by its transpose using the`*`

operator. Finally, the diagonal elements of`C`

are set to 1.0 to ensure that the correlation between a row and itself is 1.0.

Note that this implementation uses a large amount of memory due to the creation of the distributed array

`C`

. If memory usage is a concern, you may want to consider using a more memory-efficient algorithm or optimizing the code further.

That throws an error:

```
julia> a = rand(20, 30);
julia> pairwise_correlations(a)
ERROR: TaskFailedException
nested task error: On worker 2:
KeyError: key DistributedArrays [aaf54ef3-cdf8-58ed-94cc-d582ad619b94] not found
```

P.s.:

It’s not mandatory to be between pairs of rows. It can be between pairs of columns!