# The most efficient way to calculate the pairwise correlation between rows of a large Matrix{Float64}

I’m wondering about the most efficient way to calculate the pairwise correlation between rows of a `Matrix{Float64}` with a large number of rows and columns. I know “large” is subjective; suppose it means a matrix of size `(5000, 50000)` (i.e., `5000` rows and `50000` columns). I asked GitHub Copilot Chat for a high-performance suggestion, and I got the following snippet:

``````julia
using LinearAlgebra

function pairwise_correlations(A::Matrix{Float64})
    n = size(A, 1)
    C = similar(A, n, n)
    for i in 1:n
        BLAS.ger!(1.0, A[i, :], A[i, :], C)
    end
    BLAS.syrk!('L', 'T', 1.0, C, 0.0, C)
    for i in 1:n
        C[i, i] = 1.0
    end
    return C
end
``````

And the explanation:

Note that calculating pairwise correlations between a large number of rows can be computationally expensive, so it’s important to consider the performance implications of this operation. The `BLAS` functions used in this example code snippet are highly optimized and can provide significant performance improvements over other methods, but they may not be the most efficient solution for all use cases.

Surprisingly, the code is so slow that I couldn’t even measure it using `@benchmark`. The following snippet works better:

``````julia> a = rand(5000, 50000);

julia> using Statistics

julia> @benchmark cor($a, dims=2)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 23.153 s (0.00% GC) to evaluate,
with a memory estimate of 2.05 GiB, over 18 allocations.
``````

I wonder how this can be improved.
Another suggested snippet is:

``````julia
using Distributed
using DistributedArrays

# Start a local cluster with 4 worker processes
@everywhere using LinearAlgebra

function pairwise_correlations(A::Matrix{Float64})
    n = size(A, 1)
    C = drandn(n, n)
    for i in 1:n
        BLAS.ger!(1.0, A[i, :], A[i, :], C)
    end
    C = convert(Matrix{Float64}, C)
    C = DistributedArray(C)
    C = C * C'
    for i in 1:n
        C[i, i] = 1.0
    end
    return C
end
``````

with the following explanation:

This code starts a local cluster with 4 worker processes using the `addprocs` function, and then uses the `@everywhere` macro to load the `LinearAlgebra` package on all worker processes. The `pairwise_correlations` function then creates a distributed array `C` using the `drandn` function, and uses `BLAS.ger!` to calculate the outer product of each row. The resulting distributed array `C` is then converted to a regular `Matrix{Float64}` and multiplied by its transpose using the `*` operator. Finally, the diagonal elements of `C` are set to 1.0 to ensure that the correlation between a row and itself is 1.0.

Note that this implementation uses a large amount of memory due to the creation of the distributed array `C`. If memory usage is a concern, you may want to consider using a more memory-efficient algorithm or optimizing the code further.

That throws an error:

``````julia> a = rand(20, 30);

julia> pairwise_correlations(a)

nested task error: On worker 2:
``````

P.s.:
It’s not mandatory to be between pairs of rows. It can be between pairs of columns!

Re: the title. I guess you mean “a large matrix”, not “a matrix with a large number of dimensions” (matrices have exactly two dimensions).


Yes, exactly. I’ll edit it.

Hi there,

For your first example, I don’t know what the BLAS functions are doing, but my best guess is that the low performance comes from the systematic copies induced when you write `A[i, :]`. I guess Copilot hasn’t read the documentation on views. I’ll try it to see how much it improves.
EDIT: `@views` reduces allocations but not runtime; I need to investigate some more. Maybe LoopVectorization or OnlineStats could help?
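To illustrate what `@views` changes (a toy example, names mine): plain slicing like `A[i, :]` allocates a copy of the row on every use, whereas a view reads the same memory in place.

``````julia
using LinearAlgebra

A = rand(4, 6)

# Plain indexing copies both rows before the dot product...
r_copy = dot(A[1, :], A[2, :])

# ...while @views replaces the slices with no-copy SubArrays over A.
r_view = @views dot(A[1, :], A[2, :])
``````

Both give the same dot product; the view version just skips the two temporary vectors.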

For your second example, if the matrix fits in the RAM on a single core, I’m not sure distributed computing is your best option: multithreading might be a better fit?
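If multithreading is worth exploring, a minimal sketch might look like the following (the name `rowcorr_threaded` and the loop structure are mine, and this is the naive O(n²m) double loop, not a tuned implementation):

``````julia
using LinearAlgebra, Statistics

# Naive multithreaded pairwise row correlations: center the rows once,
# then split the upper-triangle dot products across threads.
function rowcorr_threaded(A::AbstractMatrix)
    B = A .- mean(A, dims=2)              # center each row (A is untouched)
    d = vec(sqrt.(sum(abs2, B; dims=2)))  # centered row norms
    n = size(B, 1)
    C = Matrix{Float64}(I, n, n)          # diagonal is 1 by construction
    Threads.@threads for i in 1:n
        for j in i+1:n
            c = @views dot(B[i, :], B[j, :]) / (d[i] * d[j])
            C[i, j] = c
            C[j, i] = c
        end
    end
    return C
end
``````

For small matrices this matches `cor(A, dims=2)` up to roundoff; whether it beats a single multithreaded BLAS call on large inputs depends on thread count and matrix shape.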

More worryingly, the Copilot code seems to error when `A` is not square? This suggests that you’re not computing what you want.

@gdalle Hi!

I’ll try it to see how much it improves.

Thank you so much! I am unsure how much improvement `@views` can bring, since the amount of computation is enormous.

multithreading might be a better fit?

Yes, that could help.

the Copilot code seems to error when A is not square?

On which suggestion? The first one? Yes, I get an error with both square and non-square matrices.

I’m not sure about Copilot Chat’s suggestions! I’m not sure they calculate what I want. Hence, it is unreasonable to spend time on `LoopVectorization.jl`, for example, unless we are confident in the code.

By default I don’t trust Copilot, especially in this case because it errors, so if you use LoopVectorization.jl, definitely do it on something you wrote.


EDIT: I did not account for the means. Sorry, “correlation” is used to mean a lot of things in different places and I didn’t look into what definition you were using. One should do that first and might as well do the normalization up-front while they’re at it.
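As the edit says, one can handle the means and the normalization up front, after which the correlation matrix is a plain matrix product. A quick sketch checked against `Statistics.cor` (array names mine):

``````julia
using LinearAlgebra, Statistics

A = rand(3, 10)
B = A .- mean(A, dims=2)            # center each row
B ./= sqrt.(sum(abs2, B; dims=2))   # scale each row to unit norm
C = B * B'                          # C[i, j] is now the Pearson correlation of rows i and j
``````

Here `C ≈ cor(A, dims=2)` holds up to floating-point roundoff.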

``````julia
function rowcorr(x::AbstractMatrix)
    d = sum(abs2, x; dims=2)
    d .= sqrt.(d)       # row norms
    corr = x * x'       # compute correlation
    corr ./= (d .* d')  # normalize correlation
    return corr
end
``````

For large inputs, the dominant cost here is correlating each row with each other row. You’ll want to use an optimized linear algebra routine for this because you won’t beat it with something hand-written.

The only major savings we can get here is a factor of 2 by only computing half the matrix. `syrk`/`herk` are the tools for this, where applicable; but where they aren’t, you’re probably better off using an optimized routine to compute the full matrix rather than a non-optimized one to compute half.

``````julia
function rowcorr2(x::AbstractMatrix{T}) where T
    d = sum(abs2, x; dims=2)
    d .= sqrt.(d) # row norms
    if T <: LinearAlgebra.BlasReal
        corr = LinearAlgebra.BLAS.syrk('U', 'N', one(real(T)), x) # upper half of x * x'
        LinearAlgebra.copytri!(corr, 'U', true) # copy upper to lower half
    elseif T <: LinearAlgebra.BlasComplex
        corr = LinearAlgebra.BLAS.herk('U', 'N', one(real(T)), x) # upper half of x * x'
        LinearAlgebra.copytri!(corr, 'U', true) # copy upper to lower half
    else
        corr = x * x'
    end
    corr ./= (d .* d')
    # note: rather than copytri! above, we could `return Hermitian(corr, :U)`
    return corr
end
``````

The remaining thing to do here would be to only apply the normalization to the relevant half of the matrix, but this is an O(M^2) operation that will be much cheaper than the O(M^2 N) matrix multiply for large inputs. We could also save a tiny bit by returning a `Hermitian`-wrapped matrix so we don’t have to copy the top half to the bottom either, but again this is only an O(M^2) saving.
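A sketch of that variant (the name `rowcorr_half` is mine; like the snippet above, it skips the mean subtraction): normalize only the stored upper triangle and return a `Hermitian` wrapper instead of calling `copytri!`.

``````julia
using LinearAlgebra

function rowcorr_half(x::Matrix{Float64})
    d = vec(sqrt.(sum(abs2, x; dims=2)))  # row norms
    corr = BLAS.syrk('U', 'N', 1.0, x)    # upper half of x * x'; lower half is junk
    for j in axes(corr, 2), i in 1:j      # normalize only the stored half
        corr[i, j] /= d[i] * d[j]
    end
    return Hermitian(corr, :U)            # mirrors the upper half on access
end
``````

Indexing the `Hermitian` result reads the lower half off the upper one, so downstream code sees a full matrix without the extra copy pass.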


Thanks for the great explanation and the time you’ve put into this. Unfortunately, the result doesn’t match the one produced by `Statistics.cor`:

``````julia> foo = rand(2, 5);

julia> rowcorr2(foo) == cor(foo, dims=2)
false

julia> rowcorr2(foo)
2×2 Matrix{Float64}:
1.0       0.860513
0.860513  1.0

julia> cor(foo, dims=2)
2×2 Matrix{Float64}:
1.0       0.222276
0.222276  1.0
``````

Likely the difference is that @mikmoore did not subtract the means first. Don’t want to forget about that!


Hence, by adding the mean subtraction, the code becomes:

code
``````julia
using LinearAlgebra
using Statistics  # for `mean`

function rowcorr2(x::AbstractMatrix{T}) where T
    x .-= mean(x, dims=2)  # subtract the row means (note: mutates the input)
    d = sum(abs2, x; dims=2)
    d .= sqrt.(d) # row norms
    if T <: LinearAlgebra.BlasReal
        corr = LinearAlgebra.BLAS.syrk('U', 'N', one(real(T)), x) # upper half of x * x'
        LinearAlgebra.copytri!(corr, 'U', true) # copy upper to lower half
    elseif T <: LinearAlgebra.BlasComplex
        corr = LinearAlgebra.BLAS.herk('U', 'N', one(real(T)), x) # upper half of x * x'
        LinearAlgebra.copytri!(corr, 'U', true) # copy upper to lower half
    else
        corr = x * x'
    end
    corr ./= (d .* d')
    # note: rather than copytri! above, we could `return Hermitian(corr, :U)`
    return corr
end
``````

# Benchmarking

row-wise
``````julia> using Statistics

julia> foo = rand(2, 5);

julia> bar = rowcorr2(foo);

julia> baz = cor(foo, dims=2);

julia> bar ≈ baz
true

julia> foo = rand(5000, 50000);

julia> @benchmark rowcorr2($foo)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 22.671 s (0.01% GC) to evaluate,
with a memory estimate of 190.81 MiB, over 20 allocations.

julia> @benchmark cor($foo, dims=2)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 21.120 s (0.01% GC) to evaluate,
with a memory estimate of 2.05 GiB, over 18 allocations.
``````

Not much difference in speed, but much more memory-efficient.

``````julia> foo = rand(50, 2000);

julia> @benchmark rowcorr2($foo)
BenchmarkTools.Trial: 7806 samples with 1 evaluation.
Range (min … max):  588.300 μs …   9.986 ms  ┊ GC (min … max): 0.00% … 92.54%
Time  (median):     604.800 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   635.300 μs ± 197.516 μs  ┊ GC (mean ± σ):  0.51% ±  1.81%

▇█▅▃▂▁▁▁▁▁▁                                                  ▁
██████████████▇▇▇▆▆▆▆▆▅▆▅▆▅▅▅▄▄▄▄▄▅▄▄▃▅▄▅▄▅▄▅▅▃▆▄▆▅▆▇▆▇▇▆▆▅▂▅ █
588 μs        Histogram: log(frequency) by time       1.03 ms <

Memory estimate: 21.27 KiB, allocs estimate: 18.

julia> @benchmark cor($foo, dims=2)
BenchmarkTools.Trial: 6062 samples with 1 evaluation.
Range (min … max):  552.500 μs …   5.785 ms  ┊ GC (min … max): 0.00% … 83.36%
Time  (median):     874.200 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   820.093 μs ± 454.588 μs  ┊ GC (mean ± σ):  6.21% ±  9.79%

█▁    █▆▃▂▂   ▁                                               ▁
██▆▄▄▇██████▇▆█▇▆▄▃▃▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▅▆▃▅▄▅▅▅▁▅▄ █
552 μs        Histogram: log(frequency) by time       3.49 ms <

Memory estimate: 802.48 KiB, allocs estimate: 16.
``````

On mid-size data, `rowcorr2` performs better in both speed and memory.

column-wise
modified code
``````julia
using LinearAlgebra
using Statistics  # for `mean`

function colcorr2(x::AbstractMatrix{T}) where T
    x .-= mean(x, dims=1)  # subtract the column means (note: mutates the input)
    d = sum(abs2, x; dims=1)
    d .= sqrt.(d) # column norms
    if T <: LinearAlgebra.BlasReal
        corr = LinearAlgebra.BLAS.syrk('U', 'T', one(real(T)), x) # upper half of x' * x
        LinearAlgebra.copytri!(corr, 'U', true) # copy upper to lower half
    elseif T <: LinearAlgebra.BlasComplex
        corr = LinearAlgebra.BLAS.herk('U', 'C', one(real(T)), x) # upper half of x' * x (herk takes 'C', not 'T')
        LinearAlgebra.copytri!(corr, 'U', true) # copy upper to lower half
    else
        corr = x' * x
    end
    corr ./= (d' .* d)
    # note: rather than copytri! above, we could `return Hermitian(corr, :U)`
    return corr
end
``````
``````julia> foo = rand(2000, 50);

julia> @benchmark colcorr2($foo)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  470.000 μs …  56.154 ms  ┊ GC (min … max): 0.00% … 98.97%
Time  (median):     483.900 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   495.929 μs ± 557.303 μs  ┊ GC (mean ± σ):  1.12% ±  0.99%

▂▆██▇▅▅▄▃▂▂▁▁                                              ▂
█▇▅███████████████▇▇█▇▇▆▆▇▇▇▇█▇▇▆▇▇▇█▇█▆▇▇▆▅▆▆▅▅▆▄▄▅▅▅▄▄▄▃▄▄▃ █
470 μs        Histogram: log(frequency) by time        596 μs <

Memory estimate: 21.11 KiB, allocs estimate: 10.

julia> @benchmark cor($foo, dims=1)
BenchmarkTools.Trial: 5956 samples with 1 evaluation.
Range (min … max):  469.400 μs …  20.202 ms  ┊ GC (min … max): 0.00% … 94.30%
Time  (median):     811.250 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   834.909 μs ± 854.873 μs  ┊ GC (mean ± σ):  6.55% ±  6.14%

▄                            ▂▃█▃   ▂
█▅▃▂▂▂▂▂▂▂▁▁▂▂▁▂▁▂▂▂▂▁▂▁▂▁▁▂▄████▅▄▄██▆▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
469 μs           Histogram: frequency by time         1.13 ms <

Memory estimate: 802.41 KiB, allocs estimate: 12.

julia> a = rand(50000, 5000);

julia> @benchmark colcorr2($foo)
BenchmarkTools.Trial: 85 samples with 1 evaluation.
Range (min … max):  52.571 ms … 105.451 ms  ┊ GC (min … max): 0.00% … 19.36%
Time  (median):     55.065 ms               ┊ GC (median):    0.00%
Time  (mean ± σ):   59.238 ms ±  10.227 ms  ┊ GC (mean ± σ):  5.61% ±  9.54%

█▁ ▂
██▆█▆▆▄▁▃▃▄▃▄▁▃▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▃▃▃▁▄▁▃▁▁▁▁▁▁▁▃▃▁▁▁▃▁▁▁▁▁▁▁▁▃ ▁
52.6 ms         Histogram: frequency by time           92 ms <

Memory estimate: 30.55 MiB, allocs estimate: 10.

julia> @benchmark cor($foo, dims=1)
BenchmarkTools.Trial: 87 samples with 1 evaluation.
Range (min … max):  53.600 ms … 81.556 ms  ┊ GC (min … max): 0.00% … 23.03%
Time  (median):     54.265 ms              ┊ GC (median):    0.00%
Time  (mean ± σ):   57.915 ms ±  8.148 ms  ┊ GC (mean ± σ):  5.56% ±  9.78%

▅█▂
███▆▅▆▁▁▅▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▅▁▁▇▁▁▇▁▅▁▆▅▁▁▅ ▁
53.6 ms      Histogram: log(frequency) by time      79.8 ms <

Memory estimate: 31.31 MiB, allocs estimate: 12.
``````

# Summary

|             | data size       | Speed            | Memory     |
|-------------|-----------------|------------------|------------|
| row-wise    | Mid-size data   | `rowcorr2`       | `rowcorr2` |
| row-wise    | Large-size data | `Statistics.cor` | `rowcorr2` |
| column-wise | Mid-size data   | `colcorr2`       | `colcorr2` |
| column-wise | Large-size data | `Statistics.cor` | `colcorr2` |

It appears that on large data, `Statistics.cor` is slightly faster than the corresponding custom function, both row-wise and column-wise, while on mid-size data the custom functions win on both speed and memory. It is worth noting that the custom functions allocate far less memory in every setting. So although `Statistics.cor` is a bit faster on large data, it demands a huge amount of memory compared with the custom functions; in my view, the small speed advantage doesn’t outweigh that memory cost.

Thank you so much
@gdalle
@DNF
@mikmoore
@tbeason
