The most efficient way to calculate the pairwise correlation between rows of a large Matrix{Float64}

Shayan · May 24, 2023, 8:37am

I wonder the most efficient way to calculate the pairwise correlation between rows of a Matrix{Float64} with a large number of rows and columns. I know “large” is subjective. Suppose the large is (5000, 50000) (meaning 5000 rows, and 50000 columns). I asked GitHub Copilot chat for a suggestion about a high-performance code in this regard, and I got the following snippet as the suggestion:

using LinearAlgebra

function pairwise_correlations(A::Matrix{Float64})
    n = size(A, 1)
    C = similar(A, n, n)
    for i in 1:n
        BLAS.ger!(1.0, A[i, :], A[i, :], C)
    end
    BLAS.syrk!('L', 'T', 1.0, C, 0.0, C)
    for i in 1:n
        C[i, i] = 1.0
    end
    return C
end

And the explanation:

Note that calculating pairwise correlations between a large number of rows can be computationally expensive, so it’s important to consider the performance implications of this operation. The BLAS functions used in this example code snippet are highly optimized and can provide significant performance improvements over other methods, but they may not be the most efficient solution for all use cases.

Surprisingly, the code is super slow that I couldn’t measure it using @benchmark. Still, the following snippet works better:

julia> a = rand(5000, 50000);

julia> using Statistics

julia> @benchmark cor($a, dims=2)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 23.153 s (0.00% GC) to evaluate,
 with a memory estimate of 2.05 GiB, over 18 allocations.

I wonder how this can be improved.
Another suggested snippet is:

using Distributed
using DistributedArrays

# Start a local cluster with 4 worker processes
addprocs(4)
@everywhere using LinearAlgebra

function pairwise_correlations(A::Matrix{Float64})
    n = size(A, 1)
    C = drandn(n, n)
    for i in 1:n
        BLAS.ger!(1.0, A[i, :], A[i, :], C)
    end
    C = convert(Matrix{Float64}, C)
    C = DistributedArray(C)
    C = C * C'
    for i in 1:n
        C[i, i] = 1.0
    end
    return C
end

with the following explanation:

This code starts a local cluster with 4 worker processes using the addprocs function, and then uses the @everywhere macro to load the LinearAlgebra package on all worker processes. The pairwise_correlations function then creates a distributed array C using the drandn function, and uses BLAS.ger! to calculate the outer product of each row. The resulting distributed array C is then converted to a regular Matrix{Float64} and multiplied by its transpose using the * operator. Finally, the diagonal elements of C are set to 1.0 to ensure that the correlation between a row and itself is 1.0.

Note that this implementation uses a large amount of memory due to the creation of the distributed array C. If memory usage is a concern, you may want to consider using a more memory-efficient algorithm or optimizing the code further.

That throws an error:

julia> a = rand(20, 30);

julia> pairwise_correlations(a)
ERROR: TaskFailedException

    nested task error: On worker 2:
    KeyError: key DistributedArrays [aaf54ef3-cdf8-58ed-94cc-d582ad619b94] not found

P.s.:
It’s not mandatory to be between pairs of rows. It can be between pairs of columns!

DNF · May 24, 2023, 9:22am

Re: the title. I guess you mean “a large matrix”, not “matrix with a large number of dimensions” (matrices have exactly two dimensions.)

Shayan · May 24, 2023, 9:23am

Yes, exactly. I’ll edit it.

gdalle · May 24, 2023, 10:19am

Hi there,

For your first example, I don’t know what the BLAS functions are doing but my best guess is the low perf comes from the systematic copies induced when you write A[i, :]. I guess Copilot hasn’t read the documentation on views I’ll try it to see how much it improves.
EDIT: @views reduces allocations but not runtime, gotta investigate some more. Maybe LoopVectorization or OnlineStats could help?

For your second example, if the matrix fits in the RAM on a single core, I’m not sure distributed computing is your best option: multithreading might be a better fit?

More worryingly, the Copilot code seems to error when A is not square? This suggests that you’re not computing what you want.

Shayan · May 24, 2023, 10:48am

@gdalle Hi!

I’ll try it to see how much it improves.

Thank you so much! I am unsure how much improvement can be made by using @views since the amount of computation is enormous.

multithreading might be a better fit?

Yes, it can enhance it.

the Copilot code seems to error when A is not square?

On which suggestion? First one? Yes, I get an error in either a square matrix or a non-square matrix.

Shayan · May 24, 2023, 11:00am

I’m not sure about the Copilot Chat’s suggestions! I’m not sure if it calculates what I want. Hence, it is unreasonable to put time on LoopVectorization.jl for example, unless we are confident with the code.

gdalle · May 24, 2023, 11:14am

By default I don’t trust Copilot, especially in this case because it errors, so if you use LoopVectorization.jl, definitely do it on something you wrote

mikmoore · May 24, 2023, 3:19pm

EDIT: I did not account for the means. Sorry, “correlation” is used to mean a lot of things in different places and I didn’t look into what definition you were using. One should do that first and might as well do the normalization up-front while they’re at it.

How about starting simple?

function rowcorr(x::AbstractMatrix)
	d = sum(abs2, x; dims=2)
	d .= sqrt.(d) # row norms
	corr = x * x' # compute correlation
	corr ./= (d .* d') # normalize correlation
	return corr
end

For large inputs, the dominant cost here is correlating each row with each other row. You’ll want to use an optimized linear algebra routine for this because you won’t beat it with something hand-written.

The only major savings we can get here is a factor of 2 by only computing half the matrix. syrk/herk are the tools for this, where applicable, but where they aren’t you’re probably better off using an optimized routine to compute the full matrix rather than a non-optimized one to compute half.

function rowcorr2(x::AbstractMatrix{T}) where T
	d = sum(abs2, x; dims=2)
	d .= sqrt.(d) # row norms
	if T <: LinearAlgebra.BlasReal
		corr = LinearAlgebra.BLAS.syrk('U', 'N', one(real(T)), x) # upper half of x * x'
		LinearAlgebra.copytri!(corr, 'U', true) # copy upper to lower half
	elseif T <: LinearAlgebra.BlasComplex
		corr = LinearAlgebra.BLAS.herk('U', 'N', one(real(T)), x) # upper half of x * x'
		LinearAlgebra.copytri!(corr, 'U', true) # copy upper to lower half
	else
		corr = x * x'
	end
	corr ./= (d .* d')
	# note: rather than copytri! above, we could `return Hermitian(corr,:U)`
	return corr
end

The remaining thing to do here would be to only apply the normalization to the relevant half of the matrix, but this is an O(M^2) operation that will be much cheaper than the O(M^2N) matrix multiply for large inputs. We could also save a teensie bit by returning a Hermitian-wrapped matrix so we don’t have to copy the top half to the bottom either, but again this is O(M^2) savings.

Shayan · May 24, 2023, 3:43pm

Thanks for the great explanation and time you’ve put into this. Unfortunately, the result isn’t equal to the one that is achieved by Statistics.cor:

julia> foo = rand(2, 5);

julia> rowcorr2(foo) == cor(foo, dims=2)
false

julia> rowcorr2(foo)
2×2 Matrix{Float64}:
 1.0       0.860513
 0.860513  1.0

julia> cor(foo, dims=2)
2×2 Matrix{Float64}:
 1.0       0.222276
 0.222276  1.0

tbeason · May 24, 2023, 3:49pm

Likely the difference is that @mikmoore did not subtract the means first. Don’t want to forget about that!

Shayan · May 25, 2023, 11:02am

Hence, by adding the mean subtraction, the code will be:

code

using LinearAlgebra

function rowcorr2(x::AbstractMatrix{T}) where T
	# subtract mean
	x .-= mean(x, dims=2)
	d = sum(abs2, x; dims=2)
	d .= sqrt.(d) # row norms
	if T <: LinearAlgebra.BlasReal
		corr = LinearAlgebra.BLAS.syrk('U', 'N', one(real(T)), x) # upper half of x * x'
		LinearAlgebra.copytri!(corr, 'U', true) # copy upper to lower half
	elseif T <: LinearAlgebra.BlasComplex
		corr = LinearAlgebra.BLAS.herk('U', 'N', one(real(T)), x) # upper half of x * x'
		LinearAlgebra.copytri!(corr, 'U', true) # copy upper to lower half
	else
		corr = x * x'
	end
	corr ./= (d .* d')
	# note: rather than copytri! above, we could `return Hermitian(corr,:U)`
	return corr
end

Benchmarking

row-wise

julia> using Statistics

julia> foo = rand(2, 5);

julia> bar = rowcorr2(foo);

julia> baz = cor(foo, dims=2);

julia> bar ≈ baz
true

julia> foo = rand(5000, 50000);

julia> @benchmark rowcorr2($foo)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 22.671 s (0.01% GC) to evaluate,
 with a memory estimate of 190.81 MiB, over 20 allocations.

julia> @benchmark cor($foo, dims=2)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 21.120 s (0.01% GC) to evaluate,
 with a memory estimate of 2.05 GiB, over 18 allocations.

Not much difference regarding speed, but much more memory efficient.

julia> foo = rand(50, 2000);

julia> @benchmark rowcorr2($foo)
BenchmarkTools.Trial: 7806 samples with 1 evaluation.
 Range (min … max):  588.300 μs …   9.986 ms  ┊ GC (min … max): 0.00% … 92.54%   
 Time  (median):     604.800 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   635.300 μs ± 197.516 μs  ┊ GC (mean ± σ):  0.51% ±  1.81%   

   ▇█▅▃▂▁▁▁▁▁▁                                                  ▁
  ██████████████▇▇▇▆▆▆▆▆▅▆▅▆▅▅▅▄▄▄▄▄▅▄▄▃▅▄▅▄▅▄▅▅▃▆▄▆▅▆▇▆▇▇▆▆▅▂▅ █
  588 μs        Histogram: log(frequency) by time       1.03 ms <

 Memory estimate: 21.27 KiB, allocs estimate: 18.

julia> @benchmark cor($foo, dims=2)
BenchmarkTools.Trial: 6062 samples with 1 evaluation.
 Range (min … max):  552.500 μs …   5.785 ms  ┊ GC (min … max): 0.00% … 83.36%
 Time  (median):     874.200 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   820.093 μs ± 454.588 μs  ┊ GC (mean ± σ):  6.21% ±  9.79%   

  █▁    █▆▃▂▂   ▁                                               ▁
  ██▆▄▄▇██████▇▆█▇▆▄▃▃▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▅▆▃▅▄▅▅▅▁▅▄ █
  552 μs        Histogram: log(frequency) by time       3.49 ms <

 Memory estimate: 802.48 KiB, allocs estimate: 16.

On mid-size data, rowcorr2 performs better either on speed and memory.

column-wise

modified code

using LinearAlgebra
function colcorr2(x::AbstractMatrix{T}) where T
  # subtract mean
  x .-= mean(x, dims=1)
  d = sum(abs2, x; dims=1)
  d .= sqrt.(d) # column norms
  if T <: LinearAlgebra.BlasReal
      corr = LinearAlgebra.BLAS.syrk('U', 'T', one(real(T)), x) # upper half of x' * x
      LinearAlgebra.copytri!(corr, 'U', true) # copy upper to lower half
  elseif T <: LinearAlgebra.BlasComplex
      corr = LinearAlgebra.BLAS.herk('U', 'T', one(real(T)), x) # upper half of x' * x
      LinearAlgebra.copytri!(corr, 'U', true) # copy upper to lower half
  else
      corr = x' * x
  end
  corr ./= (d' .* d)
  # note: rather than copytri! above, we could `return Hermitian(corr,:U)`
  return corr
end

julia> foo = rand(2000, 50);

julia> @benchmark colcorr2($foo)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  470.000 μs …  56.154 ms  ┊ GC (min … max): 0.00% … 98.97%   
 Time  (median):     483.900 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   495.929 μs ± 557.303 μs  ┊ GC (mean ± σ):  1.12% ±  0.99%   

     ▂▆██▇▅▅▄▃▂▂▁▁                                              ▂
  █▇▅███████████████▇▇█▇▇▆▆▇▇▇▇█▇▇▆▇▇▇█▇█▆▇▇▆▅▆▆▅▅▆▄▄▅▅▅▄▄▄▃▄▄▃ █
  470 μs        Histogram: log(frequency) by time        596 μs <

 Memory estimate: 21.11 KiB, allocs estimate: 10.

julia> @benchmark cor($foo, dims=1)
BenchmarkTools.Trial: 5956 samples with 1 evaluation.
 Range (min … max):  469.400 μs …  20.202 ms  ┊ GC (min … max): 0.00% … 94.30%   
 Time  (median):     811.250 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   834.909 μs ± 854.873 μs  ┊ GC (mean ± σ):  6.55% ±  6.14%   

  ▄                            ▂▃█▃   ▂
  █▅▃▂▂▂▂▂▂▂▁▁▂▂▁▂▁▂▂▂▂▁▂▁▂▁▁▂▄████▅▄▄██▆▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  469 μs           Histogram: frequency by time         1.13 ms <

 Memory estimate: 802.41 KiB, allocs estimate: 12.

julia> a = rand(50000, 5000);

julia> @benchmark colcorr2($foo)
BenchmarkTools.Trial: 85 samples with 1 evaluation.
 Range (min … max):  52.571 ms … 105.451 ms  ┊ GC (min … max): 0.00% … 19.36%    
 Time  (median):     55.065 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.238 ms ±  10.227 ms  ┊ GC (mean ± σ):  5.61% ±  9.54%    

  █▁ ▂       
  ██▆█▆▆▄▁▃▃▄▃▄▁▃▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▃▃▃▁▄▁▃▁▁▁▁▁▁▁▃▃▁▁▁▃▁▁▁▁▁▁▁▁▃ ▁
  52.6 ms         Histogram: frequency by time           92 ms <

 Memory estimate: 30.55 MiB, allocs estimate: 10.

julia> @benchmark cor($foo, dims=1)
BenchmarkTools.Trial: 87 samples with 1 evaluation.
 Range (min … max):  53.600 ms … 81.556 ms  ┊ GC (min … max): 0.00% … 23.03%
 Time  (median):     54.265 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   57.915 ms ±  8.148 ms  ┊ GC (mean ± σ):  5.56% ±  9.78%     

  ▅█▂        
  ███▆▅▆▁▁▅▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▅▁▁▇▁▁▇▁▅▁▆▅▁▁▅ ▁
  53.6 ms      Histogram: log(frequency) by time      79.8 ms <

 Memory estimate: 31.31 MiB, allocs estimate: 12.

Summary

	data size	Speed	Memory
row-wise	Mid-size data	`rowcorr2`	`rowcorr2`
row-wise	Large-size data	`Statistics.cor`	`rowcorr2`
column-wise	Mid-size data	`colcorr2`	`colcorr2`
column-wise	Large-size data	`Statistics.cor`	`colcorr2`

It appears that on large-size data, Statistics.cor performs slightly better than its corresponding competitor regarding speed in either column-wise or row-wise manner. But, on the mid-size data, the custom functions perform better in terms of speed and memory. It is worth noting that the custom functions are much more efficient regarding allocated memory on all of the settings. Last but not least, although the Statistics.cor appears to be faster on large-size data, it compels a huge amount of memory consumption in comparison with its corresponding custom function, which from my point of view, can be rolled out on balance since the speed difference isn’t that much, but the memory allocation.

Thank you so much
@gdalle
@DNF
@mikmoore
@tbeason