I wrote a short post on optimizing repeated correlations using Math (and Julia): https://www.lesswrong.com/posts/AESkD3gafBXx6pm77/optimizing-repeated-correlations

Some additional notes: using threads will also help. I personally like ThreadsX.jl. I didn't use your adjusted version; just throwing some extra computational resources at the problem helps quite a bit. (Note that Julia has to be started with multiple threads, e.g. `julia --threads 16`, for this to pay off.)

```
using Statistics
using BenchmarkTools
import ThreadsX

n = 1000
a = rand(n);
b = rand(n);
c = rand(n);
xs = [rand(n) for i in 1:10_000];

function raw_correlations(a, b, c, xs)
    return [cor(x, y) for y in [a, b, c] for x in xs]
end
display(@benchmark raw_correlations($a, $b, $c, $xs))

function threaded_correlations(a, b, c, xs)
    ThreadsX.map(xs) do x
        [cor(x, y) for y in [a, b, c]]
    end
end
display(@benchmark threaded_correlations($a, $b, $c, $xs))
```
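One thing worth noting when comparing the two: `raw_correlations` returns a flat 30,000-element vector grouped by `y`, while `threaded_correlations` returns one 3-element vector per `x`. A quick sanity check that the two layouts carry the same numbers (using a plain `map`, which groups results the same way as `ThreadsX.map`, so it runs with no extra dependencies and smaller sizes):

```julia
using Statistics

n = 1000
a, b, c = rand(n), rand(n), rand(n)
xs = [rand(n) for _ in 1:100]  # smaller than the benchmark, just for the check

flat   = [cor(x, y) for y in [a, b, c] for x in xs]     # flat, grouped by y
nested = map(x -> [cor(x, y) for y in [a, b, c]], xs)   # one 3-vector per x

# reduce(hcat, nested) is 3×100 (one column per x); transposing and then
# flattening column-major yields all cor(x, a), then cor(x, b), then cor(x, c)
regrouped = vec(reduce(hcat, nested)')
@assert flat ≈ regrouped
```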

This gets me the base timing:

```
BenchmarkTools.Trial: 284 samples with 1 evaluation.
 Range (min … max):  17.108 ms … 21.202 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     17.510 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.643 ms ± 536.038 μs ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram omitted]
  17.1 ms         Histogram: frequency by time          20 ms <

 Memory estimate: 812.81 KiB, allocs estimate: 12.
```

and the multithreaded version with 16 threads:

```
BenchmarkTools.Trial: 1552 samples with 1 evaluation.
 Range (min … max):  1.876 ms … 88.282 ms  ┊ GC (min … max): 0.00% … 96.89%
 Time  (median):     2.614 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.208 ms ±  3.527 ms  ┊ GC (mean ± σ):  6.73% ± 6.68%

 [histogram omitted]
  1.88 ms      Histogram: log(frequency) by time      13.3 ms <

 Memory estimate: 2.32 MiB, allocs estimate: 20497.
```

I suspect there's also a clever linear algebra trick here, but my naive use of `cor` alone ends up being relatively costly.

Linear algebra version:

```
A = [a b c]                   # n × 3
X = reduce(hcat, xs)          # n × 10_000
linalg_cor = vec(cor(A, X)')  # pairwise column correlations, flattened
```
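This works because `cor(A, X)` on two matrices computes the Pearson correlation between each column of `A` and each column of `X`, so the transpose-then-`vec` recovers the same ordering as the original comprehension. A small check of that claim:

```julia
using Statistics

n = 100
a, b, c = rand(n), rand(n), rand(n)
xs = [rand(n) for _ in 1:5]

A = [a b c]                   # n × 3
X = reduce(hcat, xs)          # n × 5
linalg_cor = vec(cor(A, X)')  # cor(A, X) is 3 × 5; transposed and flattened

# Same ordering as the comprehension: all y = a, then y = b, then y = c
@assert linalg_cor ≈ [cor(x, y) for y in [a, b, c] for x in xs]
```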

The times for all three methods are:

- Original: 16 ms
- Threaded: 1.8 ms
- Linear algebra `cor`: 40 ms

You could probably compose these with your z-score method as well for additional improvements.

I don't think your two versions compute the same thing.

This needs to be:

```
julia> correlations2 = [zscores(x)'*y for y in [za, zb, zc] for x in xs];
julia> sum(abs2, correlations .- correlations2)
8.499807095096476e-30
```
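The identity this relies on is `cor(x, y) == zscores(x)' * zscores(y) / (n - 1)`, so presumably `za`, `zb`, and `zc` above have the `1/(n - 1)` factor folded in. A minimal sketch of `zscores`, assuming the usual definition (the original post's version may differ slightly):

```julia
using Statistics

# Hypothetical zscores matching the identity above
zscores(x) = (x .- mean(x)) ./ std(x)

n = 1000
x, y = rand(n), rand(n)
@assert cor(x, y) ≈ zscores(x)' * zscores(y) / (n - 1)
```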

But then it's actually slower.

```
julia> @btime correlations2 = [zscores(x)'*y for y in [$za, $zb, $zc] for x in $xs];
76.307 ms (60012 allocations: 465.88 MiB)
julia> @btime correlations = [cor(x, y) for y in [$a, $b, $c] for x in $xs];
19.886 ms (12 allocations: 812.81 KiB)
```

We could use a non-allocating `zscores!`, but it's still slower:

```
function zscores!(x)
    μ = mean(x)
    σ = std(x; mean = μ)
    @. x = (x - μ) / σ
    return x
end
julia> @btime correlations2 = [zscores!(x)'*y for y in [$za, $zb, $zc] for x in $xs];
29.185 ms (12 allocations: 812.81 KiB)
```
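One caveat with the in-place version: `zscores!` overwrites its argument, so after a benchmark run `xs` (and anything passed as `x`) has already been standardized, and re-running the comparison on the same arrays quietly operates on transformed data. A small illustration, using the `zscores!` above:

```julia
using Statistics

function zscores!(x)
    μ = mean(x)
    σ = std(x; mean = μ)
    @. x = (x - μ) / σ
    return x
end

x = rand(100)
x_orig = copy(x)
zscores!(x)

@assert x != x_orig          # the original data is gone
@assert abs(mean(x)) < 1e-8  # already standardized...
@assert std(x) ≈ 1           # ...so a second call would barely change it
```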

Thanks for the correction!

Note that your version is calculating `zscores(x)` three times per loop. If I use a version that calculates it once, together with the non-allocating `zscores!`… it's still slightly slower.

```
a_length = 10_000
a = rand(a_length)
b = rand(a_length)
c = rand(a_length)
xs = [rand(a_length) for i in 1:10_000]
function get_correlations1(xs, a, b, c)
    return [[cor(x, y) for y in [a, b, c]] for x in xs]
end
@btime correlations = get_correlations1($xs, $a, $b, $c)
382.133 ms (20002 allocations: 1.60 MiB)

function get_correlations3(xs, a, b, c)
    la = length(a) - 1
    za, zb, zc = zscores!.([a, b, c]) ./ la
    output = Vector{Float64}[]
    for x in xs
        zx = zscores!(x)
        push!(output, [zx' * y for y in [za, zb, zc]])
    end
    return output
end
@btime correlations3 = get_correlations3($xs, $a, $b, $c)
425.309 ms (60026 allocations: 766.16 MiB)
```

I'm very confused, because I have a production use case (which I just re-tested) where the zscores version is much, much faster. I need to figure out what the difference is; in the meantime, I've taken down the post.

I've now fixed this bug, and the current version of my code runs ~33% faster than the original version, which is much less exciting. I'm still not sure why I'm getting a much larger effect in production; my best guess is that there's something efficient about the dot product on hundreds of millions of rows that doesn't really show up on smaller datasets.

`<shameless_self_promotion>`

If you read this thread, you might be interested in the package `FastLocalCorrelationCoefficients.jl`.

Notable characteristics:

- supports any number of dimensions (shapes)
- uses the fast Fourier transform for high performance
- allows precomputation of common calculations to locate multiple "needles" of the same size/shape in a "haystack"
- utilizes multithreading and distributed processing.

Help us make `FastLocalCorrelationCoefficients.jl` better.