(Blog Post) Optimizing Repeated Correlations

I wrote a short post on optimizing repeated correlations using Math (and Julia): https://www.lesswrong.com/posts/AESkD3gafBXx6pm77/optimizing-repeated-correlations
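
For context, the core identity from the post, as a rough self-contained sketch (this zscores is a stand-in and may differ in detail from the version in the post): Pearson correlation is just a dot product of z-scored vectors.

using Statistics

# Pearson correlation as a dot product of z-scored vectors:
# cor(x, y) == zscores(x)' * zscores(y) / (length(x) - 1)
zscores(x) = (x .- mean(x)) ./ std(x)

x, y = rand(1000), rand(1000)
@assert cor(x, y) ≈ zscores(x)' * zscores(y) / (length(x) - 1)

The za, zb, zc that show up later in the thread are precomputed z-scores of a, b, c, with the 1/(n - 1) factor folded in so that each correlation becomes a single dot product.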

Some additional stuff here: using threads will also help. I personally like ThreadsX.jl. I didn't use your adjusted version; just adding some extra computational resources helps quite a bit.

using Statistics
using BenchmarkTools
import ThreadsX

n = 1000
a = rand(n);
b = rand(n);
c = rand(n);
xs = [rand(n) for i in 1:10_000];

function raw_correlations(a, b, c, xs)
    # Flat comprehension: all 30_000 correlations, with y varying slowest
    return [cor(x, y) for y in [a, b, c] for x in xs]
end

display(@benchmark raw_correlations($a, $b, $c, $xs))

function threaded_correlations(a, b, c, xs)
    # Same correlations, parallelized across the xs
    ThreadsX.map(xs) do x
        [cor(x, y) for y in [a, b, c]]
    end
end

display(@benchmark threaded_correlations($a, $b, $c, $xs))

This gets me the base timing:

BenchmarkTools.Trial: 284 samples with 1 evaluation.
 Range (min … max):  17.108 ms …  21.202 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     17.510 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   17.643 ms Β± 536.038 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

    β–‚β–‚β–„β–‚β–ˆβ–†β–ˆβ–ˆβ–‚ ▁                                                 
  β–†β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–„β–„β–„β–„β–ƒβ–ƒβ–β–„β–ƒβ–‚β–‚β–„β–β–ƒβ–‚β–‚β–β–‚β–‚β–ƒβ–β–ƒβ–β–‚β–„β–ƒβ–β–β–β–β–‚β–β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–‚ β–ƒ
  17.1 ms         Histogram: frequency by time           20 ms <

 Memory estimate: 812.81 KiB, allocs estimate: 12.

and the multithreaded version with 16 threads:

BenchmarkTools.Trial: 1552 samples with 1 evaluation.
 Range (min … max):  1.876 ms … 88.282 ms  β”Š GC (min … max): 0.00% … 96.89%
 Time  (median):     2.614 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   3.208 ms Β±  3.527 ms  β”Š GC (mean Β± Οƒ):  6.73% Β±  6.68%

   β–…β–…β–ˆβ–‡β–ƒβ–‚β–            ▁                                       
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–„β–„β–„β–†β–‡β–†β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–„β–β–β–„β–β–„β–β–„β–„β–„β–†β–„β–β–β–β–„β–„β–„β–β–β–β–β–β–β–β–β–β–„β–β–„β–„β–β–„ β–ˆ
  1.88 ms      Histogram: log(frequency) by time     13.3 ms <

 Memory estimate: 2.32 MiB, allocs estimate: 20497.
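
For reference, the thread count comes from how Julia is launched rather than from ThreadsX itself; a minimal check (assuming the standard -t flag):

$ julia -t 16    # or set JULIA_NUM_THREADS=16

julia> Threads.nthreads()
16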

I suspect there's also a clever linear algebra trick here, but my naive approach of just calling cor on matrices ends up being relatively costly.

Linear algebra code:

A = [a b c]                   # n × 3 matrix with a, b, c as columns
X = reduce(hcat, xs)          # n × 10_000 matrix with the xs as columns
linalg_cor = vec(cor(A, X)')  # cor(A, X) is 3 × 10_000; transpose + vec flattens it
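
As a quick sanity check (a sketch reusing raw_correlations from above): vec reads the transposed matrix column-major, so all correlations against a come first, matching the for y ... for x comprehension order:

@assert vec(cor(A, X)') ≈ raw_correlations(a, b, c, xs)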

The times for all three methods are:

  • Original: 16 ms
  • Threaded: 1.8 ms
  • Matrix cor: 40 ms

You could probably compose these with your z-score method as well for additional improvements.
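
Something like this rough sketch, say (zscores here is a stand-in for the blog's version, with the 1/(n - 1) factor folded into the precomputed vectors):

using Statistics
import ThreadsX

zscores(x) = (x .- mean(x)) ./ std(x)

function threaded_zscore_correlations(a, b, c, xs)
    m = length(a) - 1
    # Precompute once; dividing by (n - 1) here turns each correlation
    # into a plain dot product against zscores(x).
    za, zb, zc = zscores(a) ./ m, zscores(b) ./ m, zscores(c) ./ m
    ThreadsX.map(xs) do x
        zx = zscores(x)
        [zx' * za, zx' * zb, zx' * zc]
    end
end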

I don't think your two versions compute the same thing.

This needs to be:

julia> correlations2 = [zscores(x)'*y for y in [za, zb, zc] for x in xs];
julia> sum(abs2, correlations .- correlations2) 
8.499807095096476e-30

But then it’s actually slower.

julia> @btime correlations2 = [zscores(x)'*y for y in [$za, $zb, $zc] for x in $xs];
  76.307 ms (60012 allocations: 465.88 MiB)
julia> @btime correlations = [cor(x, y) for y in [$a, $b, $c] for x in $xs];
  19.886 ms (12 allocations: 812.81 KiB)

We could use a non-allocating zscores!, but it is still slower:

function zscores!(x)
    # In-place z-score: overwrite x with (x .- μ) ./ σ
    μ = mean(x)
    σ = std(x; mean=μ)
    @. x = (x - μ)/σ
    x
end
julia> @btime correlations2 = [zscores!(x)'*y for y in [$za, $zb, $zc] for x in $xs];
  29.185 ms (12 allocations: 812.81 KiB)

Thanks for the correction!

Note that your version calculates zscores(x) three times per x. If I use a version that calculates it once, along with the non-allocating zscores!, it's still slightly slower.

a_length = 10_000
a = rand(a_length)
b = rand(a_length)
c = rand(a_length)
xs = [rand(a_length) for i in 1:10_000]

function get_correlations1(xs, a, b, c)
    return [[cor(x, y) for y in [a, b, c]] for x in xs]
end

@btime correlations = get_correlations1($xs, $a, $b, $c)
  382.133 ms (20002 allocations: 1.60 MiB)

function get_correlations3(xs, a, b, c)
    la = length(a) - 1
    # Note: zscores! mutates a, b, c (and each x below) in place;
    # dividing by la folds the 1/(n-1) factor into za, zb, zc.
    za, zb, zc = zscores!.([a, b, c]) ./ la
    output = Vector{Float64}[]
    for x in xs
        zx = zscores!(x)
        push!(output, [zx' * y for y in [za, zb, zc]])
    end
    return output
end

@btime correlations3 = get_correlations3($xs, $a, $b, $c)
  425.309 ms (60026 allocations: 766.16 MiB)

I'm very confused, because I have a production use case where the zscores version is much, much faster, which I just tested again. So I need to figure out what the difference is. In the meantime, I've taken down the post.

I’ve now fixed this bug, and the current version of my code runs ~33% faster than the original version – much less exciting. Still not sure why I’m getting a much larger effect in production – my best guess is that there’s something efficient about the dot product on hundreds of millions of rows that doesn’t really show up on smaller datasets.
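
If that guess is right, one way to see it (a sketch, not the production code): stacking the z-scored xs into a matrix turns the whole batch into a single BLAS matrix multiply, whose advantage tends to grow with the size of the data:

# Reusing za, zb, zc computed as in get_correlations3 (they include 1/(n-1))
Z = reduce(hcat, zscores!.(xs))   # n × 10_000, one z-scored x per column
B = [za zb zc]                    # n × 3
correlations_blas = Z' * B        # 10_000 × 3, one gemm call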

<shameless_self_promotion>

If you read this thread, you might be interested in the package FastLocalCorrelationCoefficients.jl.

Notable characteristics:

  • supports any number of dimensions (shapes)
  • uses the fast Fourier transform for high performance
  • allows precomputation of common calculations to locate multiple "needles" of the same size/shape in a "haystack"
  • utilizes multithreading and distributed processing

Help us make FastLocalCorrelationCoefficients.jl better.
