(Blog Post) Optimizing Repeated Correlations

Satvik · August 1, 2024, 6:05pm

I wrote a short post on optimizing repeated correlations using Math (and Julia): https://www.lesswrong.com/posts/AESkD3gafBXx6pm77/optimizing-repeated-correlations

cpfiffer · August 2, 2024, 1:05am

Some additional stuff here: using threads will also help. I personally like ThreadsX.jl. I didn’t use your adjusted version; just adding some extra computational resources helps quite a bit.

using Statistics
using BenchmarkTools
import ThreadsX

n = 1000
a = rand(n);
b = rand(n);
c = rand(n);
xs = [rand(n) for i in 1:10_000];

function raw_correlations(a, b, c, xs)
    return [cor(x, y) for y in [a, b, c] for x in xs]
end

display(@benchmark raw_correlations($a, $b, $c, $xs))

function threaded_correlations(a, b, c, xs)
    ThreadsX.map(xs) do x
        [cor(x, y) for y in [a,b,c]]
    end
end

display(@benchmark threaded_correlations($a, $b, $c, $xs))

This gets me the base timing:


BenchmarkTools.Trial: 284 samples with 1 evaluation.
 Range (min … max):  17.108 ms …  21.202 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     17.510 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.643 ms ± 536.038 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▂▂▄▂█▆██▂ ▁                                                 
  ▆▄███████████▄▄▄▄▃▃▁▄▃▂▂▄▁▃▂▂▁▂▂▃▁▃▁▂▄▃▁▁▁▁▂▁▂▂▂▁▁▁▁▁▁▁▁▁▁▁▂ ▃
  17.1 ms         Histogram: frequency by time           20 ms <

 Memory estimate: 812.81 KiB, allocs estimate: 12.

and the multithreaded version with 16 threads:

BenchmarkTools.Trial: 1552 samples with 1 evaluation.
 Range (min … max):  1.876 ms … 88.282 ms  ┊ GC (min … max): 0.00% … 96.89%
 Time  (median):     2.614 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.208 ms ±  3.527 ms  ┊ GC (mean ± σ):  6.73% ±  6.68%

   ▅▅█▇▃▂▁            ▁                                       
  ████████▇█▇▄▄▄▆▇▆▆█████▆▆▄▁▁▄▁▄▁▄▄▄▆▄▁▁▁▄▄▄▁▁▁▁▁▁▁▁▁▄▁▄▄▁▄ █
  1.88 ms      Histogram: log(frequency) by time     13.3 ms <

 Memory estimate: 2.32 MiB, allocs estimate: 20497.
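For anyone without ThreadsX.jl installed, here is a minimal sketch of the same idea using only `Base.Threads`. The function name `threads_correlations` and the 3 × N output layout are assumptions for illustration, not part of the benchmark above, and no timing is claimed for it. The thread count is fixed at startup (for example `julia -t 16`), so with a single thread this runs serially.

```julia
using Statistics
using Base.Threads: @threads, nthreads

# Hypothetical helper: fill a preallocated 3 × length(xs) matrix.
# Column j holds cor(xs[j], y) for y in (a, b, c).
function threads_correlations(a, b, c, xs)
    ys = (a, b, c)                      # tuple avoids allocating a vector per iteration
    out = Matrix{Float64}(undef, length(ys), length(xs))
    @threads for j in eachindex(xs)     # each thread writes to its own columns
        for (i, y) in enumerate(ys)
            out[i, j] = cor(xs[j], y)
        end
    end
    return out
end

n = 1000
a, b, c = rand(n), rand(n), rand(n)
xs = [rand(n) for _ in 1:100]

@show nthreads()                        # 1 unless Julia was started with -t or JULIA_NUM_THREADS
out = threads_correlations(a, b, c, xs)
```

Writing into a preallocated matrix avoids the small per-`x` vectors that the `ThreadsX.map` version allocates.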

I suspect there’s also a clever linear algebra trick here, but my naive use of the matrix form of cor ends up being relatively costly.

Linear algebra code:

A = [a b c]
X = reduce(hcat, xs)
linalg_cor = vec(cor(A, X)')

The times for all three methods are:

Original: 16ms
Threaded: 1.8ms
Matrix cor (linear algebra): 40ms

You could probably compose these with your z-score method as well for additional improvements.
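A rough sketch of what composing the z-score idea with the matrix form might look like. It relies on the identity that the Pearson correlation of two length-n vectors equals the dot product of their z-scores (computed with the sample standard deviation) divided by n - 1. The helper name `zscore_columns` is an assumption for illustration, and no timing is claimed here. The possible benefit is that the z-scored reference matrix can be computed once and reused.

```julia
using Statistics

# Hypothetical helper: z-score every column of a matrix
# (sample standard deviation, the same convention `cor` uses).
zscore_columns(M) = (M .- mean(M; dims=1)) ./ std(M; dims=1)

n = 1000
a, b, c = rand(n), rand(n), rand(n)
xs = [rand(n) for _ in 1:200]

A = [a b c]                 # n × 3
X = reduce(hcat, xs)        # n × 200

ZA = zscore_columns(A)      # can be computed once and reused for any new X
ZX = zscore_columns(X)

# One matrix multiply gives all correlations at once: a 3 × 200 matrix.
C = (ZA' * ZX) ./ (n - 1)

# Same ordering as [cor(x, y) for y in [a, b, c] for x in xs]
zscore_cor = vec(C')
```

Whether this beats the plain loop will depend on the sizes involved and on the BLAS thread settings, so it is worth benchmarking on the real data.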

abraemer · August 2, 2024, 9:46am

I don’t think your two versions compute the same thing.

The z-score version needs to be:

julia> correlations2 = [zscores(x)'*y for y in [za, zb, zc] for x in xs];
julia> sum(abs2, correlations .- correlations2) 
8.499807095096476e-30

But then it’s actually slower.

julia> @btime correlations2 = [zscores(x)'*y for y in [$za, $zb, $zc] for x in $xs];
  76.307 ms (60012 allocations: 465.88 MiB)
julia> @btime correlations = [cor(x, y) for y in [$a, $b, $c] for x in $xs];
  19.886 ms (12 allocations: 812.81 KiB)

We could use a non-allocating zscores!, but it is still slower:

function zscores!(x)
    μ = mean(x)
    σ = std(x; mean=μ)
    @. x = (x - μ)/σ
    x
end
julia> @btime correlations2 = [zscores!(x)'*y for y in [$za, $zb, $zc] for x in $xs];
  29.185 ms (12 allocations: 812.81 KiB)
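One caveat with the in-place `zscores!` is that it overwrites the vectors in `xs`, so the inputs are no longer the raw data after the first run. A minimal sketch of a variant that stays non-allocating per iteration but leaves the inputs alone, by writing into a reusable scratch buffer. The names `zscores_into!` and `buffered_correlations` are hypothetical, and no timing is claimed.

```julia
using Statistics
using LinearAlgebra: dot

# Hypothetical helper: write the z-scores of `x` into `buf`, leaving `x` untouched.
function zscores_into!(buf, x)
    μ = mean(x)
    σ = std(x; mean=μ)
    @. buf = (x - μ) / σ
    return buf
end

function buffered_correlations(xs, a, b, c)
    n = length(a)
    # Z-score the reference vectors once, pre-scaled by 1/(n - 1),
    # so a plain dot product with the z-scores of x is the correlation.
    zrefs = [zscores_into!(similar(y), y) ./ (n - 1) for y in (a, b, c)]
    buf = similar(first(xs))            # one scratch vector reused for every x
    out = Matrix{Float64}(undef, length(zrefs), length(xs))
    for (j, x) in enumerate(xs)
        zx = zscores_into!(buf, x)      # z-score x once per iteration
        for (i, zy) in enumerate(zrefs)
            out[i, j] = dot(zx, zy)
        end
    end
    return out
end

n = 1000
a, b, c = rand(n), rand(n), rand(n)
xs = [rand(n) for _ in 1:100]
xs_copy = deepcopy(xs)                  # kept only to check that xs is not modified

out = buffered_correlations(xs, a, b, c)
```

Row i of `out` holds the correlations against the i-th reference vector, and `xs` still equals `xs_copy` afterwards.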

Satvik · August 2, 2024, 2:54pm

Thanks for the correction!

Note that your version calculates zscores(x) three times per loop iteration. If I use a version that calculates it once, together with the non-allocating zscores!, it’s still slightly slower.

a_length = 10_000
a = rand(a_length)
b = rand(a_length)
c = rand(a_length)
xs = [rand(a_length) for i in 1:10_000]

function get_correlations1(xs, a, b, c)
    return [[cor(x, y) for y in [a, b, c]] for x in xs]
end

@btime correlations = get_correlations1($xs, $a, $b, $c)
  382.133 ms (20002 allocations: 1.60 MiB)

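function get_correlations3(xs, a, b, c)
    la = length(a) - 1
    # zscores! overwrites a, b, c in place; dividing by (n - 1) here
    # means zx' * y below is already the correlation
    za, zb, zc = zscores!.([a, b, c]) ./ la
    output = Vector{Float64}[]
    for x in xs
        zx = zscores!(x)  # z-score x once per iteration (also overwrites x)
        push!(output, [zx' * y for y in [za, zb, zc]])
    end
    return output
end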

@btime correlations3 = get_correlations3($xs, $a, $b, $c)
  425.309 ms (60026 allocations: 766.16 MiB)

I’m very confused, because I have a production use case, which I just tested again, where the zscores version is much, much faster. So I need to figure out what the difference is. In the meantime, I’ve taken down the post.

Satvik · August 3, 2024, 4:27pm

I’ve now fixed this bug, and the current version of my code runs ~33% faster than the original version – much less exciting. Still not sure why I’m getting a much larger effect in production – my best guess is that there’s something efficient about the dot product on hundreds of millions of rows that doesn’t really show up on smaller datasets.

pitsianis · August 3, 2024, 7:58pm

<shameless_self_promotion>

If you read this thread, you might be interested in the package FastLocalCorrelationCoefficients.jl.

Notable characteristics:

supports any number of dimensions (shapes)
uses the fast Fourier transform for high performance
allows precomputation of common calculations to locate multiple “needles” of the same size/shape in a “haystack”
utilizes multithreading and distributed processing.

Help us make FastLocalCorrelationCoefficients.jl better.
