Fast offset matrix subtraction

I have a matrix h \in \mathbb{R}^{n_n \times n_t} where n_n is the number of sensors and n_t is the number of hours in a day. I want to understand the variation between each timestamp for each sensor (note times roll over between days), so I have:
\frac{1}{n_{n}} (\sum_{i=1}^{n_n} (h_i^1-h_i^{n_t})^2+\sum_{t=2}^{n_t}\sum_{i=1}^{n_n} (h_i^t-h_i^{t-1})^2)

I’ve tried a few implementations with the two fastest being:

nn = 9000
nt = 24
h = rand(nn, nt)
m1() = sum((h .- hcat(h, h[:, 1])[:, 2:end]) .^ 2) / nn
m2() = sum((vec(h) - [h[nn+1:end]; h[1:nn]]) .^ 2) / nn

However, both of these take around 280 \mu s and use ~5MiB of memory through ~20 allocations according to benchmark tools. I appreciate these numbers are already pretty good, but as I'm going to be running this calculation millions of times, I thought I'd ask: are there any ways I can improve on my current performance? I'm somewhat new to Julia and this is the first piece of performance-sensitive code I've written, so this is as much about the learning as it is about the performance for me.

edit 1: I should point out this code will act as an objective function for an NLP problem in JuMP.
edit 2: typo in equation

Read the performance tips:

  1. Don’t use global variables. (Your code is in a function, but your variables are globals — pass them as arguments.)
  2. Think about array allocations. Slices allocate arrays unless you use @views. hcat and [array1; array2] allocate new arrays. Allocating an array just to sum it is wasteful too.
  3. Don’t be afraid of writing your own loops, especially for performance-critical code. Loops are fast in Julia, and are often the easiest way of avoiding extraneous array allocations.
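To make tip 2 concrete, here is a sketch of how `@views` trims the slice copies out of the original `m1` (the function name `m1_views` is illustrative, not from the thread). The views avoid copying the column slices, although the broadcasted difference still allocates one temporary array:

```julia
# Sketch: m1 rewritten with @views so the column slices are views, not copies.
# The wrap-around term (last hour vs. first hour) is handled separately.
function m1_views(h)
    nn, nt = size(h)
    @views begin
        core = sum(abs2, h[:, 1:nt-1] .- h[:, 2:nt])  # adjacent-hour differences
        wrap = sum(abs2, h[:, nt] .- h[:, 1])         # day roll-over term
    end
    return (core + wrap) / nn
end
```

This removes the `hcat` allocation entirely; the fully allocation-free route is the explicit loop in tip 3.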

For example, the following function is 20x faster than your original code on my machine (thanks in a large part to the LoopVectorization.jl package), is arguably clearer (it’s a lot closer to your original formula), and performs no allocations:

using LoopVectorization
function m(h)
    nn, nt = size(h)
    s = zero(eltype(h))
    @turbo for i = 1:nn
        s += (h[i,1] - h[i,nt])^2
    end
    @turbo for t = 2:nt, i = 1:nn
        s += (h[i,t] - h[i,t-1])^2
    end
    return s / nn
end
Benchmark with:

using BenchmarkTools
@btime m($h);

(Even if you remove LoopVectorization’s @turbo annotation, it is still more than 2x faster than your original code, and still allocates no memory.)
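For reference, a plain-`Base` version of that loop (no LoopVectorization dependency) might look like the following sketch; `@inbounds` and `@simd` are standard annotations that skip bounds checks and permit vectorized reduction, and the function name `m_plain` is illustrative:

```julia
# Sketch: the same computation using only Base annotations.
function m_plain(h)
    nn, nt = size(h)
    s = zero(eltype(h))
    @inbounds @simd for i in 1:nn
        s += (h[i, 1] - h[i, nt])^2   # wrap-around term
    end
    @inbounds for t in 2:nt
        @simd for i in 1:nn
            s += (h[i, t] - h[i, t-1])^2
        end
    end
    return s / nn
end
```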


I thought I was doing something wasteful with the slicing but didn’t think of @views.
I originally implemented it as a loop very similar to this, but mine was far slower. In my actual implementation I'm not using globals; for some strange reason I chose to use them here, and that seems to have been a major bottleneck in my benchmarking. Thanks for your solution, that's a huge speed improvement. I wasn't aware of the LoopVectorization.jl package, I'll have a read into it.

(Maybe your loops were in the wrong order? Or maybe you initialized the sum to s = 0 instead of using zero(eltype(h)), which makes the accumulator type-unstable? For explicit loops, it's also especially critical to avoid globals.)

I paid attention to the loop order, but I did initialise as s = 0; I wasn't aware that could be an issue. I think I need to re-read the performance tips.