Fast offset matrix subtraction

I have a matrix h \in \mathbb{R}^{n_n \times n_t} where n_n is the number of sensors and n_t is the number of hours in a day. I want to understand the variation between each timestamp for each sensor (note times roll over between days), so I have:
\frac{1}{n_{n}} (\sum_{i=1}^{n_n} (h_i^1-h_i^{n_t})^2+\sum_{t=2}^{n_t}\sum_{i=1}^{n_n} (h_i^t-h_i^{t-1})^2)

I’ve tried a few implementations with the two fastest being:

nn = 9000
nt = 24
h = rand(nn, nt)
m1() = sum((h .- hcat(h, h[:, 1])[:, 2:end]) .^ 2) / nn
m2() = sum((vec(h) - [h[nn+1:end]; h[1:nn]]) .^ 2) / nn

However, both of these take around 280 \mu s and use ~5MiB of memory through ~20 allocations according to benchmark tools. I appreciate these numbers are already pretty good, but as I'm going to be running this calculation millions of times, I thought I'd ask: are there any ways I can improve on my current performance? I'm somewhat new to Julia and this is the first piece of performance-sensitive code I've written, so this is as much about the learning as it is about the performance for me.

edit 1: I should point out this code will act as an objective function for an NLP problem in JuMP.
edit 2: typo in equation

Read the performance tips:

  1. Don’t use global variables. (Your code is in a function, but your variables are globals — pass them as arguments.)
  2. Think about array allocations. Slices allocate arrays unless you use @views. hcat and [array1; array2] allocate new arrays. Allocating an array just to sum it is wasteful too.
  3. Don’t be afraid of writing your own loops, especially for performance-critical code. Loops are fast in Julia, and are often the easiest way of avoiding extraneous array allocations.
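To make tip 2 concrete, here is a sketch of how `@views` trims the slice copies out of the original `m1` (the function name `m1_views` is illustrative, not from the thread). The views avoid copying the column slices, although the broadcasted difference still allocates one temporary array:

```julia
# Sketch: m1 rewritten with @views so the column slices are views, not copies.
# The wrap-around term (last hour vs. first hour) is handled separately.
function m1_views(h)
    nn, nt = size(h)
    @views begin
        core = sum(abs2, h[:, 1:nt-1] .- h[:, 2:nt])  # adjacent-hour differences
        wrap = sum(abs2, h[:, nt] .- h[:, 1])         # day roll-over term
    end
    return (core + wrap) / nn
end
```

This removes the `hcat` allocation entirely; the fully allocation-free route is the explicit loop in tip 3.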

For example, the following function is 20x faster than your original code on my machine (thanks in a large part to the LoopVectorization.jl package), is arguably clearer (it’s a lot closer to your original formula), and performs no allocations:

using LoopVectorization
function m(h)
    nn, nt = size(h)
    s = zero(eltype(h))
    @turbo for i = 1:nn
        s += (h[i,1] - h[i,nt])^2
    end
    @turbo for t = 2:nt, i = 1:nn
        s += (h[i,t] - h[i,t-1])^2
    end
    return s / nn
end
Benchmark with:

using BenchmarkTools
@btime m($h);

(Even if you remove LoopVectorization’s @turbo annotation, it is still more than 2x faster than your original code, and still allocates no memory.)
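For reference, a plain-`Base` version of that loop (no LoopVectorization dependency) might look like the following sketch; `@inbounds` and `@simd` are standard annotations that skip bounds checks and permit vectorized reduction, and the function name `m_plain` is illustrative:

```julia
# Sketch: the same computation using only Base annotations.
function m_plain(h)
    nn, nt = size(h)
    s = zero(eltype(h))
    @inbounds @simd for i in 1:nn
        s += (h[i, 1] - h[i, nt])^2   # wrap-around term
    end
    @inbounds for t in 2:nt
        @simd for i in 1:nn
            s += (h[i, t] - h[i, t-1])^2
        end
    end
    return s / nn
end
```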


I thought I was doing something wasteful with the slicing but didn’t think of @views.
I originally implemented it as a loop very similar to this, but mine was far slower. In my actual implementation I'm not using globals; for some strange reason I chose to use them here, and that seems to have been a major bottleneck in my benchmarking. Thanks for your solution, that's a huge speed improvement. I wasn't aware of the LoopVectorization.jl package, I'll have a read into it.

(Maybe your loops were in the wrong order? Or maybe you initialized the sum to s = 0 instead of using zero(eltype(h)), which makes the accumulator type-unstable? For explicit loops, it's also especially critical to avoid globals.)

I paid attention to the loop order, but I did initialise as s = 0; I wasn't aware that could be an issue. I think I need to re-read the performance tips.