For example, the following function is 20x faster than your original code on my machine (thanks in large part to the LoopVectorization.jl package), is arguably clearer (it stays much closer to your original formula), and performs no allocations:
using LoopVectorization

function m(h)
    nn, nt = size(h)
    s = zero(eltype(h))
    # squared differences between the first and last columns
    @turbo for i = 1:nn
        s += (h[i,1] - h[i,nt])^2
    end
    # squared differences between consecutive columns
    @turbo for t = 2:nt, i = 1:nn
        s += (h[i,t] - h[i,t-1])^2
    end
    return s / nn
end
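To make the benchmark below self-contained you need some data; any dense matrix will do, and the sizes here are just placeholders, not the dimensions of your problem:

h = randn(1000, 500)  # placeholder matrix (rows = nn, columns = nt); substitute your own data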
Benchmark with:
using BenchmarkTools
@btime m($h);
(Even if you remove LoopVectorization’s @turbo annotation, it is still more than 2x faster than your original code, and still allocates no memory.)
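If you'd rather not take on the LoopVectorization dependency at all, the same loops work with @turbo simply deleted; the sketch below additionally adds @inbounds/@simd, which is my own tweak and not what the timing above measured:

function m_plain(h)
    nn, nt = size(h)
    s = zero(eltype(h))
    # squared differences between the first and last columns
    @inbounds @simd for i = 1:nn
        s += (h[i,1] - h[i,nt])^2
    end
    # squared differences between consecutive columns
    @inbounds for t = 2:nt
        @simd for i = 1:nn
            s += (h[i,t] - h[i,t-1])^2
        end
    end
    return s / nn
end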