I have a matrix h \in \mathbb{R}^{n_n \times n_t} where n_n is the number of sensors and n_t is the number of hours in a day. I want to understand the variation between each timestamp for each sensor (note times roll over between days), so I have:
\frac{1}{n_n}\left(\sum_{i=1}^{n_n}\left(h_i^1-h_i^{n_t}\right)^2+\sum_{t=2}^{n_t}\sum_{i=1}^{n_n}\left(h_i^t-h_i^{t-1}\right)^2\right)
I’ve tried a few implementations with the two fastest being:
nn = 9000
nt = 24
h = rand(nn, nt)
m1() = sum((h .- hcat(h, h[:, 1])[:, 2:end]) .^ 2) / nn
m2() = sum((vec(h) - [h[nn+1:end]; h[1:nn]]) .^ 2) / nn
However, both of these take around 280 μs and use ~5 MiB of memory across ~20 allocations according to BenchmarkTools. I appreciate these numbers are already pretty good, but as I'm going to be running this calculation millions of times I thought I'd ask: are there any ways I can improve on my current performance? I'm somewhat new to Julia, and this is the first piece of performance-sensitive code I've written, so this is as much about the learning as it is about the performance for me.
edit 1: I should point out this code will act as an objective function for an NLP problem in JuMP.
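(Since the objective is a sum of squared differences, it is quadratic, so JuMP can accept it directly as an algebraic expression rather than a registered user-defined function. A minimal sketch, where the model, sizes, and the decision-variable matrix `h` are illustrative assumptions, not your actual setup:)

```julia
using JuMP

# Hypothetical model: h as an nn × nt matrix of decision variables,
# with the formula from the question written directly as a quadratic
# objective.
model = Model()
nn, nt = 4, 3
@variable(model, h[1:nn, 1:nt])
@objective(model, Min,
    (sum((h[i, 1] - h[i, nt])^2 for i in 1:nn) +
     sum((h[i, t] - h[i, t-1])^2 for t in 2:nt, i in 1:nn)) / nn)
```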
edit 2: typo in equation
Read the performance tips:
- Don’t use global variables. (Your code is in a function, but your variables are globals — pass them as arguments.)
- Think about array allocations. Slices allocate new arrays unless you use `@views`. `hcat` and `[array1; array2]` allocate new arrays too. Allocating an array just to `sum` it is wasteful as well.
- Don’t be afraid of writing your own loops, especially for performance-critical code. Loops are fast in Julia, and are often the easiest way of avoiding extraneous array allocations.
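(As a sketch of the tips above, assuming `h` is passed in as an argument: generators let `sum` consume the squared differences lazily, so no intermediate arrays are allocated at all.)

```julia
# Sketch: no globals (h is an argument), and generator expressions
# mean sum never materializes an intermediate array.
function m_gen(h)
    nn, nt = size(h)
    s = sum(abs2(h[i, 1] - h[i, nt]) for i in 1:nn)
    s += sum(abs2(h[i, t] - h[i, t-1]) for t in 2:nt, i in 1:nn)
    return s / nn
end
```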
For example, the following function is 20x faster than your original code on my machine (thanks in a large part to the LoopVectorization.jl package), is arguably clearer (it’s a lot closer to your original formula), and performs no allocations:
using LoopVectorization
function m(h)
    nn, nt = size(h)
    s = zero(eltype(h))
    @turbo for i = 1:nn
        s += (h[i, 1] - h[i, nt])^2
    end
    @turbo for t = 2:nt, i = 1:nn
        s += (h[i, t] - h[i, t-1])^2
    end
    return s / nn
end
Benchmark with:
using BenchmarkTools
@btime m($h);
(Even if you remove LoopVectorization's `@turbo` annotation, it is still more than 2x faster than your original code, and still allocates no memory.)
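(For reference, the plain-Base variant alluded to here might look like the following sketch; the `@inbounds`/`@simd` annotations are my assumption, since bare loops work as well:)

```julia
# Sketch: the same two loops without LoopVectorization.
# @inbounds skips bounds checks; @simd allows vectorized reduction
# on the innermost loop. Both are optional.
function m_base(h)
    nn, nt = size(h)
    s = zero(eltype(h))
    @inbounds @simd for i = 1:nn
        s += (h[i, 1] - h[i, nt])^2
    end
    @inbounds for t = 2:nt
        @simd for i = 1:nn
            s += (h[i, t] - h[i, t-1])^2
        end
    end
    return s / nn
end
```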
I thought I was doing something wasteful with the slicing but didn't think of `@views`.
I originally implemented it as a loop very similar to this; however, mine was far slower. In my actual implementation I'm not using globals, but for some strange reason I chose to use them here; this seems to have been a major bottleneck in my benchmarking. Thanks for your solution, that is a huge speed improvement. I wasn't aware of the LoopVectorization package, I'll have a read into it.
Cheers
(Maybe your loops were in the wrong order? Or maybe you initialized the sum to `s = 0` instead of using `zero(eltype(h))`, which is type-unstable? For explicit loops, it's also especially critical to avoid globals.)
I paid attention to the loop order, but I did initialise as `s = 0`; I wasn't aware this could be an issue. I think I need to re-read the performance tips.