Given a fixed integer N, and some arrays \{ x_{i} , y_{i} \}_{i}, I am interested in (repeatedly) computing expressions of the form

\dot{x}_{i} = - \sum_{j=1}^{N} (y_{i} \!-\! y_{j})/d_{ij}; \;\; \dot{y}_{i} = \sum_{j=1}^{N} (x_{i} \!-\! x_{j})/d_{ij}\; with \; d_{ij} = (x_{i} \!-\! x_{j})^{2} \!+\! (y_{i} \!-\! y_{j})^{2} \!+\! \epsilon .

Here is the best code I could come up with

```
using BenchmarkTools
const N = 512; const EPS = 0.1
const X, Y = rand(N), rand(N)
const X_DOT, Y_DOT = zeros(Float64,N), zeros(Float64,N)
##################################################
@fastmath function XY_DOT!()
fill!(X_DOT,0.0); fill!(Y_DOT,0.0)
@inbounds for i=1:N
x_i, y_i = X[i], Y[i]
acc_x, acc_y = 0.0, 0.0
@inbounds for j=(i+1):N
x_j, y_j = X[j], Y[j]
diff_x = x_i - x_j; diff_y = y_i - y_j
inv_d = 1.0/(diff_x^2 + diff_y^2 + EPS)
dxdt = -diff_y * inv_d; dydt = diff_x * inv_d
acc_x += dxdt; acc_y += dydt
X_DOT[j] -= dxdt; Y_DOT[j] -= dydt
end
X_DOT[i] += acc_x; Y_DOT[i] += acc_y
end
end
```

When benchmarked on my laptop, this gives

```
@btime XY_DOT!()
--> 260.093 ΞΌs (0 allocations: 0 bytes)
```

I tried various approaches to try and further accelerate the code, in particular (i) moving around the affectations using a βsquareβ domain of summation; or (ii) using libraries like `LoopVectorization.jl`

, `MuladdMacro.jl`

. But to no avail. At this stage, I am also not keen on multi-threading the code (using `@batch`

from `Polyester.jl`

) to rather perform independent simulations in parallel.

Can this serial code be further sped-up?