# Accelerating an N^2 nested loop

Given a fixed integer $N$ and arrays $\{x_i, y_i\}_{i=1}^N$, I am interested in (repeatedly) computing expressions of the form

$$
\dot{x}_i = -\sum_{j=1}^{N} \frac{y_i - y_j}{d_{ij}}, \qquad
\dot{y}_i = \sum_{j=1}^{N} \frac{x_i - x_j}{d_{ij}}, \qquad
d_{ij} = (x_i - x_j)^2 + (y_i - y_j)^2 + \epsilon .
$$
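Spelled out as code, a direct transcription of these formulas is the quadratic double loop below (a plain reference version, useful for checking correctness; `XY_DOT_REF!` is just an illustrative name):

```julia
# Literal transcription of the formulas above: O(N^2) over the full (i, j)
# square. The j == i term vanishes on its own, since both numerators are
# zero while d_ii = EPS > 0.
function XY_DOT_REF!(X, Y, X_DOT, Y_DOT, EPS)
    for i in eachindex(X)
        x_dot = 0.0
        y_dot = 0.0
        for j in eachindex(X)
            d = (X[i] - X[j])^2 + (Y[i] - Y[j])^2 + EPS
            x_dot -= (Y[i] - Y[j]) / d
            y_dot += (X[i] - X[j]) / d
        end
        X_DOT[i] = x_dot
        Y_DOT[i] = y_dot
    end
    return nothing
end
```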

Here is the best code I could come up with:

```julia
using BenchmarkTools

const N = 512
const EPS = 0.1
const X, Y = rand(N), rand(N)
const X_DOT, Y_DOT = zeros(Float64, N), zeros(Float64, N)

@fastmath function XY_DOT!()
    fill!(X_DOT, 0.0); fill!(Y_DOT, 0.0)
    @inbounds for i = 1:N
        x_i, y_i = X[i], Y[i]
        acc_x, acc_y = 0.0, 0.0
        # Exploit the antisymmetry of the interaction: each (i, j) pair is
        # visited once, and its contribution is scattered to both i and j.
        @inbounds for j = (i+1):N
            x_j, y_j = X[j], Y[j]
            diff_x = x_i - x_j; diff_y = y_i - y_j
            inv_d = 1.0 / (diff_x^2 + diff_y^2 + EPS)
            dxdt = -diff_y * inv_d; dydt = diff_x * inv_d
            acc_x += dxdt; acc_y += dydt
            X_DOT[j] -= dxdt; Y_DOT[j] -= dydt
        end
        X_DOT[i] += acc_x; Y_DOT[i] += acc_y
    end
end
```


When benchmarked on my laptop, this gives:

```julia
julia> @btime XY_DOT!()
  260.093 μs (0 allocations: 0 bytes)
```


I tried various approaches to accelerate the code further, in particular (i) rearranging the assignments by summing over a "square" (i, j) domain rather than the triangular one, and (ii) using libraries such as LoopVectorization.jl and MuladdMacro.jl, but to no avail. At this stage, I would also rather not multi-thread the code (e.g. with @batch from Polyester.jl), since I prefer to run independent simulations in parallel instead. A sketch of what I mean by (i) and (ii) follows.
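Roughly, variant (i) combined with (ii) has the following shape (a sketch only, not necessarily the exact code I benchmarked; `XY_DOT_SQ!` is an illustrative name). Summing over the full square domain doubles the arithmetic, but the inner loop then only accumulates into scalars, with no scatter writes into `X_DOT[j]`, and the j = i self-term vanishes on its own because both numerators are zero while d_ii = ε > 0:

```julia
using LoopVectorization

# Sketch: full square (i, j) summation domain, which is the SIMD-friendly
# shape for @turbo because the inner loop carries only scalar accumulators.
function XY_DOT_SQ!(X, Y, X_DOT, Y_DOT, EPS)
    @turbo for i in eachindex(X)
        acc_x = zero(eltype(X))
        acc_y = zero(eltype(X))
        for j in eachindex(X)
            diff_x = X[i] - X[j]
            diff_y = Y[i] - Y[j]
            inv_d = inv(diff_x^2 + diff_y^2 + EPS)
            acc_x -= diff_y * inv_d  # the j == i term contributes zero
            acc_y += diff_x * inv_d
        end
        X_DOT[i] = acc_x
        Y_DOT[i] = acc_y
    end
    return nothing
end
```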

Can this serial code be sped up further?


---

I can't.

But at least I can perhaps suggest some stylistic differences. The following version has the same speed, without requiring a constant N or @inbounds (and it passes the variables as parameters, which is usually easier for debugging):

```julia
function XY_DOT2!(X, Y, X_DOT, Y_DOT, EPS)
    fill!(X_DOT, 0.0); fill!(Y_DOT, 0.0)
    for i in eachindex(X, Y, X_DOT, Y_DOT)
        x_i, y_i = X[i], Y[i]
        acc_x, acc_y = 0.0, 0.0
        @fastmath for j = (i+1):lastindex(X)
            x_j, y_j = X[j], Y[j]
            diff_x = x_i - x_j; diff_y = y_i - y_j
            inv_d = 1.0 / (diff_x^2 + diff_y^2 + EPS)
            dxdt = -diff_y * inv_d; dydt = diff_x * inv_d
            acc_x += dxdt; acc_y += dydt
            X_DOT[j] -= dxdt; Y_DOT[j] -= dydt
        end
        X_DOT[i] += acc_x; Y_DOT[i] += acc_y
    end
end
```

```julia
julia> @btime XY_DOT2!($X, $Y, $X_DOT, $Y_DOT, $EPS)
  65.051 μs (0 allocations: 0 bytes)

julia> @btime XY_DOT!()
  64.719 μs (0 allocations: 0 bytes)
```

(I edited the post because I had copied the wrong code there.)

Another possible suggestion (not performance-wise) would be to use static arrays all the way. That would simplify the code, although here it turned out to be somewhat slower, and I'm not sure why (one should not expect that; there might be some difference I'm not seeing):

```julia
function XY_DOT_SV!(P::Vector{T}, XY_DOT::Vector{T}, EPS) where T<:SVector
    fill!(XY_DOT, zero(T))
    @inbounds for i in eachindex(P, XY_DOT)
        p_i = P[i]
        acc = zero(T)
        @fastmath for j = (i+1):lastindex(P)
            diff = P[j] - p_i
            invd = one(eltype(diff)) / (diff[1]^2 + diff[2]^2 + EPS)
            dpdt = diff * invd
            acc += dpdt
            XY_DOT[j] += dpdt
        end
        XY_DOT[i] -= acc
    end
    return XY_DOT
end
```

```julia
julia> @btime XY_DOT_SV!($P, $XYDOT, $EPS)
  116.680 μs (0 allocations: 0 bytes)
```


One advantage of this approach is that you would be able to run it in other dimensions just as easily:

```julia
julia> XYDOT = zeros(SVector{3,Float64}, N);

julia> P = rand(SVector{3,Float64}, N);

julia> @btime XY_DOT_SV!($P, $XYDOT, $EPS)
  175.881 μs (0 allocations: 0 bytes)
```

(This last code must be checked for correctness; in any case, I didn't test it.)

---

You can make some improvements, as @lmiq suggested, and also make the code work with different types (like Float32):

```julia
function XY_DOT3!(X, Y, X_DOT, Y_DOT, EPS)
    el_zero = zero(eltype(X))
    fill!(X_DOT, el_zero); fill!(Y_DOT, el_zero)
    @inbounds for i in eachindex(X, Y, X_DOT, Y_DOT)
        x_i, y_i = X[i], Y[i]
        acc_x, acc_y = el_zero, el_zero
        @fastmath for j = (i+1):lastindex(X)
            x_j, y_j = X[j], Y[j]
            diff_x = x_i - x_j; diff_y = y_j - y_i
            inv_d = inv(diff_x * diff_x + diff_y * diff_y + EPS)
            dxdt = diff_y * inv_d; dydt = diff_x * inv_d
            acc_x += dxdt; acc_y += dydt
            X_DOT[j] -= dxdt; Y_DOT[j] -= dydt
        end
        X_DOT[i] += acc_x; Y_DOT[i] += acc_y
    end
end
```

I also changed `diff_y` to be `y_j - y_i` and removed the `-` later on, and I switched to `inv` and `*` instead of `1.0/` and `^`, although this had very little effect on performance; you don't really need these changes.

For benchmarking:

```julia
T = Float32
N = 512; EPS = convert(T, 0.1)
X, Y = rand(T, N), rand(T, N)
X_DOT, Y_DOT = zeros(T, N), zeros(T, N)
```

```julia
julia> using BenchmarkTools

julia> @benchmark XY_DOT2!($X, $Y, $X_DOT, $Y_DOT, $EPS)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  219.100 μs … 318.600 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     219.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   220.527 μs ±   6.480 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram: 219 μs … 270 μs, log(frequency) by time]

 Memory estimate: 0 bytes, allocs estimate: 0.
```

```julia
julia> @benchmark XY_DOT3!($X, $Y, $X_DOT, $Y_DOT, $EPS)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  81.600 μs … 162.500 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     82.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   82.303 μs ±   3.842 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 [histogram: 81.6 μs … 87.8 μs, frequency by time]

 Memory estimate: 0 bytes, allocs estimate: 0.
```

The performance is essentially the same when using Float64 instead, but since you are using @fastmath you may not need the precision of Float64.

---

I used to believe that static arrays were only advised for sufficiently small arrays, say of length <= 50?

Thank you very much for all these elements, this is really helpful. Coming from a Fortran background, I do tend to over-use global constant objects. For example, am I correct in understanding that passing the variables as parameters allows for more versatile/generic code, but plays no role regarding performance?

---

This is an array of small arrays:

```julia
julia> using StaticArrays

julia> zeros(SVector{2,Float64}, 3)
3-element Vector{SVector{2, Float64}}:
 [0.0, 0.0]
 [0.0, 0.0]
 [0.0, 0.0]
```

It is usually the more convenient way to represent sets of coordinates.

Yes, and it is much easier to optimize and debug, because the function is self-contained (a good practice in Fortran too).

---

Thank you very much for these suggestions. Unfortunately, using a floating-point representation with a typical error of 1e-7 will bring, I fear, too large errors down the road. Using @fastmath with Float64 only degraded the signal by 1e-15, which is acceptable for my particular purpose. In practice, my concern regarding performance is more about: (i) should I iterate over a triangular (i, j) domain or rather a square one; (ii) is it possible to reduce the number of assignments; (iii) are there better memory layouts to use for this calculation; (iv) is there a more efficient way to compute the "interaction kernel" `inv(diff_x*diff_x + diff_y*diff_y + EPS)`; etc.?

---

What you are doing there is fine in all these aspects. I don't think you can get much faster than that if you need to run over all pairs of particles, and without parallelization. If there is a cutoff beyond which interactions can be ignored, then it is another story.

---

Is it obvious, a priori, that an array of (tiny) static arrays is better (performance-wise) than a simple 2D array of the form `zeros(Float64, 2, N)`?

---

No, they are exactly the same. It is just convenient in terms of writing the operations between points.

---

Vector{SVector} can be faster because it tells the compiler the size of the inner dimension, which can allow it to unroll things differently.

---

`inv` is great, because it's more type-generic and more concise. But `diff_x * diff_x` is not an improvement over `diff_x^2`, since `x^2` gets transformed to `x*x`:

```julia
julia> foo(x) = x^2;

julia> @code_llvm debuginfo=:none foo(1.0)
; Function Attrs: uwtable
define double @julia_foo_548(double %0) #0 {
top:
  %1 = fmul double %0, %0
  ret double %1
}
```

In other words, `x^2`, being more readable and identically performant, is preferable.

---

I generally don't expect it to be faster, but identical. Yet, in this particular case, the SVector version is turning out to be slower (~30%), and I don't understand why. Do you see the difference? Here is copy/paste-runnable code for both versions; just run `main()`.
I get:

```julia
julia> main()
  66.126 μs (0 allocations: 0 bytes)
nothing
  101.776 μs (0 allocations: 0 bytes)
nothing
```

```julia
import Pkg
Pkg.activate(; temp = true)
Pkg.add(["StaticArrays", "BenchmarkTools"])

using Test
using StaticArrays
using BenchmarkTools

# With simple 1D arrays
function XYDOT2!(X, Y, X_DOT, Y_DOT, EPS)
    fill!(X_DOT, 0.0); fill!(Y_DOT, 0.0)
    @inbounds for i in eachindex(X, Y, X_DOT, Y_DOT)
        x_i, y_i = X[i], Y[i]
        acc_x, acc_y = 0.0, 0.0
        @fastmath for j = (i+1):lastindex(X)
            x_j, y_j = X[j], Y[j]
            diff_x = x_i - x_j; diff_y = y_i - y_j
            inv_d = inv(diff_x^2 + diff_y^2 + EPS)
            dxdt = diff_x * inv_d; dydt = diff_y * inv_d
            acc_x += dxdt; acc_y += dydt
            X_DOT[j] -= dxdt; Y_DOT[j] -= dydt
        end
        X_DOT[i] += acc_x; Y_DOT[i] += acc_y
    end
    return nothing
end

# With vectors of SVectors
function XYDOT_SV!(P::Vector{T}, XYDOT::Vector{T}, EPS) where T<:SVector
    fill!(XYDOT, zero(T))
    @inbounds for i in eachindex(P, XYDOT)
        p_i = P[i]
        acc = zero(T)
        @fastmath for j = (i+1):lastindex(P)
            diff = p_i - P[j]
            invd = inv(diff[1]^2 + diff[2]^2 + EPS)
            dpdt = diff * invd
            acc += dpdt
            XYDOT[j] -= dpdt
        end
        XYDOT[i] += acc
    end
    return nothing
end

function main()
    N = 512
    EPS = 0.1
    P = rand(SVector{2,Float64}, N)
    XYDOT = similar(P)
    X = [x[1] for x in P]; Y = [x[2] for x in P]
    X_DOT = similar(X); Y_DOT = similar(Y)
    XYDOT2!(X, Y, X_DOT, Y_DOT, EPS)
    XYDOT_SV!(P, XYDOT, EPS)
    # Check that both versions agree before timing them.
    @test all(isapprox.([SVector(x, y) for (x, y) in zip(X_DOT, Y_DOT)], XYDOT))
    display(@btime XYDOT2!($X, $Y, $X_DOT, $Y_DOT, $EPS))
    display(@btime XYDOT_SV!($P, $XYDOT, $EPS))
    return nothing
end
```
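As an aside on the memory-layout question above: a `Vector{SVector{2,Float64}}` and a 2×N `Matrix{Float64}` store their elements identically, so one layout can be viewed as the other without copying, using Base's `reinterpret(reshape, ...)` (a minimal sketch; the variable names are illustrative):

```julia
using StaticArrays

P = rand(SVector{2,Float64}, 512)     # vector of static 2-vectors

M = reinterpret(reshape, Float64, P)  # 2×512 matrix *view* of the same memory
@assert M[1, 1] == P[1][1]

M[2, 1] = 0.0                         # writes through to P[1][2]
@assert P[1][2] == 0.0
```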