Accelerating an N^2 nested loop

Given a fixed integer N, and some arrays \{ x_{i} , y_{i} \}_{i}, I am interested in (repeatedly) computing expressions of the form
\dot{x}_{i} = -\sum_{j=1}^{N} \frac{y_{i} - y_{j}}{d_{ij}}, \qquad \dot{y}_{i} = \sum_{j=1}^{N} \frac{x_{i} - x_{j}}{d_{ij}}, \qquad \text{with} \quad d_{ij} = (x_{i} - x_{j})^{2} + (y_{i} - y_{j})^{2} + \epsilon .
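(For reference, a direct, unoptimized transcription of these sums, handy for checking faster versions against, could look like the sketch below; xy_dot_ref! is an illustrative name, not code from the original post.)

# Direct O(N^2) transcription of the equations above (reference only)
function xy_dot_ref!(X, Y, X_DOT, Y_DOT, epsilon)
    for i in eachindex(X, Y)
        sx, sy = 0.0, 0.0
        for j in eachindex(X, Y)   # the j == i term vanishes, so it can be included
            d = (X[i] - X[j])^2 + (Y[i] - Y[j])^2 + epsilon
            sx -= (Y[i] - Y[j]) / d
            sy += (X[i] - X[j]) / d
        end
        X_DOT[i], Y_DOT[i] = sx, sy
    end
    return nothing
end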

Here is the best code I could come up with:

using BenchmarkTools
const N = 512; const EPS = 0.1
const X, Y = rand(N), rand(N)
const X_DOT, Y_DOT = zeros(Float64,N), zeros(Float64,N)
##################################################
@fastmath function XY_DOT!()
    fill!(X_DOT,0.0); fill!(Y_DOT,0.0)
    @inbounds for i=1:N
        x_i, y_i = X[i], Y[i]
        acc_x, acc_y = 0.0, 0.0
        @inbounds for j=(i+1):N
            x_j, y_j = X[j], Y[j]
            diff_x = x_i - x_j; diff_y = y_i - y_j
            inv_d = 1.0/(diff_x^2 + diff_y^2 + EPS)
            dxdt = -diff_y * inv_d; dydt = diff_x * inv_d
            acc_x += dxdt; acc_y += dydt
            X_DOT[j] -= dxdt; Y_DOT[j] -= dydt
        end
        X_DOT[i] += acc_x; Y_DOT[i] += acc_y
    end
end

When benchmarked on my laptop, this gives

@btime XY_DOT!()
-->   260.093 μs (0 allocations: 0 bytes)

I tried various approaches to further accelerate the code, in particular (i) rearranging the assignments using a “square” domain of summation (see the sketch below); or (ii) using libraries like LoopVectorization.jl and MuladdMacro.jl. But to no avail. At this stage, I would also rather not multi-thread the code (e.g. with @batch from Polyester.jl), preferring instead to run independent simulations in parallel.
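(For concreteness, the “square” variant in (i) corresponds to something like the following sketch, in the same global-constant style as above; each pair is visited twice, but the inner loop is branch-free, and the self-term j = i contributes exactly zero because both numerators vanish. XY_DOT_SQ! is an illustrative name.)

# Sketch of the "square" summation variant: twice the flops, but no scattered
# writes at index j, which can help vectorization
@fastmath function XY_DOT_SQ!()
    @inbounds for i = 1:N
        x_i, y_i = X[i], Y[i]
        acc_x, acc_y = 0.0, 0.0
        for j = 1:N
            diff_x = x_i - X[j]; diff_y = y_i - Y[j]
            inv_d = 1.0/(diff_x^2 + diff_y^2 + EPS)
            acc_x -= diff_y * inv_d
            acc_y += diff_x * inv_d
        end
        X_DOT[i] = acc_x; Y_DOT[i] = acc_y
    end
end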

Can this serial code be further sped-up?

I can’t :slight_smile:

But at least I can perhaps suggest some styling differences. The following version has the same speed, without requiring a constant N or @inbounds (iterating with eachindex lets the compiler elide the bounds checks), and it passes the variables as parameters, which is usually easier for debugging:

julia> ##################################################
       function XY_DOT2!(X,Y,X_DOT,Y_DOT,EPS)
           fill!(X_DOT,0.0); fill!(Y_DOT,0.0)
           for i in eachindex(X,Y,X_DOT,Y_DOT)
               x_i, y_i = X[i], Y[i]
               acc_x, acc_y = 0.0, 0.0
               @fastmath for j=(i+1):lastindex(X)
                   x_j, y_j = X[j], Y[j]
                   diff_x = x_i - x_j; diff_y = y_i - y_j
                   inv_d = 1.0/(diff_x^2 + diff_y^2 + EPS)
                   dxdt = -diff_y * inv_d; dydt = diff_x * inv_d
                   acc_x += dxdt; acc_y += dydt
                   X_DOT[j] -= dxdt; Y_DOT[j] -= dydt
               end
               X_DOT[i] += acc_x; Y_DOT[i] += acc_y
           end
       end
XY_DOT2! (generic function with 1 method)

julia> @btime XY_DOT2!($X,$Y,$X_DOT,$Y_DOT,$EPS)
  65.051 μs (0 allocations: 0 bytes)

julia> @btime XY_DOT!()
  64.719 μs (0 allocations: 0 bytes)

(edited the post because I had copied the wrong code there)

Another possible suggestion (not performance-wise) would be to use static arrays all the way. That would simplify the code, although here it turned out to be somewhat slower, and I’m not sure why (one should not expect that; there might be some difference I’m not seeing):

julia> function XY_DOT_SV!(P::Vector{T}, XY_DOT::Vector{T}, EPS) where T<:SVector
           fill!(XY_DOT, zero(T))
           @inbounds for i in eachindex(P,XY_DOT)
               p_i = P[i] 
               acc = zero(T)
               @fastmath for j=(i+1):lastindex(P)
                   diff = P[j] - p_i
                   invd = one(eltype(diff)) / (sum(abs2, diff) + EPS)  # sums over all components, so it works in any dimension
                   dpdt = diff * invd
                   acc += dpdt
                   XY_DOT[j] += dpdt
               end
               XY_DOT[i] -= acc
           end
           return XY_DOT
       end

julia> @btime XY_DOT_SV!($P, $XYDOT, $EPS)
  116.680 μs (0 allocations: 0 bytes)

One advantage of this approach is that you would be able to run it in other dimensions just as easily:

julia> XYDOT = zeros(SVector{3,Float64},N);

julia> P = rand(SVector{3,Float64},N);

julia> @btime XY_DOT_SV!($P, $XYDOT, $EPS)
  175.881 μs (0 allocations: 0 bytes)

(this last code must be checked for correctness; in any case, I didn’t test it)

You can make some improvements as @lmiq suggested, and also make the code work with different types (like Float32):

function XY_DOT3!(X,Y,X_DOT,Y_DOT,EPS)
    el_zero = zero(eltype(X))
    fill!(X_DOT,el_zero); fill!(Y_DOT,el_zero)
    @inbounds for i in eachindex(X,Y,X_DOT,Y_DOT)
        x_i, y_i = X[i], Y[i]
        acc_x, acc_y = el_zero, el_zero
        @fastmath for j=(i+1):lastindex(X)
            x_j, y_j = X[j], Y[j]
            diff_x = x_i - x_j; diff_y = y_j - y_i
            inv_d = inv(diff_x*diff_x + diff_y * diff_y + EPS)
            dxdt = diff_y * inv_d; dydt = diff_x * inv_d
            acc_x += dxdt; acc_y += dydt
            X_DOT[j] -= dxdt; Y_DOT[j] -= dydt
        end
        X_DOT[i] += acc_x; Y_DOT[i] += acc_y
    end
end

I also changed diff_y to be y_j - y_i and removed the negation later on, and I switched to inv and * instead of 1.0/ and ^, although this had very little effect on performance; you don’t really need these changes.
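(A quick sanity check, assuming XY_DOT2!, XY_DOT3!, and the Float64 arrays from the original post are in scope, that the sign rearrangement preserves the result:)

julia> XY_DOT2!(X, Y, X_DOT, Y_DOT, EPS); R = (copy(X_DOT), copy(Y_DOT));

julia> XY_DOT3!(X, Y, X_DOT, Y_DOT, EPS);

julia> all(isapprox.(R[1], X_DOT)) && all(isapprox.(R[2], Y_DOT))
true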

For benchmarking:

T = Float32
N = 512; EPS = convert(T, 0.1)
X, Y = rand(T, N), rand(T, N)
X_DOT, Y_DOT = zeros(T,N), zeros(T,N)
julia> using BenchmarkTools
julia> @benchmark XY_DOT2!($X, $Y, $X_DOT, $Y_DOT, $EPS)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  219.100 μs … 318.600 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     219.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   220.527 μs ±   6.480 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▁▁    ▁                                                      ▁
  ███▅▄▅▆██▇▆▄▄▃▁▄▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▇ █
  219 μs        Histogram: log(frequency) by time        270 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark XY_DOT3!($X, $Y, $X_DOT, $Y_DOT, $EPS)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  81.600 μs … 162.500 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     82.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   82.303 μs ±   3.842 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄█▂
  ▂▅███▄▂▂▁▁▁▂▁▁▁▁▁▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▂▁▁▁▁▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  81.6 μs         Histogram: frequency by time         87.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

The performance is essentially the same when using Float64 instead, but since you are using @fastmath you may not need the precision of Float64.
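(If in doubt, one quick way to gauge whether Float32 is accurate enough for your use case, a sketch assuming XY_DOT3! as defined above, is to run the same configuration in both precisions and compare:)

# Run the same random configuration in both precisions and compare
X64, Y64 = rand(512), rand(512)
XD64, YD64 = zeros(512), zeros(512)
XD32, YD32 = zeros(Float32, 512), zeros(Float32, 512)
XY_DOT3!(X64, Y64, XD64, YD64, 0.1)
XY_DOT3!(Float32.(X64), Float32.(Y64), XD32, YD32, 0.1f0)
maximum(abs.(XD64 .- Float64.(XD32)))   # worst componentwise deviation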

I used to believe that static arrays were only advisable for sufficiently small arrays, say of length <= 50?

Thank you very much for all these elements, this is really helpful. Coming from a Fortran background, I do tend to over-use global constant objects :wink:

For example, am I correct in understanding that writing the function to take its arguments as parameters (rather than relying on global constants) allows for more versatile/generic code but plays no role regarding performance?

This is an array of small-arrays:

julia> using StaticArrays

julia> zeros(SVector{2,Float64},3)
3-element Vector{SVector{2, Float64}}:
 [0.0, 0.0]
 [0.0, 0.0]
 [0.0, 0.0]

It is usually the most convenient way to represent sets of coordinates.
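For instance, arithmetic between points then reads naturally (norm comes from LinearAlgebra):

julia> using StaticArrays, LinearAlgebra

julia> p, q = SVector(1.0, 2.0), SVector(3.0, 4.0);

julia> p - q
2-element SVector{2, Float64} with indices SOneTo(2):
 -2.0
 -2.0

julia> norm(p - q)
2.8284271247461903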

Yes, and much easier to optimize and debug, because the function is self-contained (a good practice in Fortran too :slight_smile: ).

Thank you very much for these suggestions. Unfortunately, I fear that using a floating-point representation with a typical “error” of 1e-7 will bring too large errors down the road. Using @fastmath with Float64 only degraded the signal by 1e-15, which is acceptable for my particular purpose.

In practice, my concern regarding performance is more about: (i) should I iterate over a triangular (i,j) domain or rather a square one; (ii) is it possible to reduce the number of assignments; (iii) are there better memory layouts for this calculation; (iv) is there a more efficient way to compute the “interaction kernel” inv(diff_x*diff_x + diff_y*diff_y + EPS); etc.?

What you are doing there is fine in all these aspects. I don’t think you can get much faster than that if you need to run over all pairs of particles, and without parallelization. If there is a cutoff beyond which interactions can be ignored, then it is another story: cell lists or neighbor lists can then reduce the cost from O(N^2) to roughly O(N).

Is it obvious, a priori, that an array of (tiny) static arrays is better (performance-wise) than a simple 2D array of the form zeros(Float64,2,N)?

No, they are exactly the same. It is just more convenient for writing the operations between points.

Vector{SVector} can be faster because it tells the compiler the size of the inner dimension, which can allow it to unroll things differently.
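(The two layouts are identical in memory; you can even view one as the other without copying:)

julia> using StaticArrays

julia> P = rand(SVector{2,Float64}, 3);

julia> M = reinterpret(reshape, Float64, P);  # a 2×3 matrix view of the same memory

julia> size(M)
(2, 3)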

inv is great, because it’s more type-generic and more concise. But diff_x * diff_x is not an improvement over diff_x^2, since x^2 gets transformed into x*x:

julia> foo(x) = x^2;

julia> @code_llvm debuginfo=:none foo(1.0)
; Function Attrs: uwtable
define double @julia_foo_548(double %0) #0 {
top:
  %1 = fmul double %0, %0
  ret double %1
}

In other words, x^2, being more readable and identically performant, is preferable. (Under the hood, x^2 lowers to Base.literal_pow, which for a literal exponent of 2 simply returns x*x.)

I generally don’t expect it to be faster, just identical. Yet, in this particular case, the SVector version turns out to be slower (~30%), and I don’t understand why. Do you see the difference? Here is a copy/paste runnable code of both versions. Just run main().

I get:

julia> main()
  66.126 μs (0 allocations: 0 bytes)
nothing
  101.776 μs (0 allocations: 0 bytes)
nothing
import Pkg
Pkg.activate(;temp=true)
Pkg.add(["StaticArrays", "BenchmarkTools"])
using Test
using StaticArrays
using BenchmarkTools

# With simple 1D arrays
function XYDOT2!(X,Y,X_DOT,Y_DOT,EPS)
    fill!(X_DOT,0.0); fill!(Y_DOT,0.0)
    @inbounds for i in eachindex(X,Y,X_DOT,Y_DOT)
        x_i, y_i = X[i], Y[i]
        acc_x, acc_y = 0.0, 0.0
        @fastmath for j=(i+1):lastindex(X)
            x_j, y_j = X[j], Y[j]
            diff_x = x_i - x_j; diff_y = y_i - y_j
            inv_d = inv(diff_x^2 + diff_y^2 + EPS)
            dxdt = diff_x * inv_d; dydt = diff_y * inv_d
            acc_x += dxdt; acc_y += dydt
            X_DOT[j] -= dxdt; Y_DOT[j] -= dydt
        end
        X_DOT[i] += acc_x; Y_DOT[i] += acc_y
    end
    return nothing
end

# with vectors of SVectors
function XYDOT_SV!(P::Vector{T}, XYDOT::Vector{T}, EPS) where T<:SVector
    fill!(XYDOT, zero(T))
    @inbounds for i in eachindex(P,XYDOT)
        p_i = P[i] 
        acc = zero(T)
        @fastmath for j=(i+1):lastindex(P)
            diff = p_i - P[j]
            invd = inv(diff[1]^2 + diff[2]^2 + EPS)
            dpdt = diff * invd
            acc += dpdt
            XYDOT[j] -= dpdt
        end
        XYDOT[i] += acc
    end
    return nothing
end

function main()
    N = 512
    EPS = 0.1
    P = rand(SVector{2,Float64},N)
    XYDOT = similar(P)
    X = [ x[1] for x in P ];
    Y = [ x[2] for x in P ];
    X_DOT = similar(X)
    Y_DOT = similar(Y)

    XYDOT2!(X,Y,X_DOT,Y_DOT,EPS)   
    XYDOT_SV!(P,XYDOT,EPS)
    @test all(isapprox.([SVector(x,y) for (x,y) in zip(X_DOT,Y_DOT)], XYDOT))
     
    display(@btime XYDOT2!($X,$Y,$X_DOT,$Y_DOT,$EPS))
    display(@btime XYDOT_SV!($P,$XYDOT,$EPS))

    return nothing
end