Accelerating an N^2 nested loop

Given a fixed integer N, and some arrays \{ x_{i} , y_{i} \}_{i}, I am interested in (repeatedly) computing expressions of the form
\dot{x}_i = -\sum_{j=1}^{N} \frac{y_i - y_j}{d_{ij}}, \qquad \dot{y}_i = \sum_{j=1}^{N} \frac{x_i - x_j}{d_{ij}}, \qquad \text{with } d_{ij} = (x_i - x_j)^2 + (y_i - y_j)^2 + \epsilon.

Here is the best code I could come up with. It exploits the antisymmetry of the summand (since d_{ij} = d_{ji}, the pair (i, j) contributes to particle j exactly minus what it contributes to particle i), so the inner loop only visits j > i:

using BenchmarkTools
const N = 512; const EPS = 0.1
const X, Y = rand(N), rand(N)
const X_DOT, Y_DOT = zeros(Float64,N), zeros(Float64,N)
##################################################
@fastmath function XY_DOT!()
    fill!(X_DOT,0.0); fill!(Y_DOT,0.0)
    @inbounds for i=1:N
        x_i, y_i = X[i], Y[i]
        acc_x, acc_y = 0.0, 0.0
        @inbounds for j=(i+1):N
            x_j, y_j = X[j], Y[j]
            diff_x = x_i - x_j; diff_y = y_i - y_j
            inv_d = 1.0/(diff_x^2 + diff_y^2 + EPS)
            dxdt = -diff_y * inv_d; dydt = diff_x * inv_d
            acc_x += dxdt; acc_y += dydt
            X_DOT[j] -= dxdt; Y_DOT[j] -= dydt
        end
        X_DOT[i] += acc_x; Y_DOT[i] += acc_y
    end
end

When benchmarked on my laptop, this gives

@btime XY_DOT!()
-->   260.093 μs (0 allocations: 0 bytes)

I tried various approaches to accelerate the code further, in particular (i) rearranging the assignments so as to sum over a "square" (i, j) domain rather than the triangular one above (sketched below); and (ii) using libraries like LoopVectorization.jl and MuladdMacro.jl. But to no avail. At this stage, I would also rather not multi-thread the code (e.g. with @batch from Polyester.jl), since I prefer to keep the cores free to run independent simulations in parallel.
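
For concreteness, the "square" variant looks roughly like the following sketch (XY_DOT_SQ! is just an illustrative name, not the exact code I benchmarked). Every (i, j) pair is visited, so the inner loop only accumulates and never writes to index j, at the cost of doing twice the arithmetic:

# Sketch of the "square" summation domain; the j == i term contributes 0/EPS = 0
function XY_DOT_SQ!(X, Y, X_DOT, Y_DOT, EPS)
    @inbounds for i in eachindex(X, Y, X_DOT, Y_DOT)
        x_i, y_i = X[i], Y[i]
        acc_x, acc_y = 0.0, 0.0
        @fastmath for j in eachindex(X, Y)
            diff_x = x_i - X[j]; diff_y = y_i - Y[j]
            inv_d = 1.0/(diff_x^2 + diff_y^2 + EPS)
            acc_x -= diff_y * inv_d
            acc_y += diff_x * inv_d
        end
        X_DOT[i] = acc_x; Y_DOT[i] = acc_y
    end
    return nothing
end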

Can this serial code be sped up further?


I can’t :slight_smile:

But at least I can perhaps suggest some style changes. The following version has the same speed without requiring a constant N or @inbounds, and it passes the variables as parameters, which is usually easier for debugging:

julia> function XY_DOT2!(X,Y,X_DOT,Y_DOT,EPS)
           fill!(X_DOT,0.0); fill!(Y_DOT,0.0)
           for i in eachindex(X,Y,X_DOT,Y_DOT)
               x_i, y_i = X[i], Y[i]
               acc_x, acc_y = 0.0, 0.0
               @fastmath for j=(i+1):lastindex(X)
                   x_j, y_j = X[j], Y[j]
                   diff_x = x_i - x_j; diff_y = y_i - y_j
                   inv_d = 1.0/(diff_x^2 + diff_y^2 + EPS)
                   dxdt = -diff_y * inv_d; dydt = diff_x * inv_d
                   acc_x += dxdt; acc_y += dydt
                   X_DOT[j] -= dxdt; Y_DOT[j] -= dydt
               end
               X_DOT[i] += acc_x; Y_DOT[i] += acc_y
           end
       end
XY_DOT2! (generic function with 1 method)

julia> @btime XY_DOT2!($X,$Y,$X_DOT,$Y_DOT,$EPS)
  65.051 μs (0 allocations: 0 bytes)

julia> @btime XY_DOT!()
  64.719 μs (0 allocations: 0 bytes)

(edited the post because I had copied the wrong code there)

Another possible suggestion (not performance-wise) would be to use static arrays all the way. That would simplify the code, although here it turned out to be somewhat slower, and I'm not sure why (one should not expect that; there might be some difference I'm not seeing):

julia> function XY_DOT_SV!(P::Vector{T}, XY_DOT::Vector{T}, EPS) where T<:SVector
           fill!(XY_DOT, zero(T))
           @inbounds for i in eachindex(P,XY_DOT)
               p_i = P[i] 
               acc = zero(T)
               @fastmath for j=(i+1):lastindex(P)
                   diff = P[j] - p_i
                   invd = one(eltype(diff)) / (diff[1]^2 + diff[2]^2 + EPS)
                   dpdt = diff * invd
                   acc += dpdt
                   XY_DOT[j] += dpdt
               end
               XY_DOT[i] -= acc
           end
           return XY_DOT
       end

julia> @btime XY_DOT_SV!($P, $XYDOT, $EPS)
  116.680 μs (0 allocations: 0 bytes)

One advantage of this approach is that you would be able to run it in other dimensions just as easily:

julia> XYDOT = zeros(SVector{3,Float64},N);

julia> P = rand(SVector{3,Float64},N);

julia> @btime XY_DOT_SV!($P, $XYDOT, $EPS)
  175.881 μs (0 allocations: 0 bytes)

(this last code must be checked for correctness; in any case, I didn't test it)

You can make some improvements as @lmiq suggested, and also make the code work with different types (like Float32):

function XY_DOT3!(X,Y,X_DOT,Y_DOT,EPS)
    el_zero = zero(eltype(X))
    fill!(X_DOT,el_zero); fill!(Y_DOT,el_zero)
    @inbounds for i in eachindex(X,Y,X_DOT,Y_DOT)
        x_i, y_i = X[i], Y[i]
        acc_x, acc_y = el_zero, el_zero
        @fastmath for j=(i+1):lastindex(X)
            x_j, y_j = X[j], Y[j]
            diff_x = x_i - x_j; diff_y = y_j - y_i
            inv_d = inv(diff_x*diff_x + diff_y * diff_y + EPS)
            dxdt = diff_y * inv_d; dydt = diff_x * inv_d
            acc_x += dxdt; acc_y += dydt
            X_DOT[j] -= dxdt; Y_DOT[j] -= dydt
        end
        X_DOT[i] += acc_x; Y_DOT[i] += acc_y
    end
end

I also changed diff_y to y_j - y_i and removed the minus sign later on, and I switched to inv and * instead of 1.0/ and ^, although this had very little effect on performance; you don't really need these changes.
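
To double-check that flipping the sign of diff_y (and dropping the minus) did not change the result, here is a quick sketch comparing against XY_DOT2! on random data:

# Sketch: verify XY_DOT3! agrees with XY_DOT2! (x, y, etc. are just throwaway test data)
x, y = rand(512), rand(512)
xd2, yd2 = zeros(512), zeros(512); XY_DOT2!(x, y, xd2, yd2, 0.1)
xd3, yd3 = zeros(512), zeros(512); XY_DOT3!(x, y, xd3, yd3, 0.1)
xd2 ≈ xd3 && yd2 ≈ yd3   # should be true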

For benchmarking:

T = Float32
N = 512; EPS = convert(T, 0.1)
X, Y = rand(T, N), rand(T, N)
X_DOT, Y_DOT = zeros(T,N), zeros(T,N)
julia> using BenchmarkTools
julia> @benchmark XY_DOT2!($X, $Y, $X_DOT, $Y_DOT, $EPS)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  219.100 μs … 318.600 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     219.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   220.527 μs ±   6.480 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▁▁    ▁                                                      ▁
  ███▅▄▅▆██▇▆▄▄▃▁▄▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▇ █
  219 μs        Histogram: log(frequency) by time        270 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark XY_DOT3!($X, $Y, $X_DOT, $Y_DOT, $EPS)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  81.600 μs … 162.500 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     82.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   82.303 μs ±   3.842 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄█▂
  ▂▅███▄▂▂▁▁▁▂▁▁▁▁▁▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▂▁▁▁▁▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  81.6 μs         Histogram: frequency by time         87.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

The performance is essentially the same when using Float64 instead, but since you are using @fastmath you may not need the precision of Float64.
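
If you want to quantify the precision loss, a quick sketch is to run the same coordinates through both precisions and compare:

# Sketch: estimate the error introduced by Float32 relative to Float64
x64, y64 = rand(512), rand(512)
x32, y32 = Float32.(x64), Float32.(y64)
xd64, yd64 = zeros(512), zeros(512)
xd32, yd32 = zeros(Float32, 512), zeros(Float32, 512)
XY_DOT3!(x64, y64, xd64, yd64, 0.1)
XY_DOT3!(x32, y32, xd32, yd32, 0.1f0)
maximum(abs, xd64 .- xd32)   # largest observed difference between the two precisions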

I used to believe that static arrays were only advisable for sufficiently small arrays, say of length <= 50?

Thank you very much for all these elements, this is really helpful. Coming from a Fortran background, I do tend to over-use global constant objects :wink:

For example, am I correct in understanding that passing the arrays as function arguments (rather than accessing global constants) allows for more versatile/generic code but plays no role regarding performance?

This is an array of small arrays; the static (fixed-size) part is each 2-element point, not the length-N container:

julia> using StaticArrays

julia> zeros(SVector{2,Float64},3)
3-element Vector{SVector{2, Float64}}:
 [0.0, 0.0]
 [0.0, 0.0]
 [0.0, 0.0]

It is usually the most convenient way to represent sets of coordinates.
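
For instance, operations between points become one-liners (a small sketch; norm comes from LinearAlgebra):

using StaticArrays, LinearAlgebra
P = rand(SVector{2,Float64}, 512)
d = norm(P[1] - P[2])      # Euclidean distance between two points
c = sum(P) / length(P)     # centroid of the set, still an SVector{2,Float64}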

Yes, and much easier to optimize and debug, because the function is self-contained (a good practice in Fortran too :slight_smile: ).

Thank you very much for these suggestions. Unfortunately, using a floating-point representation with a typical error of 1e-7 (i.e. Float32) will, I fear, bring too large errors down the road. Using @fastmath with Float64 only degraded the signal by about 1e-15, which is acceptable for my particular purpose.

In practice, my concern regarding performance is more about: (i) should I iterate over a triangular (i, j) domain or rather a square one; (ii) is it possible to reduce the number of assignments; (iii) are there better memory layouts for this calculation; (iv) is there a more efficient way to compute the "interaction kernel" inv(diff_x*diff_x + diff_y*diff_y + EPS); etc.?


What you are doing there is fine in all these aspects. I don't think you can get much faster than that if you need to run over all pairs of particles and without parallelization. If there is a cutoff beyond which interactions can be ignored, then it is another story.

Is it obvious, a priori, that an array of (tiny) static arrays is better (performance-wise) than a simple 2D array of the form zeros(Float64,2,N)?

No, they are exactly the same. It is just convenient in terms of writing the operations between points.
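
In fact, the memory layouts are identical, so you can even view a 2×N matrix as a vector of points without copying (a sketch using the reinterpret(reshape, ...) trick):

using StaticArrays
M = zeros(2, 4)                                   # plain 2×N matrix of coordinates
P = reinterpret(reshape, SVector{2,Float64}, M)   # zero-copy view: 4 points of 2 coordinates
M[1, 1] = 3.0
P[1]                                              # SVector(3.0, 0.0): same memory, no copy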

Vector{SVector} can be faster because it tells the compiler the size of the inner dimension, which can allow it to unroll things differently.
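
For example, the point length is part of the element type of a vector of SVectors, whereas for a plain matrix the inner size is only a run-time value (small illustration):

using StaticArrays
P = rand(SVector{2,Float64}, 8)
M = rand(2, 8)
eltype(P)    # SVector{2, Float64}: the length 2 is known to the compiler
eltype(M)    # Float64: size(M, 1) is only a run-time value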


inv is great because it's more type-generic and more concise. But diff_x * diff_x is not an improvement over diff_x^2, since x^2 gets transformed into x*x:

julia> foo(x) = x^2;

julia> @code_llvm debuginfo=:none foo(1.0)
; Function Attrs: uwtable
define double @julia_foo_548(double %0) #0 {
top:
  %1 = fmul double %0, %0
  ret double %1
}

In other words, x^2, being more readable and identically performant, is preferable.


I generally don't expect it to be faster, just identical. Yet, in this particular case, the SVector version turns out to be slower (~30%), and I don't understand why. Do you see the difference? Here is copy/paste-runnable code for both versions; just run main().

I get:

julia> main()
  66.126 μs (0 allocations: 0 bytes)
nothing
  101.776 μs (0 allocations: 0 bytes)
nothing
import Pkg
Pkg.activate(;temp=true)
Pkg.add(["StaticArrays", "BenchmarkTools"])
using Test
using StaticArrays
using BenchmarkTools

# With simple 1D arrays
function XYDOT2!(X,Y,X_DOT,Y_DOT,EPS)
    fill!(X_DOT,0.0); fill!(Y_DOT,0.0)
    @inbounds for i in eachindex(X,Y,X_DOT,Y_DOT)
        x_i, y_i = X[i], Y[i]
        acc_x, acc_y = 0.0, 0.0
        @fastmath for j=(i+1):lastindex(X)
            x_j, y_j = X[j], Y[j]
            diff_x = x_i - x_j; diff_y = y_i - y_j
            inv_d = inv(diff_x^2 + diff_y^2 + EPS)
            dxdt = diff_x * inv_d; dydt = diff_y * inv_d
            acc_x += dxdt; acc_y += dydt
            X_DOT[j] -= dxdt; Y_DOT[j] -= dydt
        end
        X_DOT[i] += acc_x; Y_DOT[i] += acc_y
    end
    return nothing
end

# with vectors of SVectors
function XYDOT_SV!(P::Vector{T}, XYDOT::Vector{T}, EPS) where T<:SVector
    fill!(XYDOT, zero(T))
    @inbounds for i in eachindex(P,XYDOT)
        p_i = P[i] 
        acc = zero(T)
        @fastmath for j=(i+1):lastindex(P)
            diff = p_i - P[j]
            invd = inv(diff[1]^2 + diff[2]^2 + EPS)
            dpdt = diff * invd
            acc += dpdt
            XYDOT[j] -= dpdt
        end
        XYDOT[i] += acc
    end
    return nothing
end

function main()
    N = 512
    EPS = 0.1
    P = rand(SVector{2,Float64},N)
    XYDOT = similar(P)
    X = [ x[1] for x in P ];
    Y = [ x[2] for x in P ];
    X_DOT = similar(X)
    Y_DOT = similar(Y)

    XYDOT2!(X,Y,X_DOT,Y_DOT,EPS)   
    XYDOT_SV!(P,XYDOT,EPS)
    @test all(isapprox.([SVector(x,y) for (x,y) in zip(X_DOT,Y_DOT)], XYDOT))
     
    display(@btime XYDOT2!($X,$Y,$X_DOT,$Y_DOT,$EPS))
    display(@btime XYDOT_SV!($P,$XYDOT,$EPS))

    return nothing
end