# Design of an N-dimensional structure for particles

Hello Everyone,

I have a structure storing the positions and velocities of a bunch of particles in N-dimensional space, and functions for updating them:

```julia
using StaticArrays

struct Particles{N,T}
    pos::Vector{SVector{N,T}}
    vel::Vector{SVector{N,T}}
end

# the struct is immutable, so update the vectors in place with broadcasting
function updatePos!(p::Particles, dt)
    p.pos .+= p.vel .* dt
end

function updateVel!(p::Particles, force::Function, dt)
    p.vel .+= force.(p.pos) .* dt
end
```

I have two questions regarding the design.

## 1. SVector{1,T} or T for Particles{1,T}?

For the one-dimensional case, should I store the data in scalar variables, like

```julia
# a scalar-based variant (Julia does not allow a separate definition
# specializing Particles{1,T}, so this would need its own name)
struct ScalarParticles{T}
    pos::Vector{T}
    vel::Vector{T}
end
```

or just keep using the general Particles{1,T}, where each element is an SVector with a single component?

```julia
# the fields of Particles{1,T}, i.e. the general definition with N = 1
pos::Vector{SVector{1,T}}
vel::Vector{SVector{1,T}}
```

Is there any difference between the two where performance is concerned?

## 2. Structure of Arrays vs. Array of Structures?

For this problem, will performance be better if we use an array of structures? e.g.

```julia
struct Particle{N,T}
    pos::SVector{N,T}
    vel::SVector{N,T}
end

# Particle is immutable, so the update builds and returns a new one
function updateParticle(p::Particle, force::Function, dt)
    pos = p.pos + p.vel*dt
    vel = p.vel + force(p.pos)*dt
    return Particle(pos, vel)
end
```

When we update the position (or velocity), we need the other quantity (velocity or position). It seems that the memory accesses could then have better spatial locality?

1. I think those are exactly the same, because the in-memory representation of a one-dimensional static vector is just a number. Edit: not exactly true (see below); the lowered code still has one extra call to getindex compared with the Vector{Float64} version (I thought the compiler would inline that).
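The memory-layout part of that claim is easy to check directly; a minimal sketch (variable names are mine, not from the thread):

```julia
using StaticArrays

# An SVector{1,Float64} is an immutable isbits wrapper around one Float64,
# so a Vector of them occupies exactly the same storage as a Vector{Float64}.
println(isbitstype(SVector{1,Float64}))                 # true
println(sizeof(SVector{1,Float64}) == sizeof(Float64))  # true

x = rand(100)
y = [SVector{1,Float64}(v) for v in x]
println(sizeof(y) == sizeof(x))                         # true: same bytes of data
```

So the difference between the two designs, if any, is in the generated code, not in the layout of the data.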

2. The performance of a particle simulation depends on the performance of the algorithm computing the interactions, which generally involves a cutoff and cell lists to avoid looping over particles that are far from each other. That makes access to the vector of positions discontinuous in memory from particle to particle anyway. Because of that, the best strategy is to store a vector of static vectors.

(If the access were always contiguous, a matrix would be the best strategy to make the most of SIMD operations, and that is the case for the update of positions given the forces, but the costly part will be the calculation of the forces.)
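As a side note of mine (not from the original reply): storing a vector of static vectors does not lock you out of the matrix layout, because on Julia ≥ 1.6 the same memory can be viewed as a matrix without copying:

```julia
using StaticArrays

pos = [rand(SVector{3,Float64}) for _ in 1:4]

# Reinterpret the same buffer as a 3×4 matrix: one column per particle, no copy.
M = reinterpret(reshape, Float64, pos)
println(size(M))           # (3, 4)
println(M[:, 2] == pos[2]) # true: column 2 is particle 2's coordinates
```

That makes it cheap to hand the positions to code that expects a 3×N matrix while keeping the vector-of-SVectors layout as the primary representation.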

```julia
julia> using StaticArrays

julia> x = rand(1000);

julia> y = [ SVector{1,Float64}(x[i]) for i in 1:1000 ];

julia> function mysum1(x) # with Vector{Float64}
           s = 0.
           @inbounds @simd for i in 1:length(x)
               s += sin(x[i])
           end
           s
       end
mysum1 (generic function with 1 method)

julia> function mysum2(x) # with Vector{SVector{1,Float64}}
           s = 0.
           @inbounds @simd for i in 1:length(x)
               s += sin(x[i][1])
           end
           s
       end
mysum2 (generic function with 1 method)
```

Functions mysum1 and mysum2 are lowered to the following code, which differs by:

mysum1:

```
│   %28 = s
│   %29 = Base.getindex(x, i)
│   %30 = Main.sin(%29)
```

mysum2:

```
│   %28 = s
│   %29 = Base.getindex(x, i)
│   %30 = Base.getindex(%29, 1)
│   %31 = Main.sin(%30)
```
Details
```
julia> @code_lowered mysum1(x)
CodeInfo(
1 ─       Core.NewvarNode(:(val))
│         s = 0.0
│         $(Expr(:inbounds, true))
│   %4  = Main.length(x)
│         r#273 = 1:%4
│   %6  = Base.simd_outer_range
│   %7  = (%6)(r#273)
│         @_5 = Base.iterate(%7)
│   %9  = @_5 === nothing
│   %10 = Base.not_int(%9)
└──       goto #8 if not %10
2 ─ %12 = @_5
│         i#274 = Core.getfield(%12, 1)
│   %14 = Core.getfield(%12, 2)
│   %15 = Base.simd_inner_length
│   %16 = r#273
│         n#275 = (%15)(%16, i#274)
│   %18 = Main.zero(n#275)
│   %19 = %18 < n#275
└──       goto #6 if not %19
3 ─       i#276 = Main.zero(n#275)
4 ─ %22 = i#276 < n#275
└──       goto #6 if not %22
5 ─ %24 = Base.simd_index
│   %25 = r#273
│   %26 = i#274
│         i = (%24)(%25, %26, i#276)
│   %28 = s
│   %29 = Base.getindex(x, i)
│   %30 = Main.sin(%29)
│         s = %28 + %30
│         i#276 = i#276 + 1
│         $(Expr(:loopinfo, Symbol("julia.simdloop"), nothing))
└──       goto #4
6 ─       @_5 = Base.iterate(%7, %14)
│   %36 = @_5 === nothing
│   %37 = Base.not_int(%36)
└──       goto #8 if not %37
7 ─       goto #2
8 ─       val = Main.nothing
│         $(Expr(:inbounds, :pop))
│         val
└──       return s
)
```

```
julia> @code_lowered mysum2(y)
CodeInfo(
1 ─       Core.NewvarNode(:(val))
│         s = 0.0
│         $(Expr(:inbounds, true))
│   %4  = Main.length(x)
│         r#277 = 1:%4
│   %6  = Base.simd_outer_range
│   %7  = (%6)(r#277)
│         @_5 = Base.iterate(%7)
│   %9  = @_5 === nothing
│   %10 = Base.not_int(%9)
└──       goto #8 if not %10
2 ─ %12 = @_5
│         i#278 = Core.getfield(%12, 1)
│   %14 = Core.getfield(%12, 2)
│   %15 = Base.simd_inner_length
│   %16 = r#277
│         n#279 = (%15)(%16, i#278)
│   %18 = Main.zero(n#279)
│   %19 = %18 < n#279
└──       goto #6 if not %19
3 ─       i#280 = Main.zero(n#279)
4 ─ %22 = i#280 < n#279
└──       goto #6 if not %22
5 ─ %24 = Base.simd_index
│   %25 = r#277
│   %26 = i#278
│         i = (%24)(%25, %26, i#280)
│   %28 = s
│   %29 = Base.getindex(x, i)
│   %30 = Base.getindex(%29, 1)
│   %31 = Main.sin(%30)
│         s = %28 + %31
│         i#280 = i#280 + 1
│         $(Expr(:loopinfo, Symbol("julia.simdloop"), nothing))
└──       goto #4
6 ─       @_5 = Base.iterate(%7, %14)
│   %37 = @_5 === nothing
│   %38 = Base.not_int(%37)
└──       goto #8 if not %38
7 ─       goto #2
8 ─       val = Main.nothing
│         $(Expr(:inbounds, :pop))
│         val
└──       return s
)
```


Adding to the second answer: consider this typical scenario, in which I compute the sum of the squared distances between 1000 particles in three dimensions.

I will define three functions and particle representations: one in which the coordinates of each particle are stored in a column of a matrix (so the matrix is a long row of particles), another in which the coordinates of each particle are a row of the matrix (so each coordinate runs down a long column), and a third using a vector of static 3D vectors.

In the first test, I compute the sum of the squared distances over all pairs of the 1000 particles, so all matrices and vectors are traversed contiguously in memory:

Code

```julia
using BenchmarkTools
using StaticArrays

function sumd_longline(M)
    N = size(M,2)
    s = 0.
    @inbounds for i in 1:N-1
        @simd for j in i+1:N
            s += (M[1,i]-M[1,j])^2 + (M[2,i]-M[2,j])^2 + (M[3,i]-M[3,j])^2
        end
    end
    s
end

function sumd_longcol(M)
    N = size(M,1)
    s = 0.
    @inbounds for i in 1:N-1
        @simd for j in i+1:N
            s += (M[i,1]-M[j,1])^2 + (M[i,2]-M[j,2])^2 + (M[i,3]-M[j,3])^2
        end
    end
    s
end

function sumd_staticvectors(x)
    N = length(x)
    s = 0.
    @inbounds for i in 1:N-1
        for j in i+1:N
            s += (x[i][1]-x[j][1])^2 + (x[i][2]-x[j][2])^2 + (x[i][3]-x[j][3])^2
        end
    end
    s
end

M1 = rand(3,1000);
print(" long line: "); @btime sumd_longline($M1)

M2 = rand(1000,3);
print(" long col: "); @btime sumd_longcol($M2)

x = [ rand(SVector{3,Float64}) for i in 1:1000 ];
print(" static vectors: "); @btime sumd_staticvectors($x)
```

The result is:

```
 long line:   666.828 μs (0 allocations: 0 bytes)
 long col:   230.356 μs (0 allocations: 0 bytes)
 static vectors:   663.359 μs (0 allocations: 0 bytes)
```

As expected, there is a huge benefit in using a storage layout with long columns that are traversed in the inner loop, taking advantage of SIMD operations.

However, as I mentioned, this is not the most typical situation in a particle simulation. You will most likely compute the distances (and interactions) between particles which have no reason to be contiguous in memory, if you are using any algorithm that lets you skip the computations for particles farther from each other than some cutoff.
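To make the cutoff idea concrete, here is a minimal hypothetical sketch (pairsum and pair_energy are illustrative names of mine, not from the thread); real codes use cell lists so that most distant pairs are never even visited:

```julia
using StaticArrays

# placeholder pair interaction as a function of the squared distance
pair_energy(r2) = r2

function pairsum(x, cutoff)
    s = 0.0
    c2 = cutoff^2
    for i in 1:length(x)-1, j in i+1:length(x)
        r2 = sum(abs2, x[i] - x[j])
        r2 < c2 || continue  # skip pairs beyond the cutoff
        s += pair_energy(r2)
    end
    return s
end

x = [rand(SVector{3,Float64}) for _ in 1:100]
pairsum(x, 0.2)  # only nearby pairs contribute
```

The point is that which pairs survive the cutoff test depends on the particle coordinates, not on their order in storage, so the accesses to x are effectively random.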

In that case, a more realistic scenario is computing the sums above between particle coordinates which are not contiguous in memory. The code below emulates that by generating ten thousand particles and computing the sum of squared distances over a random subset of 1000 of them:

Code

```julia
function sumd_longline(M,indexes)
    N = length(indexes)
    s = 0.
    @inbounds for i1 in 1:N-1
        @simd for j1 in i1+1:N
            i = indexes[i1]
            j = indexes[j1]
            s += (M[1,i]-M[1,j])^2 + (M[2,i]-M[2,j])^2 + (M[3,i]-M[3,j])^2
        end
    end
    s
end

function sumd_longcol(M,indexes)
    N = length(indexes)
    s = 0.
    @inbounds for i1 in 1:N-1
        @simd for j1 in i1+1:N
            i = indexes[i1]
            j = indexes[j1]
            s += (M[i,1]-M[j,1])^2 + (M[i,2]-M[j,2])^2 + (M[i,3]-M[j,3])^2
        end
    end
    s
end

function sumd_staticvectors(x,indexes)
    N = length(indexes)
    s = 0.
    @inbounds for i1 in 1:N-1
        for j1 in i1+1:N
            i = indexes[i1]
            j = indexes[j1]
            s += (x[i][1]-x[j][1])^2 + (x[i][2]-x[j][2])^2 + (x[i][3]-x[j][3])^2
        end
    end
    s
end

indexes = rand(1:10000,1000)

M1 = rand(3,10000)
print(" long line: "); @btime sumd_longline($M1,$indexes)

M2 = rand(10000,3)
print(" long col: "); @btime sumd_longcol($M2,$indexes)

x = [ rand(SVector{3,Float64}) for i in 1:10000 ];
print(" static vectors: "); @btime sumd_staticvectors($x,$indexes)
```

The result is:

```
 long line:   962.355 μs (0 allocations: 0 bytes)
 long col:   1.001 ms (0 allocations: 0 bytes)
 static vectors:   785.297 μs (0 allocations: 0 bytes)
```

Since access to the arrays is no longer contiguous in memory in any of the cases, SIMD operations do not provide any performance gain, and now the vector of static vectors is the best structure. One could consider copying the subarrays into contiguous arrays before doing the computations, but I think that is not worth the effort in most cases.

I am very interested in any deeper discussion of these possibilities, if anyone has anything to add.


Thank you so much for your explanation and analysis!

May I ask further why the inner loop of sumd_longcol in the first set of benchmarks can take advantage of SIMD? I have a very limited understanding of SIMD (only something like figure 2 in the post Basics of SIMD Programming).

The Wikipedia article is a good place to start: SIMD - Wikipedia

Also take into account that Julia is column-major: Row- and column-major order - Wikipedia
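A quick way to see the column-major layout at the REPL (a minimal sketch of mine):

```julia
# Julia stores matrices column by column, so elements of one column are
# adjacent in memory; that is why an inner loop running down a column
# (as in sumd_longcol) gives the vectorizer contiguous loads.
A = reshape(collect(1:6), 2, 3)
println(vec(A))    # [1, 2, 3, 4, 5, 6] — the actual memory order
println(A[2, 1])   # 2: the element right after A[1,1] in memory
println(A[1, 2])   # 3: the start of the second column
```

With contiguous loads, the CPU can fetch several consecutive coordinates into one wide register and subtract/square them in a single instruction, which is exactly what `@simd` permits the compiler to do.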