Vector local to loop iteration vs. Static Vector

In the simple example below, I write two functions that compute “the force” between the particles of a set. To compute the forces between each pair of particles, one needs the difference in position of the two particles (the dx vector).

I have implemented two versions of this minimal example: one in which the dx vector is computed naively, and one in which it is built as a static array. There is no other difference between the two codes.

As expected, the static-array version does not allocate and is faster. Yet that left me a bit unsatisfied. The vector dx is local to the inner loop iteration, so I expected the compiler to figure that out and stack-allocate it in the “naive” case as well. It does not, and the difference in performance is huge.

Forces dx naive:   23.923 ms (499500 allocations: 53.35 MiB)
Forces dx static:   4.397 ms (0 allocations: 0 bytes)

Here is the code:

using BenchmarkTools

function forces_naive!(x,f)
  n = length(x)
  for i in 1:n-1
    for j in i+1:n
      dx = x[i] - x[j] # naive, allocates
      f[i] .= f[i] .+ dx
      f[j] .= f[j] .- dx
    end
  end
end

using StaticArrays
function forces_static!(x,f)
  n = length(x)
  for i in 1:n-1
    for j in i+1:n
      dx = SVector{3,Float64}(x[i][1] - x[j][1],
                              x[i][2] - x[j][2],
                              x[i][3] - x[j][3])
      f[i] .= f[i] .+ dx
      f[j] .= f[j] .- dx
    end
  end
end

N = 1000
f = [ zeros(3) for i in 1:N ]
x = [ rand(3) for i in 1:N ]

print("Forces dx naive: ");@btime forces_naive!($x,$f)

print("Forces dx static: ");@btime forces_static!($x,$f)

My more general question is: is it unrealistic to expect the compiler to do something better with that vector, even when one does not explicitly declare it static? Isn’t this a case where one could expect that?

(As a provocative remark: I believe something like this happens in a Python implementation of a similar code compiled with a @jit decorator, which is where this question comes from. I don’t know the details of the Python version, but while we can get a faster Julia version by using static arrays, “naive” implementations like the one above turn out to be fast in the JIT-compiled Python version and very slow in Julia, and I think this is one of the reasons.)

The more specific question is: is there a prettier syntax to write this?

      dx = SVector{3,Float64}(x[i][1] - x[j][1],
                              x[i][2] - x[j][2],
                              x[i][3] - x[j][3])

It wouldn’t be bad if, for example, @SVector x[i] - x[j] were interpreted as that.


I would probably just create arrays of SVectors. That improves the performance again and makes the syntax look much nicer in my opinion.

julia> using BenchmarkTools, StaticArrays

julia> function forces_naive!(f, x)
         n = length(x)
         for i in 1:n-1
           for j in i+1:n
             dx = x[i] - x[j] # naive, allocates
             f[i] .= f[i] .+ dx
             f[j] .= f[j] .- dx
           end
         end
       end

julia> function forces_static!(f, x)
         n = length(x)
         for i in 1:n-1
           for j in i+1:n
             dx = SVector{3,Float64}(x[i][1] - x[j][1],
                                     x[i][2] - x[j][2],
                                     x[i][3] - x[j][3])
             f[i] .= f[i] .+ dx
             f[j] .= f[j] .- dx
           end
         end
       end

julia> N = 1000

julia> f = [zeros(3) for i in 1:N]; x = [rand(3) for i in 1:N];

julia> @benchmark forces_naive!($f, $x)
BenchmarkTools.Trial:
  memory estimate:  53.35 MiB
  allocs estimate:  499500
  --------------
  minimum time:     18.641 ms (1.34% GC)
  median time:      19.122 ms (1.50% GC)
  mean time:        19.363 ms (1.86% GC)
  maximum time:     29.416 ms (1.51% GC)
  --------------
  samples:          259
  evals/sample:     1

julia> @benchmark forces_static!($f, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.671 ms (0.00% GC)
  median time:      3.794 ms (0.00% GC)
  mean time:        3.879 ms (0.00% GC)
  maximum time:     7.155 ms (0.00% GC)
  --------------
  samples:          1289
  evals/sample:     1

julia> function forces_static2!(f, x)
           n = length(x)
           for i in 1:n-1
               for j in i+1:n
                   dx = x[i] - x[j]   # SVector arithmetic: stack values, no heap allocation
                   f[i] = f[i] + dx   # plain rebinding, since SVectors are immutable
                   f[j] = f[j] - dx
               end
           end
       end

julia> f2 = [@SVector zeros(3) for _ in 1:N]; x2 = [@SVector rand(3) for _ in 1:N];

julia> @benchmark forces_static2!($f2, $x2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     969.887 μs (0.00% GC)
  median time:      971.940 μs (0.00% GC)
  mean time:        984.720 μs (0.00% GC)
  maximum time:     1.780 ms (0.00% GC)
  --------------
  samples:          5077
  evals/sample:     1

The compiler cannot allocate dx on the stack because it cannot be sure, from the types of the arguments alone, that it always has the same size.
You may allocate it in advance and update it in place with dx .= x[i] .- x[j].
But for maximum performance you’d need to store coordinates and forces as SVectors or 3-tuples.
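
For reference, a sketch of that preallocated-buffer variant (the name forces_prealloc! is mine), assuming the same Vector{Vector{Float64}} layout as in the original post:

```julia
# Sketch: hoist dx out of the loops and update it in place, so the
# inner iteration performs no allocations after the single zeros(3).
function forces_prealloc!(x, f)
    n = length(x)
    dx = zeros(3)               # allocated once, reused in every iteration
    for i in 1:n-1
        for j in i+1:n
            dx .= x[i] .- x[j]  # in-place broadcast: no new vector
            f[i] .+= dx
            f[j] .-= dx
        end
    end
    return nothing
end
```

This removes the per-iteration allocations, but the compiler is still working with heap-backed vectors, so the SVector versions should remain faster.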

The JIT you mention might be a tracing JIT, meaning that it “notices” that dx always has the same size and then generates code under that assumption, but ready to de-optimize if it breaks. Julia doesn’t use runtime information for compilation AFAIK.

dx = @SVector [x[i][k] - x[j][k] for k in 1:3]

Edit: removed the unnecessary dot inside the comprehension.


@ranocha, yes, that is the best way to do it, but I was wondering about the cost of not using static arrays.

@DNF, thanks. I thought that allocated a vector before creating the static one. @SVector x[i] - x[j] cannot do the same, for the reasons @Vasily_Pisarev pointed out, right?
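
(For what it’s worth, converting existing plain-vector data to that layout is a cheap one-time step; a sketch, assuming 3-element inner vectors:)

```julia
using StaticArrays

x = [rand(3) for _ in 1:1000]            # existing Vector{Vector{Float64}}
xs = [SVector{3,Float64}(v) for v in x]  # one-time copy into Vector{SVector{3,Float64}}
```

After that, xs[i] - xs[j] is allocation-free.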


No, it doesn’t allocate. The macro rewrites the comprehension.

(I just realized that there was a dot on the - which was not needed, and was possibly confusing.)
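
If you want to convince yourself, you can check allocations directly (a sketch; diff3 is just a throwaway name):

```julia
using StaticArrays

# The comprehension is rewritten by the macro at parse time,
# so no intermediate Vector is ever built.
diff3(a, b) = @SVector [a[k] - b[k] for k in 1:3]

const va = @SVector [1.0, 2.0, 3.0]
const vb = @SVector [0.5, 0.5, 0.5]
diff3(va, vb)                    # warm up so compilation is excluded
@show @allocated diff3(va, vb)   # should report 0 once compiled
```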


You could also use SIMD.jl, which has no allocations:

using BenchmarkTools, SIMD
function forces_simd!(x,f)
  n = length(x)
  for i in 1:n-1
    for j in i+1:n
      dx    = x[i] - x[j]
      f[i] += dx
      f[j] -= dx
    end
  end
end

N  = 1000
_f = [Vec{3, Float64}(0) for i in 1:N]
_x = [Vec{3, Float64}((randn(), randn(), randn()))  for i in 1:N]
print("Forces dx simd: ")
@btime forces_simd!($_x,$_f)

I didn’t check the speed versus StaticArrays. I just think SIMD.jl is cool and was curious if this would run with no allocations, which it does.
