Vector local to loop iteration vs. Static Vector

In the simple example below, I write two functions that compute “the force” between the particles of a set. To compute the forces between each pair of particles, one needs the difference in position of the two particles (the dx vector).

I have implemented two versions of this minimal example: one in which the dx vector is computed naively, and one in which it is built as a static array. There is no other difference between the two codes.

As expected, the static-array version does not allocate and is faster. Yet that left me a bit unsatisfied. The vector dx is local to the inner loop iteration, so I expected the compiler to figure that out and stack-allocate it in the “naive” case as well. It does not, and the difference in performance is huge.

Forces dx naive:   23.923 ms (499500 allocations: 53.35 MiB)
Forces dx static:   4.397 ms (0 allocations: 0 bytes)

Here is the code:

using BenchmarkTools

function forces_naive!(x,f)
  n = length(x)
  for i in 1:n-1
    for j in i+1:n
      dx = x[i] - x[j] # naive, allocates
      f[i] .= f[i] .+ dx
      f[j] .= f[j] .- dx
    end
  end
end

using StaticArrays
function forces_static!(x,f)
  n = length(x)
  for i in 1:n-1
    for j in i+1:n
      dx = SVector{3,Float64}(x[i][1] - x[j][1],
                              x[i][2] - x[j][2],
                              x[i][3] - x[j][3])
      f[i] .= f[i] .+ dx
      f[j] .= f[j] .- dx
    end
  end
end

N = 1000
f = [ zeros(3) for i in 1:N ]
x = [ rand(3) for i in 1:N ]

print("Forces dx naive: ");@btime forces_naive!($x,$f)

print("Forces dx static: ");@btime forces_static!($x,$f)

My more general question is: is it unrealistic to expect the compiler to do something better with that vector, even when one does not explicitly declare it static? Isn’t this a case where one could expect that?

(As a provocative remark: I believe something like this happens in a Python implementation of a similar code compiled with a @jit decorator, which is where this question comes from. I don’t know the details of the Python version, but while we can get a faster Julia version by using static arrays, “naive” implementations like the one above turn out to be fast in the JIT-compiled Python version and very slow in Julia, and I think this is one of the reasons.)

The more specific question is: is there a prettier syntax to write this?

      dx = SVector{3,Float64}(x[i][1] - x[j][1],
                              x[i][2] - x[j][2],
                              x[i][3] - x[j][3])

It wouldn’t be bad if, for example, @SVector x[i] - x[j] were interpreted as that.


I would probably just create arrays of SVectors. That improves the performance again and makes the syntax look much nicer in my opinion.

julia> using BenchmarkTools, StaticArrays

julia> function forces_naive!(f, x)
         n = length(x)
         for i in 1:n-1
           for j in i+1:n
             dx = x[i] - x[j] # naive, allocates
             f[i] .= f[i] .+ dx
             f[j] .= f[j] .- dx
           end
         end
       end

julia> function forces_static!(f, x)
         n = length(x)
         for i in 1:n-1
           for j in i+1:n
             dx = SVector{3,Float64}(x[i][1] - x[j][1],
                                     x[i][2] - x[j][2],
                                     x[i][3] - x[j][3])
             f[i] .= f[i] .+ dx
             f[j] .= f[j] .- dx
           end
         end
       end

julia> N = 1000

julia> f = [zeros(3) for i in 1:N]; x = [rand(3) for i in 1:N];

julia> @benchmark forces_naive!($f, $x)
BenchmarkTools.Trial:
  memory estimate:  53.35 MiB
  allocs estimate:  499500
  --------------
  minimum time:     18.641 ms (1.34% GC)
  median time:      19.122 ms (1.50% GC)
  mean time:        19.363 ms (1.86% GC)
  maximum time:     29.416 ms (1.51% GC)
  --------------
  samples:          259
  evals/sample:     1

julia> @benchmark forces_static!($f, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.671 ms (0.00% GC)
  median time:      3.794 ms (0.00% GC)
  mean time:        3.879 ms (0.00% GC)
  maximum time:     7.155 ms (0.00% GC)
  --------------
  samples:          1289
  evals/sample:     1

julia> function forces_static2!(f, x)
           n = length(x)
           for i in 1:n-1
               for j in i+1:n
                   dx = x[i] - x[j]   # SVector arithmetic: stack values, no heap allocation
                   f[i] = f[i] + dx   # plain rebinding, since SVectors are immutable
                   f[j] = f[j] - dx
               end
           end
       end

julia> f2 = [@SVector zeros(3) for _ in 1:N]; x2 = [@SVector rand(3) for _ in 1:N];

julia> @benchmark forces_static2!($f2, $x2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     969.887 μs (0.00% GC)
  median time:      971.940 μs (0.00% GC)
  mean time:        984.720 μs (0.00% GC)
  maximum time:     1.780 ms (0.00% GC)
  --------------
  samples:          5077
  evals/sample:     1

The compiler cannot allocate dx on the stack because it cannot be sure, from the types of the arguments alone, that it always has the same size.
You may allocate it in advance and update it in place with dx .= x[i] .- x[j].
But for maximum performance you’d need to store coordinates and forces as SVectors or 3-tuples.
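
For reference, a sketch of that preallocated-buffer variant (the name forces_prealloc! is mine), assuming the same Vector{Vector{Float64}} layout as in the original post:

```julia
# Sketch: hoist dx out of the loops and update it in place, so the
# inner iteration performs no allocations after the single zeros(3).
function forces_prealloc!(x, f)
    n = length(x)
    dx = zeros(3)               # allocated once, reused in every iteration
    for i in 1:n-1
        for j in i+1:n
            dx .= x[i] .- x[j]  # in-place broadcast: no new vector
            f[i] .+= dx
            f[j] .-= dx
        end
    end
    return nothing
end
```

This removes the per-iteration allocations, but the compiler is still working with heap-backed vectors, so the SVector versions should remain faster.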

The JIT you mention might be a tracing JIT, meaning that it “notices” that dx always has the same size and then generates code under that assumption, but ready to de-optimize if it breaks. Julia doesn’t use runtime information for compilation AFAIK.

dx = @SVector [x[i][k] - x[j][k] for k in 1:3]

Edit: removed the unnecessary dot inside the comprehension.


@ranocha, yes, that is the best way to do it, but I was wondering about the cost of not using static arrays.

@DNF, thanks. I thought that allocated a vector before creating the static one. @SVector x[i] - x[j] cannot do the same, for the reasons @Vasily_Pisarev pointed out, right?
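
(For what it’s worth, converting existing plain-vector data to that layout is a cheap one-time step; a sketch, assuming 3-element inner vectors:)

```julia
using StaticArrays

x = [rand(3) for _ in 1:1000]            # existing Vector{Vector{Float64}}
xs = [SVector{3,Float64}(v) for v in x]  # one-time copy into Vector{SVector{3,Float64}}
```

After that, xs[i] - xs[j] is allocation-free.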


No, it doesn’t allocate. The macro rewrites the comprehension.

(I just realized that there was a dot on the - which was not needed, and was possibly confusing.)
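
If you want to convince yourself, you can check allocations directly (a sketch; diff3 is just a throwaway name):

```julia
using StaticArrays

# The comprehension is rewritten by the macro at parse time,
# so no intermediate Vector is ever built.
diff3(a, b) = @SVector [a[k] - b[k] for k in 1:3]

const va = @SVector [1.0, 2.0, 3.0]
const vb = @SVector [0.5, 0.5, 0.5]
diff3(va, vb)                    # warm up so compilation is excluded
@show @allocated diff3(va, vb)   # should report 0 once compiled
```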


You could also use SIMD.jl, which has no allocations:

using BenchmarkTools, SIMD
function forces_simd!(x,f)
  n = length(x)
  for i in 1:n-1
    for j in i+1:n
      dx    = x[i] - x[j]
      f[i] += dx
      f[j] -= dx
    end
  end
end

N  = 1000
_f = [Vec{3, Float64}(0) for i in 1:N]
_x = [Vec{3, Float64}((randn(), randn(), randn()))  for i in 1:N]
print("Forces dx simd: ")
@btime forces_simd!($_x,$_f)

I didn’t check the speed versus StaticArrays. I just think SIMD.jl is cool and was curious if this would run with no allocations, which it does.
