# Vector local to loop iteration vs. Static Vector

In the simple example below, I write two functions that compute “the force” between the particles of a set. To compute the force between each pair of particles, one needs the difference in the positions of the two particles (the `dx` vector).

I have implemented two versions of the minimal example: one in which the `dx` vector is computed naively, and one in which it is built as a static array. There is no other difference between the two codes.

As expected, the static array version does not allocate and is faster. Yet that left me a bit unsatisfied: the vector `dx` is local to the inner loop iteration, and I expected the compiler to figure that out and stack-allocate it in the “naive” case as well. It does not, and the difference in performance is huge.

``````
Forces dx naive:   23.923 ms (499500 allocations: 53.35 MiB)
Forces dx static:   4.397 ms (0 allocations: 0 bytes)
``````

Here is the code:

``````
using BenchmarkTools

function forces_naive!(x, f)
    n = length(x)
    for i in 1:n-1
        for j in i+1:n
            dx = x[i] - x[j] # naive, allocates a new vector on every iteration
            f[i] .= f[i] .+ dx
            f[j] .= f[j] .- dx
        end
    end
end

using StaticArrays

function forces_static!(x, f)
    n = length(x)
    for i in 1:n-1
        for j in i+1:n
            # build dx as a stack-allocated static vector, element by element
            dx = SVector{3,Float64}(x[i][1] - x[j][1],
                                    x[i][2] - x[j][2],
                                    x[i][3] - x[j][3])
            f[i] .= f[i] .+ dx
            f[j] .= f[j] .- dx
        end
    end
end

N = 1000
f = [ zeros(3) for i in 1:N ]
x = [ rand(3) for i in 1:N ]

print("Forces dx naive: "); @btime forces_naive!($x, $f)

print("Forces dx static: "); @btime forces_static!($x, $f)
``````

My more general point is: is it unrealistic to expect the compiler to do something better with that vector, even if one does not explicitly say that it is static? Isn’t this a case where one could expect that?

(As a provocative remark, I think this is what happens in code compiled with `@jit` in some Python implementation of a similar thing; that is where this question comes from. Not that I know the details of the Python version, but while we can get a faster Julia version using static arrays, “naive” implementations like this one turn out to be faster in the JIT-compiled Python version and very slow in the Julia one, and I think this is one of the reasons.)

The more specific question is: is there a prettier syntax to write this?

``````
dx = SVector{3,Float64}(x[i][1] - x[j][1],
                        x[i][2] - x[j][2],
                        x[i][3] - x[j][3])
``````

I mean, it wouldn’t be bad if, for example, `@SVector x[i]-x[j]` got interpreted as that.
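
A somewhat shorter form that, if I am not mistaken, also avoids the intermediate allocation is to build the static vectors directly from the input vectors and subtract those:

``````
# sketch: constructing the SVectors only reads the three elements of each
# input vector, so no temporary array should be heap-allocated here
dx = SVector{3,Float64}(x[i]) - SVector{3,Float64}(x[j])
``````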


I would probably just create arrays of `SVector`s. That improves the performance again and makes the syntax look much nicer in my opinion.

``````
julia> using BenchmarkTools, StaticArrays

julia> function forces_naive!(f, x)
           n = length(x)
           for i in 1:n-1
               for j in i+1:n
                   dx = x[i] - x[j] # naive, allocates
                   f[i] .= f[i] .+ dx
                   f[j] .= f[j] .- dx
               end
           end
       end

julia> function forces_static!(f, x)
           n = length(x)
           for i in 1:n-1
               for j in i+1:n
                   dx = SVector{3,Float64}(x[i][1] - x[j][1],
                                           x[i][2] - x[j][2],
                                           x[i][3] - x[j][3])
                   f[i] .= f[i] .+ dx
                   f[j] .= f[j] .- dx
               end
           end
       end

julia> N = 1000

julia> f = [zeros(3) for i in 1:N]; x = [rand(3) for i in 1:N];

julia> @benchmark forces_naive!($f, $x)
BenchmarkTools.Trial:
  memory estimate:  53.35 MiB
  allocs estimate:  499500
  --------------
  minimum time:     18.641 ms (1.34% GC)
  median time:      19.122 ms (1.50% GC)
  mean time:        19.363 ms (1.86% GC)
  maximum time:     29.416 ms (1.51% GC)
  --------------
  samples:          259
  evals/sample:     1

julia> @benchmark forces_static!($f, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.671 ms (0.00% GC)
  median time:      3.794 ms (0.00% GC)
  mean time:        3.879 ms (0.00% GC)
  maximum time:     7.155 ms (0.00% GC)
  --------------
  samples:          1289
  evals/sample:     1

julia> function forces_static2!(f, x)
           n = length(x)
           for i in 1:n-1
               for j in i+1:n
                   dx = x[i] - x[j] # no allocation: x[i] and x[j] are SVectors
                   f[i] = f[i] + dx
                   f[j] = f[j] - dx
               end
           end
       end

julia> f2 = [@SVector zeros(3) for _ in 1:N]; x2 = [@SVector rand(3) for _ in 1:N];

julia> @benchmark forces_static2!($f2, $x2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     969.887 μs (0.00% GC)
  median time:      971.940 μs (0.00% GC)
  mean time:        984.720 μs (0.00% GC)
  maximum time:     1.780 ms (0.00% GC)
  --------------
  samples:          5077
  evals/sample:     1
``````

The compiler cannot allocate `dx` on the stack because it cannot be sure that it always has the same size based on the types of the arguments.
You could allocate it in advance and update it in place with `dx .= x[i] .- x[j]`.
But for maximum performance you’d need to store the coordinates and forces as `SVector`s or 3-tuples.
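
For instance, something along these lines (just a sketch; the function name is mine):

``````
# reuse one preallocated buffer for dx instead of allocating a fresh
# vector on every inner iteration
function forces_prealloc!(x, f)
    n = length(x)
    dx = similar(first(x))         # allocated once, outside the loops
    for i in 1:n-1
        for j in i+1:n
            dx .= x[i] .- x[j]     # in-place update, no allocation
            f[i] .= f[i] .+ dx
            f[j] .= f[j] .- dx
        end
    end
end
``````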

The JIT you mention might be a tracing JIT, meaning that it “notices” that `dx` always has the same size and then generates code under that assumption, but ready to de-optimize if it breaks. Julia doesn’t use runtime information for compilation AFAIK.
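
For instance, nothing in the method signature rules out coordinates of another length, so specializing `dx` to length 3 would require exactly the kind of runtime information a tracing JIT collects. A hypothetical call, reusing `forces_naive!` from the first post:

``````
# the same compiled method must also handle, say, 2D coordinates,
# so the length of dx is not known from the argument types alone
x2d = [rand(2) for _ in 1:10]
f2d = [zeros(2) for _ in 1:10]
forces_naive!(x2d, f2d)   # works: dx now has length 2
``````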

``````
dx = @SVector [x[i][k] - x[j][k] for k in 1:3]
``````

Edit: removed the unnecessary dot inside the comprehension.


@ranocha, yes, that is the best way to do it, but I was wondering about the cost of not using static arrays.

@DNF, thanks, I thought that allocated a vector before creating the static one. `@SVector x[i]-x[j]` cannot do the same, for the reasons @Vasily_Pisarev pointed out, right?


No, it doesn’t allocate. The macro rewrites the comprehension.

(I just realized that there was a dot on the `-` which was not needed, and was possibly confusing.)
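
A quick way to check (the small helper function here is only for illustration):

``````
using StaticArrays, BenchmarkTools

# wrapping the comprehension in a function so the benchmark measures only
# the construction of the static vector
dx_static(x, i, j) = @SVector [x[i][k] - x[j][k] for k in 1:3]

x = [rand(3) for _ in 1:10]
@btime dx_static($x, 2, 5)   # should report 0 allocations
``````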


You could also use `SIMD.jl`, which has no allocations:

``````
using BenchmarkTools, SIMD

function forces_simd!(x, f)
    n = length(x)
    for i in 1:n-1
        for j in i+1:n
            dx    = x[i] - x[j]   # Vec arithmetic, no heap allocation
            f[i] += dx
            f[j] -= dx
        end
    end
end

N  = 1000
_f = [Vec{3, Float64}(0) for i in 1:N]
_x = [Vec{3, Float64}((randn(), randn(), randn())) for i in 1:N]
print("Forces dx simd: ")
@btime forces_simd!($_x, $_f)
``````

I didn’t check the speed versus StaticArrays. I just think `SIMD.jl` is cool and was curious if this would run with no allocations, which it does.
