# SVector vs Vec usage: Why do I have an 8x speedup in a simple example?

I was playing with both `SVector` and `Vec` from StaticArrays and SIMD respectively.

I was surprised to get an 8x speedup with `SVector` over `Vec`. Is `Vec` unsuited to this kind of example, or am I not using it properly?
Do I get these results only because `x` and `y` don’t change and StaticArrays applies some optimization because of that?

#### Benchmark results

``````
SIMD vectors
Trial(247.513 ns)
SVector vectors
Trial(39.496 ns)
``````

#### Code to benchmark both functions

``````
using StaticArrays
using BenchmarkTools
using SIMD

function make_n_sums_vec(x::Vec, y::Vec, n::Int)
    aux = zero(x)
    for i in 1:n
        aux += x + y
    end
    return aux
end

function make_n_sums_sarr(x::SArray, y::SArray, n::Int)
    aux = zero(x)
    for i in 1:n
        aux += x + y
    end
    return aux
end

x_vec  = Vec{4,Int64}((1,2,3,4))
x_sarr = SVector{4,Int64}([1,2,3,4]);

# The result should be `x_vec * 2 * 100`
println("SIMD vectors")
println(@benchmark make_n_sums_vec(x_vec, x_vec, 100))

println("SVector vectors")
println(@benchmark make_n_sums_sarr(x_sarr, x_sarr, 100))
``````
##### Looking at the native code it seems both use vector instructions `vpaddq` to make the additions
``````
@code_native x_vec + x_vec

.section	__TEXT,__text,regular,pure_instructions
; ┌ @ SIMD.jl:1020 within `+'
; │┌ @ SIMD.jl:604 within `llvmwrap' @ SIMD.jl:604
; ││┌ @ SIMD.jl:1020 within `macro expansion'
vmovdqa	(%edx), %xmm0
vmovdqa	16(%edx), %xmm1
vpaddq	16(%esi), %xmm1, %xmm1
vpaddq	(%esi), %xmm0, %xmm0
vinsertf128	$1, %xmm1, %ymm0, %ymm0
; │└└
vmovaps	%ymm0, (%edi)
decl	%eax
movl	%edi, %eax
vzeroupper
retl
nopw	%cs:(%eax,%eax)
; └
``````
``````
@code_native x_sarr + x_sarr
.section	__TEXT,__text,regular,pure_instructions
; ┌ @ linalg.jl:10 within `+'
; │┌ @ mapreduce.jl:17 within `map'
; ││┌ @ mapreduce.jl:21 within `_map'
; │││┌ @ mapreduce.jl:41 within `macro expansion'
; ││││┌ @ linalg.jl:10 within `+'
vmovdqu	(%edx), %xmm0
vmovdqu	16(%edx), %xmm1
vpaddq	(%esi), %xmm0, %xmm0
vpaddq	16(%esi), %xmm1, %xmm1
; │└└└└
vmovdqu	%xmm1, 16(%edi)
vmovdqu	%xmm0, (%edi)
decl	%eax
movl	%edi, %eax
retl
nop
; └
``````

Firstly, `x_vec` and `x_sarr` are non-constant globals, so they should either be declared `const` or be interpolated into the benchmark with `$`.
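A minimal sketch of both options, reusing `make_n_sums_sarr` from the question. The name `x_const` is hypothetical and only there for illustration, and `@belapsed` (which returns the minimum time in seconds) stands in for `@benchmark` to keep the output short. The absolute numbers will depend on the machine and on the Julia and package versions.

```julia
using StaticArrays
using BenchmarkTools

function make_n_sums_sarr(x::SArray, y::SArray, n::Int)
    aux = zero(x)
    for i in 1:n
        aux += x + y
    end
    return aux
end

x_sarr = SVector{4,Int64}(1, 2, 3, 4)         # non-constant global, as in the question
const x_const = SVector{4,Int64}(1, 2, 3, 4)  # hypothetical name: a constant global

# Untyped global access: the call is dispatched dynamically on every evaluation.
t_global = @belapsed make_n_sums_sarr(x_sarr, x_sarr, 100)

# Option 1: interpolate with `$` so the benchmark sees a concretely typed value.
t_interp = @belapsed make_n_sums_sarr($x_sarr, $x_sarr, 100)

# Option 2: use a `const` global, whose type is known to the compiler.
t_const = @belapsed make_n_sums_sarr(x_const, x_const, 100)

println("global:       ", t_global, " s")
println("interpolated: ", t_interp, " s")
println("const:        ", t_const, " s")
```

The first timing is expected to include dispatch overhead that has nothing to do with `SVector` or `Vec` themselves.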

On 1.2 with my CPU I see no difference at all in the generated code:

``````
julia> g(aux, x, y) = aux += x + y
g (generic function with 1 method)

julia> @code_native debuginfo=:none g(zero(x_sarr), x_sarr, x_sarr)
.section        __TEXT,__text,regular,pure_instructions
vmovdqu (%rcx), %ymm0
vpaddq  (%rdx), %ymm0, %ymm0
vpaddq  (%rsi), %ymm0, %ymm0
vmovdqu %ymm0, (%rdi)
movq    %rdi, %rax
vzeroupper
retq
nopw    (%rax,%rax)

julia> @code_native debuginfo=:none g(zero(x_vec), x_vec, x_vec)
.section        __TEXT,__text,regular,pure_instructions
vmovdqu (%rcx), %ymm0
vpaddq  (%rdx), %ymm0, %ymm0
vpaddq  (%rsi), %ymm0, %ymm0
vmovdqa %ymm0, (%rdi)
movq    %rdi, %rax
vzeroupper
retq
nopw    (%rax,%rax)
``````
Thank you @kristoffer.carlsson for showing me your output.

Could you tell me which SIMD version you are using?

My Julia version:

``````
Version 1.1.0 (2019-01-21)
``````

SIMD version

``````
  [fdea26ae] SIMD v2.8.0
``````

I get different code

``````
@code_native debuginfo=:none g(zero(x_sarr), x_sarr, x_sarr)
.section	__TEXT,__text,regular,pure_instructions
; ┌ @ In[62]:1 within `g'
; │┌ @ linalg.jl:10 within `+'
; ││┌ @ mapreduce.jl:17 within `map'
; │││┌ @ mapreduce.jl:21 within `_map'
; ││││┌ @ mapreduce.jl:41 within `macro expansion'
; │││││┌ @ In[62]:1 within `+'
vmovdqu	(%ecx), %xmm0
vmovdqu	16(%ecx), %xmm1
vpaddq	16(%edx), %xmm1, %xmm1
vpaddq	(%edx), %xmm0, %xmm0
vpaddq	(%esi), %xmm0, %xmm0
vpaddq	16(%esi), %xmm1, %xmm1
; │└└└└└
vmovdqu	%xmm1, 16(%edi)
vmovdqu	%xmm0, (%edi)
decl	%eax
movl	%edi, %eax
retl
nopl	(%eax,%eax)
; └
``````
``````
@code_native debuginfo=:none g(zero(x_vec), x_vec, x_vec)
.section	__TEXT,__text,regular,pure_instructions
; ┌ @ In[62]:1 within `g'
; │┌ @ SIMD.jl:1020 within `+'
; ││┌ @ SIMD.jl:604 within `llvmwrap' @ SIMD.jl:604
; │││┌ @ In[62]:1 within `macro expansion'
vmovdqa	(%ecx), %xmm0
vmovdqa	16(%ecx), %xmm1
vpaddq	(%edx), %xmm0, %xmm0
vpaddq	16(%edx), %xmm1, %xmm1
vpaddq	16(%esi), %xmm1, %xmm1
vpaddq	(%esi), %xmm0, %xmm0
vinsertf128	$1, %xmm1, %ymm0, %ymm0
; │└└└
vmovaps	%ymm0, (%edi)
decl	%eax
movl	%edi, %eax
vzeroupper
retl
nopl	(%eax)
; └
``````

I will try with Julia 1.2 once it’s available at `https://julialang.org/downloads/`. Maybe the difference is due to the Julia version …

Or maybe your CPU doesn’t support AVX2, since the generated code doesn’t seem to use the 256-bit SIMD operations.

You are right: my CPU does not support AVX2, since Intel’s specification page for the `i7-3720QM` does not list it.
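One way to check this from within Julia, as a rough sketch: 256-bit integer additions (`vpaddq` on `ymm` registers) require AVX2, so their absence from the native code is a hint that the target CPU lacks it. The helper `add4` is a hypothetical name, the check is only meaningful on x86, and whether this tuple addition is vectorized at all may depend on the Julia and LLVM versions.

```julia
using InteractiveUtils  # provides code_native outside the REPL

# Hypothetical helper: elementwise sum of two 4-element Int64 tuples (256 bits each).
add4(a::NTuple{4,Int64}, b::NTuple{4,Int64}) = a .+ b

# LLVM's name for the CPU that Julia generates code for.
println("CPU target: ", Sys.CPU_NAME)

# Capture the native code as a string and look for 256-bit (ymm) registers.
asm = sprint(io -> code_native(io, add4, Tuple{NTuple{4,Int64},NTuple{4,Int64}}))
println("uses ymm registers: ", occursin("ymm", asm))
```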

In any case, I still don’t understand the speed difference. If I test the function you provided with `Vec` and `SVector`, the static array is still 8x faster. Whatever instruction set the CPU has, I am testing `g(aux,x,y)` on the same machine, so I was expecting similar performance.
Maybe this assumption is completely wrong and I should expect a speed difference (I am not sure about this).

Benchmark `Vec`

``````
x_vec  = Vec{4,Int64}((1,2,3,4))
aux_vec = zero(x_vec)
@benchmark g($aux_vec, $x_vec, $x_vec)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.265 ns (0.00% GC)
median time:      2.407 ns (0.00% GC)
mean time:        2.566 ns (0.00% GC)
maximum time:     38.355 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000
``````

Benchmark `SVector`

``````
x_sarr = SVector{4,Int64}([1,2,3,4]);
aux_sarr = zero(x_sarr)
@benchmark g($aux_sarr, $x_sarr, $x_sarr)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     0.018 ns (0.00% GC)
median time:      0.029 ns (0.00% GC)
mean time:        0.035 ns (0.00% GC)
maximum time:     8.450 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000
``````

This is less than a CPU cycle, so literally no work is being done at runtime. Since you’re interpolating constants into `@benchmark` and the code is ‘simple enough’, the entire computation is done at compile time.
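A common way to stop the compiler from folding the inputs, sketched here under the assumption that `g` is defined as earlier in the thread, is to interpolate a `Ref` and dereference it inside the benchmarked expression. The values then have to be loaded at run time. Depending on the BenchmarkTools version, plain interpolation may already behave this way, so the two timings are not guaranteed to differ.

```julia
using StaticArrays
using BenchmarkTools

g(aux, x, y) = aux += x + y

x_sarr   = SVector{4,Int64}(1, 2, 3, 4)
aux_sarr = zero(x_sarr)

# Plain interpolation: the values may be treated as compile-time constants,
# in which case the additions can be folded away entirely.
t_folded = @belapsed g($aux_sarr, $x_sarr, $x_sarr)

# Interpolate a `Ref` and dereference it with `[]` inside the expression,
# so the inputs are only known at run time.
t_runtime = @belapsed g($(Ref(aux_sarr))[], $(Ref(x_sarr))[], $(Ref(x_sarr))[])

println("plain interpolation: ", t_folded * 1e9, " ns")
println("Ref interpolation:   ", t_runtime * 1e9, " ns")
```

A sub-nanosecond result in the first line would be the same symptom as in the `SVector` benchmark above.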

This makes sense!

For static arrays the runtime is constant, no matter how many iterations the loop has.

Maybe the compiler was smart enough to work out that the output is simply `x_sarr * 2 * n_iterations`:

``````
@benchmark make_n_sums_sarr($x_sarr, $x_sarr, 100)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.535 ns (0.00% GC)
median time:      2.769 ns (0.00% GC)
mean time:        3.032 ns (0.00% GC)
maximum time:     33.998 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000

@benchmark make_n_sums_sarr($x_sarr, $x_sarr, 10_000)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.257 ns (0.00% GC)
median time:      2.694 ns (0.00% GC)
mean time:        2.633 ns (0.00% GC)
maximum time:     57.030 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000
``````
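The closed form behind that guess can at least be checked numerically. This sketch only confirms that the loop result equals `2n * x_sarr`; it does not prove what the compiler actually did. For that, the printed LLVM IR can be inspected for a remaining loop, and its exact shape may differ between Julia versions.

```julia
using StaticArrays
using InteractiveUtils  # provides @code_llvm outside the REPL

function make_n_sums_sarr(x::SArray, y::SArray, n::Int)
    aux = zero(x)
    for i in 1:n
        aux += x + y
    end
    return aux
end

x_sarr = SVector{4,Int64}(1, 2, 3, 4)

# With y == x, each iteration adds 2x, so n iterations give 2n * x.
for n in (0, 1, 100, 10_000)
    @assert make_n_sums_sarr(x_sarr, x_sarr, n) == 2n * x_sarr
end

# Print the optimized IR to see whether a loop over `n` is still present.
@code_llvm debuginfo=:none make_n_sums_sarr(x_sarr, x_sarr, 100)
```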

This doesn’t happen with the SIMD version, where the runtime grows with the number of iterations:

``````
@benchmark make_n_sums_vec($x_vec, $x_vec, 100)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     215.142 ns (0.00% GC)
median time:      228.391 ns (0.00% GC)
mean time:        239.387 ns (0.00% GC)
maximum time:     570.735 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     520

@benchmark make_n_sums_vec($x_vec, $x_vec, 1000)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.223 μs (0.00% GC)
median time:      2.354 μs (0.00% GC)
mean time:        2.452 μs (0.00% GC)
maximum time:     9.436 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     9
``````

See my comment from earlier this week for some more benchmarking tips.

