# SVector vs Vec usage: Why do I have an 8x speedup in a simple example?

Hello,

I was playing with both `SVector` and `Vec` from StaticArrays and SIMD respectively.

I was surprised to obtain an 8x speedup using `SVector` vs `Vec`. Am I using `Vec` in an unsuitable example, or maybe I'm not using it properly?
Do I get these results only because `x` and `y` don't change, and StaticArrays performs some sort of optimization because of that?

#### Benchmark results

``````
SIMD vectors
Trial(247.513 ns)
SVector vectors
Trial(39.496 ns)
``````

#### Code to benchmark both functions

``````
using StaticArrays
using BenchmarkTools
using SIMD

function make_n_sums_vec(x::Vec, y::Vec, n::Int)
    aux = zero(x)
    for i in 1:n
        aux += x + y
    end
    return aux
end

function make_n_sums_sarr(x::SArray, y::SArray, n::Int)
    aux = zero(x)
    for i in 1:n
        aux += x + y
    end
    return aux
end

x_vec  = Vec{4,Int64}((1,2,3,4))
x_sarr = SVector{4,Int64}([1,2,3,4]);

# The result should be `x_vec * 2 * 100`
println("SIMD vectors")
println(@benchmark make_n_sums_vec(x_vec, x_vec, 100))

println("SVector vectors")
println(@benchmark make_n_sums_sarr(x_sarr, x_sarr, 100))
``````
##### Looking at the native code, it seems both use the vector instruction `vpaddq` for the additions
``````
@code_native x_vec + x_vec

.section	__TEXT,__text,regular,pure_instructions
; ┌ @ SIMD.jl:1020 within `+'
; │┌ @ SIMD.jl:604 within `llvmwrap' @ SIMD.jl:604
; ││┌ @ SIMD.jl:1020 within `macro expansion'
vmovdqa	(%edx), %xmm0
vmovdqa	16(%edx), %xmm1
vpaddq	16(%esi), %xmm1, %xmm1
vpaddq	(%esi), %xmm0, %xmm0
vinsertf128	$1, %xmm1, %ymm0, %ymm0
; │└└
vmovaps	%ymm0, (%edi)
decl	%eax
movl	%edi, %eax
vzeroupper
retl
nopw	%cs:(%eax,%eax)
; └
``````
``````
@code_native x_sarr + x_sarr
.section	__TEXT,__text,regular,pure_instructions
; ┌ @ linalg.jl:10 within `+'
; │┌ @ mapreduce.jl:17 within `map'
; ││┌ @ mapreduce.jl:21 within `_map'
; │││┌ @ mapreduce.jl:41 within `macro expansion'
; ││││┌ @ linalg.jl:10 within `+'
vmovdqu	(%edx), %xmm0
vmovdqu	16(%edx), %xmm1
vpaddq	(%esi), %xmm0, %xmm0
vpaddq	16(%esi), %xmm1, %xmm1
; │└└└└
vmovdqu	%xmm1, 16(%edi)
vmovdqu	%xmm0, (%edi)
decl	%eax
movl	%edi, %eax
retl
nop
; └
``````

Firstly, `x_vec` and `x_sarr` should be declared `const` or interpolated into the benchmark with `$`.
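For instance, interpolation with `$` can be sketched like this (the toy function `f` here is hypothetical, just to illustrate the syntax):

``````julia
using StaticArrays
using BenchmarkTools

x_sarr = SVector{4,Int64}(1, 2, 3, 4)  # non-const global

f(v) = v + v  # hypothetical toy function to benchmark

# Without `$`, `x_sarr` is looked up as an untyped global on every call;
# interpolating with `$` bakes the value into the benchmark expression:
@benchmark f($x_sarr)
``````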

On 1.2 with my CPU I see no difference at all in the generated code:

``````
julia> g(aux, x, y) = aux += x + y
g (generic function with 1 method)

julia> @code_native debuginfo=:none g(zero(x_sarr), x_sarr, x_sarr)
.section        __TEXT,__text,regular,pure_instructions
vmovdqu (%rcx), %ymm0
vpaddq  (%rdx), %ymm0, %ymm0
vpaddq  (%rsi), %ymm0, %ymm0
vmovdqu %ymm0, (%rdi)
movq    %rdi, %rax
vzeroupper
retq
nopw    (%rax,%rax)

julia> @code_native debuginfo=:none g(zero(x_vec), x_vec, x_vec)
.section        __TEXT,__text,regular,pure_instructions
vmovdqu (%rcx), %ymm0
vpaddq  (%rdx), %ymm0, %ymm0
vpaddq  (%rsi), %ymm0, %ymm0
vmovdqa %ymm0, (%rdi)
movq    %rdi, %rax
vzeroupper
retq
nopw    (%rax,%rax)
``````

Thank you @kristoffer.carlsson for showing me your output.

Could you tell me which SIMD version you are using?

My Julia version:

``````
Version 1.1.0 (2019-01-21)
``````

SIMD version

``````
[fdea26ae] SIMD v2.8.0
``````

I get different code:

``````
@code_native debuginfo=:none g(zero(x_sarr), x_sarr, x_sarr)
.section	__TEXT,__text,regular,pure_instructions
; ┌ @ In[62]:1 within `g'
; │┌ @ linalg.jl:10 within `+'
; ││┌ @ mapreduce.jl:17 within `map'
; │││┌ @ mapreduce.jl:21 within `_map'
; ││││┌ @ mapreduce.jl:41 within `macro expansion'
; │││││┌ @ In[62]:1 within `+'
vmovdqu	(%ecx), %xmm0
vmovdqu	16(%ecx), %xmm1
vpaddq	16(%edx), %xmm1, %xmm1
vpaddq	(%edx), %xmm0, %xmm0
vpaddq	(%esi), %xmm0, %xmm0
vpaddq	16(%esi), %xmm1, %xmm1
; │└└└└└
vmovdqu	%xmm1, 16(%edi)
vmovdqu	%xmm0, (%edi)
decl	%eax
movl	%edi, %eax
retl
nopl	(%eax,%eax)
; └
``````
``````
@code_native debuginfo=:none g(zero(x_vec), x_vec, x_vec)
.section	__TEXT,__text,regular,pure_instructions
; ┌ @ In[62]:1 within `g'
; │┌ @ SIMD.jl:1020 within `+'
; ││┌ @ SIMD.jl:604 within `llvmwrap' @ SIMD.jl:604
; │││┌ @ In[62]:1 within `macro expansion'
vmovdqa	(%ecx), %xmm0
vmovdqa	16(%ecx), %xmm1
vpaddq	(%edx), %xmm0, %xmm0
vpaddq	16(%edx), %xmm1, %xmm1
vpaddq	16(%esi), %xmm1, %xmm1
vpaddq	(%esi), %xmm0, %xmm0
vinsertf128	$1, %xmm1, %ymm0, %ymm0
; │└└└
vmovaps	%ymm0, (%edi)
decl	%eax
movl	%edi, %eax
vzeroupper
retl
nopl	(%eax)
; └
``````

I will try with Julia 1.2 once it's available at `https://julialang.org/downloads/`. Maybe it's because of the Julia version…

Or maybe your CPU doesn't support AVX2, since your output doesn't seem to be using 256-bit SIMD operations.


You are right that my CPU does not support AVX2; it is not listed on the Intel website for the `i7-3720QM`.

In any case, I still don't get the speed difference. If I test the function you provided with both `Vec` and `SVector`, the static array is still 8x faster. Whatever instruction set the CPU has, I am testing `g(aux, x, y)` on the same machine, so I was expecting similar performance.
Maybe that expectation is completely wrong and I should expect a speed difference (I am not sure about this).

Benchmark `Vec`

``````
x_vec  = Vec{4,Int64}((1,2,3,4))
aux_vec = zero(x_vec)
@benchmark g($aux_vec, $x_vec, $x_vec)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.265 ns (0.00% GC)
median time:      2.407 ns (0.00% GC)
mean time:        2.566 ns (0.00% GC)
maximum time:     38.355 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000
``````

Benchmark `SVector`

``````
x_sarr = SVector{4,Int64}([1,2,3,4]);
aux_sarr = zero(x_sarr)
@benchmark g($aux_sarr, $x_sarr, $x_sarr)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     0.018 ns (0.00% GC)
median time:      0.029 ns (0.00% GC)
mean time:        0.035 ns (0.00% GC)
maximum time:     8.450 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000
``````

This is less than a CPU cycle, so literally no work is being done at runtime. Since you’re interpolating constants into `@benchmark` and the code is ‘simple enough’, the entire computation is done at compile time.
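One way to double-check this (a sketch using the `Ref` trick from the BenchmarkTools manual) is to hide the inputs behind a `Ref`, so the compiler can no longer treat them as compile-time constants:

``````julia
using StaticArrays
using BenchmarkTools

g(aux, x, y) = aux += x + y

x_sarr = SVector{4,Int64}(1, 2, 3, 4)
aux_sarr = zero(x_sarr)

# `$(Ref(v))[]` interpolates a Ref and dereferences it inside the benchmark,
# which blocks constant folding; the reported time should now be nonzero:
@benchmark g($(Ref(aux_sarr))[], $(Ref(x_sarr))[], $(Ref(x_sarr))[])
``````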


This makes sense!

For static arrays the runtime is constant (no matter how many iterations the loop has).

Maybe the compiler was smart enough to figure out that the output is simply `x_sarr * 2 * n_iterations`.

``````
@benchmark make_n_sums_sarr($x_sarr, $x_sarr, 100)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.535 ns (0.00% GC)
median time:      2.769 ns (0.00% GC)
mean time:        3.032 ns (0.00% GC)
maximum time:     33.998 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000

@benchmark make_n_sums_sarr($x_sarr, $x_sarr, 10_000)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.257 ns (0.00% GC)
median time:      2.694 ns (0.00% GC)
mean time:        2.633 ns (0.00% GC)
maximum time:     57.030 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000
``````
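If that guess is right, the loop should agree with the closed form exactly. A quick sanity check (a sketch, repeating the definitions so it runs on its own):

``````julia
using StaticArrays

function make_n_sums_sarr(x::SArray, y::SArray, n::Int)
    aux = zero(x)
    for i in 1:n
        aux += x + y
    end
    return aux
end

x_sarr = SVector{4,Int64}(1, 2, 3, 4)

# With x == y, each iteration adds 2x, so after n iterations aux == 2n * x:
make_n_sums_sarr(x_sarr, x_sarr, 100) == x_sarr * 2 * 100  # true
``````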

This doesn't happen with the SIMD version:

``````
@benchmark make_n_sums_vec($x_vec, $x_vec, 100)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     215.142 ns (0.00% GC)
median time:      228.391 ns (0.00% GC)
mean time:        239.387 ns (0.00% GC)
maximum time:     570.735 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     520

@benchmark make_n_sums_vec($x_vec, $x_vec, 1000)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.223 μs (0.00% GC)
median time:      2.354 μs (0.00% GC)
mean time:        2.452 μs (0.00% GC)
maximum time:     9.436 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     9
``````

See my comment from earlier this week for some more benchmarking tips.
