SVector vs Vec usage: Why do I get an 8x speedup in a simple example?

Hello,

I was playing with both SVector and Vec, from StaticArrays and SIMD respectively.

I was surprised to obtain an 8x speedup using SVector vs Vec. Am I using Vec in an unsuitable example, or am I not using it properly?
Do I get these results only because x and y don't change, so StaticArrays can apply some sort of optimization?

Benchmark results

SIMD vectors
Trial(247.513 ns)
SVector vectors
Trial(39.496 ns)

Code to benchmark both functions

using StaticArrays
using BenchmarkTools
using SIMD

function make_n_sums_vec(x::Vec, y::Vec, n::Int)
    # accumulate x + y a total of n times, i.e. (x + y) * n
    aux = zero(x)
    for i in 1:n
        aux += x + y
    end
    return aux
end

function make_n_sums_sarr(x::SArray, y::SArray, n::Int)
    # same loop as above, but with static arrays
    aux = zero(x)
    for i in 1:n
        aux += x + y
    end
    return aux
end

x_vec  = Vec{4,Int64}((1,2,3,4))
x_sarr = SVector{4,Int64}([1,2,3,4]);

# The result should be `x_vec * 2 * 100`
println("SIMD vectors")
println(@benchmark make_n_sums_vec(x_vec, x_vec, 100))

println("SVector vectors")
println(@benchmark make_n_sums_sarr(x_sarr, x_sarr, 100))
Looking at the native code, it seems both use the vector instruction vpaddq to perform the additions:
@code_native x_vec + x_vec

	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ SIMD.jl:1020 within `+'
; │┌ @ SIMD.jl:604 within `llvmwrap' @ SIMD.jl:604
; ││┌ @ SIMD.jl:1020 within `macro expansion'
	vmovdqa	(%edx), %xmm0
	vmovdqa	16(%edx), %xmm1
	vpaddq	16(%esi), %xmm1, %xmm1
	vpaddq	(%esi), %xmm0, %xmm0
	vinsertf128	$1, %xmm1, %ymm0, %ymm0
; │└└
	vmovaps	%ymm0, (%edi)
	decl	%eax
	movl	%edi, %eax
	vzeroupper
	retl
	nopw	%cs:(%eax,%eax)
; └
@code_native x_sarr + x_sarr
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ linalg.jl:10 within `+'
; │┌ @ mapreduce.jl:17 within `map'
; ││┌ @ mapreduce.jl:21 within `_map'
; │││┌ @ mapreduce.jl:41 within `macro expansion'
; ││││┌ @ linalg.jl:10 within `+'
	vmovdqu	(%edx), %xmm0
	vmovdqu	16(%edx), %xmm1
	vpaddq	(%esi), %xmm0, %xmm0
	vpaddq	16(%esi), %xmm1, %xmm1
; │└└└└
	vmovdqu	%xmm1, 16(%edi)
	vmovdqu	%xmm0, (%edi)
	decl	%eax
	movl	%edi, %eax
	retl
	nop
; └

Firstly, x_vec and x_sarr should be const or interpolated.
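
Otherwise the benchmark measures calls on untyped globals. A minimal sketch of both fixes, using the same definitions as above:

const x_vec  = Vec{4,Int64}((1, 2, 3, 4))      # option 1: make the globals const
const x_sarr = SVector{4,Int64}(1, 2, 3, 4)

@benchmark make_n_sums_vec($x_vec, $x_vec, 100)     # option 2: interpolate with $,
@benchmark make_n_sums_sarr($x_sarr, $x_sarr, 100)  # splicing the values in as constants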

On 1.2 with my CPU I see no difference at all in the generated code:

julia> g(aux, x, y) = aux += x + y
g (generic function with 1 method)

julia> @code_native debuginfo=:none g(zero(x_sarr), x_sarr, x_sarr)
        .section        __TEXT,__text,regular,pure_instructions
        vmovdqu (%rcx), %ymm0
        vpaddq  (%rdx), %ymm0, %ymm0
        vpaddq  (%rsi), %ymm0, %ymm0
        vmovdqu %ymm0, (%rdi)
        movq    %rdi, %rax
        vzeroupper
        retq
        nopw    (%rax,%rax)

julia> @code_native debuginfo=:none g(zero(x_vec), x_vec, x_vec)
        .section        __TEXT,__text,regular,pure_instructions
        vmovdqu (%rcx), %ymm0
        vpaddq  (%rdx), %ymm0, %ymm0
        vpaddq  (%rsi), %ymm0, %ymm0
        vmovdqa %ymm0, (%rdi)
        movq    %rdi, %rax
        vzeroupper
        retq
        nopw    (%rax,%rax)

Thank you @kristoffer.carlsson for showing me your output.

Could you tell me which SIMD version you are using?

My Julia version

 Version 1.1.0 (2019-01-21)

SIMD version

  [fdea26ae] SIMD v2.8.0

I get different code:

@code_native debuginfo=:none g(zero(x_sarr), x_sarr, x_sarr)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ In[62]:1 within `g'
; │┌ @ linalg.jl:10 within `+'
; ││┌ @ mapreduce.jl:17 within `map'
; │││┌ @ mapreduce.jl:21 within `_map'
; ││││┌ @ mapreduce.jl:41 within `macro expansion'
; │││││┌ @ In[62]:1 within `+'
	vmovdqu	(%ecx), %xmm0
	vmovdqu	16(%ecx), %xmm1
	vpaddq	16(%edx), %xmm1, %xmm1
	vpaddq	(%edx), %xmm0, %xmm0
	vpaddq	(%esi), %xmm0, %xmm0
	vpaddq	16(%esi), %xmm1, %xmm1
; │└└└└└
	vmovdqu	%xmm1, 16(%edi)
	vmovdqu	%xmm0, (%edi)
	decl	%eax
	movl	%edi, %eax
	retl
	nopl	(%eax,%eax)
; └
@code_native debuginfo=:none g(zero(x_vec), x_vec, x_vec)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ In[62]:1 within `g'
; │┌ @ SIMD.jl:1020 within `+'
; ││┌ @ SIMD.jl:604 within `llvmwrap' @ SIMD.jl:604
; │││┌ @ In[62]:1 within `macro expansion'
	vmovdqa	(%ecx), %xmm0
	vmovdqa	16(%ecx), %xmm1
	vpaddq	(%edx), %xmm0, %xmm0
	vpaddq	16(%edx), %xmm1, %xmm1
	vpaddq	16(%esi), %xmm1, %xmm1
	vpaddq	(%esi), %xmm0, %xmm0
	vinsertf128	$1, %xmm1, %ymm0, %ymm0
; │└└└
	vmovaps	%ymm0, (%edi)
	decl	%eax
	movl	%edi, %eax
	vzeroupper
	retl
	nopl	(%eax)
; └

I will try with Julia 1.2 once it's on https://julialang.org/downloads/. Maybe it's because of the Julia version…

Or maybe your CPU doesn’t support AVX2 since it doesn’t seem to be using the 256 bit simd operations.
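
For reference, a quick way to check which CPU microarchitecture Julia targets (just the name, not a full feature list) is Sys.CPU_NAME:

Sys.CPU_NAME  # e.g. "ivybridge" and earlier lack AVX2; "haswell" and later have it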


You are right that my CPU does not support AVX2: it is not listed on Intel's site for the i7-3720QM.

In any case, I still don't get the speed difference. If I test the function you provided with Vec and SVector, the static array is still 8x faster. No matter which instruction set the CPU has, I am testing g(aux, x, y) on the same machine, so I was expecting similar performance.
Maybe this statement is completely wrong and I should expect a speed difference (I am not sure about this).

Benchmark Vec

x_vec  = Vec{4,Int64}((1,2,3,4))
aux_vec = zero(x_vec)
@benchmark g($aux_vec, $x_vec, $x_vec)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.265 ns (0.00% GC)
  median time:      2.407 ns (0.00% GC)
  mean time:        2.566 ns (0.00% GC)
  maximum time:     38.355 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Benchmark SVector

x_sarr = SVector{4,Int64}([1,2,3,4]);
aux_sarr = zero(x_sarr)
@benchmark g($aux_sarr, $x_sarr, $x_sarr) 
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     0.018 ns (0.00% GC)
  median time:      0.029 ns (0.00% GC)
  mean time:        0.035 ns (0.00% GC)
  maximum time:     8.450 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

This is less than a CPU cycle, so literally no work is being done at runtime. Since you’re interpolating constants into @benchmark and the code is ‘simple enough’, the entire computation is done at compile time.
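
If you want g to be measured on runtime data, the BenchmarkTools manual suggests passing the inputs through a Ref and dereferencing inside the interpolated expression, so the compiler cannot fold the call away:

@benchmark g($(Ref(aux_sarr))[], $(Ref(x_sarr))[], $(Ref(x_sarr))[])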


This makes sense!

For static arrays, the result has a constant runtime (no matter the number of iterations in the loop).

Maybe the compiler was smart enough to figure out that the output is simply x_sarr * 2 * n_iterations.
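
One way to check this hypothesis: the result should equal x_sarr doubled n times, and @code_llvm shows whether the loop survives or is reduced to a handful of instructions.

n = 10_000
@assert make_n_sums_sarr(x_sarr, x_sarr, n) == 2 * n * x_sarr  # (x + y) accumulated n times
@code_llvm make_n_sums_sarr(x_sarr, x_sarr, n)                 # inspect the generated IR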

@benchmark make_n_sums_sarr($x_sarr, $x_sarr, 100)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.535 ns (0.00% GC)
  median time:      2.769 ns (0.00% GC)
  mean time:        3.032 ns (0.00% GC)
  maximum time:     33.998 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000


@benchmark make_n_sums_sarr($x_sarr, $x_sarr, 10_000)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.257 ns (0.00% GC)
  median time:      2.694 ns (0.00% GC)
  mean time:        2.633 ns (0.00% GC)
  maximum time:     57.030 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

This doesn't happen with the SIMD version:

@benchmark make_n_sums_vec($x_vec, $x_vec, 100)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     215.142 ns (0.00% GC)
  median time:      228.391 ns (0.00% GC)
  mean time:        239.387 ns (0.00% GC)
  maximum time:     570.735 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     520

@benchmark make_n_sums_vec($x_vec, $x_vec, 1000)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.223 μs (0.00% GC)
  median time:      2.354 μs (0.00% GC)
  mean time:        2.452 μs (0.00% GC)
  maximum time:     9.436 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

See my comment from earlier this week for some more benchmarking tips.
