SVector vs Vec usage: Why do I have an 8x speedup in a simple example?

davidbp · August 17, 2019, 10:10am

Hello,

I was playing with both SVector and Vec from StaticArrays and SIMD respectively.

I was surprised to get an 8x speedup using SVector compared with Vec. Am I using Vec in an unsuitable example, or maybe I’m not using it properly?
Do I get these results only because x and y don’t change and StaticArrays applies some sort of optimization because of this?

Benchmark results

SIMD vectors
Trial(247.513 ns)
SVector vectors
Trial(39.496 ns)

Code to benchmark both functions

using StaticArrays
using BenchmarkTools
using SIMD

function make_n_sums_vec(x::Vec, y::Vec, n::Int)
    aux = zero(x)
    for i in 1:n
        aux += x + y
    end
    return aux
end

function make_n_sums_sarr(x::SArray, y::SArray, n::Int)
    aux = zero(x)
    for i in 1:n
        aux += x + y
    end
    return aux
end

x_vec  = Vec{4,Int64}((1,2,3,4))
x_sarr = SVector{4,Int64}([1,2,3,4]);

# The result should be `x_vec * 2 * 100`
println("SIMD vectors")
println(@benchmark make_n_sums_vec(x_vec, x_vec, 100))

println("SVector vectors")
println(@benchmark make_n_sums_sarr(x_sarr, x_sarr, 100))

Looking at the native code, both seem to use the vector instruction `vpaddq` for the additions:

@code_native x_vec + x_vec

	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ SIMD.jl:1020 within `+'
; │┌ @ SIMD.jl:604 within `llvmwrap' @ SIMD.jl:604
; ││┌ @ SIMD.jl:1020 within `macro expansion'
	vmovdqa	(%edx), %xmm0
	vmovdqa	16(%edx), %xmm1
	vpaddq	16(%esi), %xmm1, %xmm1
	vpaddq	(%esi), %xmm0, %xmm0
	vinsertf128	$1, %xmm1, %ymm0, %ymm0
; │└└
	vmovaps	%ymm0, (%edi)
	decl	%eax
	movl	%edi, %eax
	vzeroupper
	retl
	nopw	%cs:(%eax,%eax)
; └

@code_native x_sarr + x_sarr
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ linalg.jl:10 within `+'
; │┌ @ mapreduce.jl:17 within `map'
; ││┌ @ mapreduce.jl:21 within `_map'
; │││┌ @ mapreduce.jl:41 within `macro expansion'
; ││││┌ @ linalg.jl:10 within `+'
	vmovdqu	(%edx), %xmm0
	vmovdqu	16(%edx), %xmm1
	vpaddq	(%esi), %xmm0, %xmm0
	vpaddq	16(%esi), %xmm1, %xmm1
; │└└└└
	vmovdqu	%xmm1, 16(%edi)
	vmovdqu	%xmm0, (%edi)
	decl	%eax
	movl	%edi, %eax
	retl
	nop
; └

kristoffer.carlsson · August 17, 2019, 10:29am

Firstly, x_vec and x_sarr should be const or interpolated.
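
A minimal sketch of the two options, assuming BenchmarkTools and StaticArrays are installed. The names `add2` and `X_SARR` are hypothetical and exist only for this illustration; actual timings depend on the machine and the Julia version.

```julia
using BenchmarkTools
using StaticArrays

# Hypothetical helper used only for this illustration.
add2(x, y) = x + y

x_sarr = SVector{4,Int64}(1, 2, 3, 4)        # non-const global: its type may change later
const X_SARR = SVector{4,Int64}(1, 2, 3, 4)  # const global: its type is fixed

# Non-const global referenced directly: the measurement also includes
# the cost of accessing an untyped global and of dynamic dispatch.
@btime add2(x_sarr, x_sarr)

# Option 1: interpolate the value into the benchmark expression with `$`.
@btime add2($x_sarr, $x_sarr)

# Option 2: benchmark against a const global.
@btime add2(X_SARR, X_SARR)
```

With either option the benchmark no longer measures global-variable overhead, although the compiler may then treat the inputs as constants.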

On 1.2 with my CPU I see no difference at all in the generated code:

julia> g(aux, x, y) = aux += x + y
g (generic function with 1 method)

julia> @code_native debuginfo=:none g(zero(x_sarr), x_sarr, x_sarr)
        .section        __TEXT,__text,regular,pure_instructions
        vmovdqu (%rcx), %ymm0
        vpaddq  (%rdx), %ymm0, %ymm0
        vpaddq  (%rsi), %ymm0, %ymm0
        vmovdqu %ymm0, (%rdi)
        movq    %rdi, %rax
        vzeroupper
        retq
        nopw    (%rax,%rax)

julia> @code_native debuginfo=:none g(zero(x_vec), x_vec, x_vec)
        .section        __TEXT,__text,regular,pure_instructions
        vmovdqu (%rcx), %ymm0
        vpaddq  (%rdx), %ymm0, %ymm0
        vpaddq  (%rsi), %ymm0, %ymm0
        vmovdqa %ymm0, (%rdi)
        movq    %rdi, %rax
        vzeroupper
        retq
        nopw    (%rax,%rax)

davidbp · August 17, 2019, 12:19pm

Thank you @kristoffer.carlsson for showing me your output.

Could you tell me which SIMD version you are using?

My julia version

 Version 1.1.0 (2019-01-21)

SIMD version

  [fdea26ae] SIMD v2.8.0

I get different code

@code_native debuginfo=:none g(zero(x_sarr), x_sarr, x_sarr)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ In[62]:1 within `g'
; │┌ @ linalg.jl:10 within `+'
; ││┌ @ mapreduce.jl:17 within `map'
; │││┌ @ mapreduce.jl:21 within `_map'
; ││││┌ @ mapreduce.jl:41 within `macro expansion'
; │││││┌ @ In[62]:1 within `+'
	vmovdqu	(%ecx), %xmm0
	vmovdqu	16(%ecx), %xmm1
	vpaddq	16(%edx), %xmm1, %xmm1
	vpaddq	(%edx), %xmm0, %xmm0
	vpaddq	(%esi), %xmm0, %xmm0
	vpaddq	16(%esi), %xmm1, %xmm1
; │└└└└└
	vmovdqu	%xmm1, 16(%edi)
	vmovdqu	%xmm0, (%edi)
	decl	%eax
	movl	%edi, %eax
	retl
	nopl	(%eax,%eax)
; └

 @code_native debuginfo=:none g(zero(x_vec), x_vec, x_vec)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ In[62]:1 within `g'
; │┌ @ SIMD.jl:1020 within `+'
; ││┌ @ SIMD.jl:604 within `llvmwrap' @ SIMD.jl:604
; │││┌ @ In[62]:1 within `macro expansion'
	vmovdqa	(%ecx), %xmm0
	vmovdqa	16(%ecx), %xmm1
	vpaddq	(%edx), %xmm0, %xmm0
	vpaddq	16(%edx), %xmm1, %xmm1
	vpaddq	16(%esi), %xmm1, %xmm1
	vpaddq	(%esi), %xmm0, %xmm0
	vinsertf128	$1, %xmm1, %ymm0, %ymm0
; │└└└
	vmovaps	%ymm0, (%edi)
	decl	%eax
	movl	%edi, %eax
	vzeroupper
	retl
	nopl	(%eax)
; └

I will try with Julia 1.2 once it’s available at https://julialang.org/downloads/. Maybe it’s because of the Julia version…

kristoffer.carlsson · August 17, 2019, 12:40pm

Or maybe your CPU doesn’t support AVX2 since it doesn’t seem to be using the 256 bit simd operations.
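
One way to see which CPU Julia thinks it is running on is sketched below. It only reports the CPU name that Julia and LLVM detect; whether that microarchitecture supports AVX2 still has to be looked up in the vendor's documentation.

```julia
using InteractiveUtils  # provides versioninfo(); loaded automatically in the REPL

# LLVM's name for the host CPU microarchitecture.
println(Sys.CPU_NAME)

# Prints the Julia version, the OS, the CPU model and the LLVM target.
versioninfo()
```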

davidbp · August 17, 2019, 1:53pm

You are right: my CPU does not support AVX2, since it is not listed on Intel’s website for the i7-3720QM.

In any case, I still don’t understand the speed difference. If I test the function you provided with Vec and SVector, the static array is still 8x faster. Whatever instruction set the CPU has, I am testing g(aux,x,y) on the same machine, so I was expecting similar performance.
Maybe this expectation is completely wrong and I should expect a speed difference (I am not sure about this).

Benchmark Vec

x_vec  = Vec{4,Int64}((1,2,3,4))
aux_vec = zero(x_vec)
@benchmark g($aux_vec, $x_vec, $x_vec)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.265 ns (0.00% GC)
  median time:      2.407 ns (0.00% GC)
  mean time:        2.566 ns (0.00% GC)
  maximum time:     38.355 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Benchmark SVector

x_sarr = SVector{4,Int64}([1,2,3,4]);
aux_sarr = zero(x_sarr)
@benchmark g($aux_sarr, $x_sarr, $x_sarr) 
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     0.018 ns (0.00% GC)
  median time:      0.029 ns (0.00% GC)
  mean time:        0.035 ns (0.00% GC)
  maximum time:     8.450 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

tkoolen · August 17, 2019, 2:04pm

This is less than a CPU cycle, so literally no work is being done at runtime. Since you’re interpolating constants into @benchmark and the code is ‘simple enough’, the entire computation is done at compile time.
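
A commonly used workaround is to wrap each interpolated value in a `Ref` and dereference it inside the benchmarked expression, so the compiler cannot treat the inputs as compile-time constants. The sketch below assumes BenchmarkTools and StaticArrays are installed and reuses `g` as defined earlier in the thread; timings will differ between machines.

```julia
using BenchmarkTools
using StaticArrays

g(aux, x, y) = aux += x + y

x_sarr = SVector{4,Int64}(1, 2, 3, 4)
aux_sarr = zero(x_sarr)

# Values interpolated directly: the compiler may fold the whole sum
# at compile time, which gives sub-nanosecond timings.
@btime g($aux_sarr, $x_sarr, $x_sarr)

# Values hidden behind a Ref: each `[]` is a load performed at run time,
# so the additions have to be executed on every evaluation.
@btime g($(Ref(aux_sarr))[], $(Ref(x_sarr))[], $(Ref(x_sarr))[])
```

If the second timing is noticeably larger than the first, the first one was most likely measuring a precomputed constant rather than the additions.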

davidbp · August 17, 2019, 2:10pm

This makes sense!

For static arrays the runtime is constant (no matter how many iterations the loop has).

Maybe the compiler was smart enough to find that the output is simply x_sarr * 2 * n_iterations:

@benchmark make_n_sums_sarr($x_sarr, $x_sarr, 100)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.535 ns (0.00% GC)
  median time:      2.769 ns (0.00% GC)
  mean time:        3.032 ns (0.00% GC)
  maximum time:     33.998 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000


@benchmark make_n_sums_sarr($x_sarr, $x_sarr, 10_000)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.257 ns (0.00% GC)
  median time:      2.694 ns (0.00% GC)
  mean time:        2.633 ns (0.00% GC)
  maximum time:     57.030 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

This doesn’t happen with the SIMD version:

@benchmark make_n_sums_vec($x_vec, $x_vec, 100)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     215.142 ns (0.00% GC)
  median time:      228.391 ns (0.00% GC)
  mean time:        239.387 ns (0.00% GC)
  maximum time:     570.735 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     520

@benchmark make_n_sums_vec($x_vec, $x_vec, 1000)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.223 μs (0.00% GC)
  median time:      2.354 μs (0.00% GC)
  mean time:        2.452 μs (0.00% GC)
  maximum time:     9.436 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9
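
The guess that the SVector loop collapses to a closed form can be checked in two steps, sketched here under the assumption that StaticArrays is installed: first confirm that the loop result equals `n * (x + y)`, then inspect the optimized LLVM IR. What the IR shows depends on the Julia and LLVM versions, so the comment on it is an expectation, not a guarantee.

```julia
using StaticArrays
using InteractiveUtils  # provides @code_llvm outside the REPL

function make_n_sums_sarr(x::SArray, y::SArray, n::Int)
    aux = zero(x)
    for i in 1:n
        aux += x + y
    end
    return aux
end

x_sarr = SVector{4,Int64}(1, 2, 3, 4)

# The loop result matches the closed form n * (x + y) for every n tried.
for n in (0, 1, 100, 10_000)
    @assert make_n_sums_sarr(x_sarr, x_sarr, n) == n * (x_sarr + x_sarr)
end

# Inspect the optimized IR. If the loop was replaced by a closed form,
# one would expect to see multiplications by n and no loop in the output.
@code_llvm debuginfo=:none make_n_sums_sarr(x_sarr, x_sarr, 100)
```

A runtime that stays flat between n = 100 and n = 10_000 is consistent with such a transformation.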

tkoolen · August 17, 2019, 2:12pm

See my comment from earlier this week for some more benchmarking tips.
