The SIMD.jl package offers the ultimate in speed for short vectors on x64 CPUs:
using SIMD
aa = SIMD.Vec{4,Float32}((3, 1, 4, 1))
bb = SIMD.Vec{4,Float32}((4, 7, 2, 9))
function foo(a, b)
    a + b
end
julia> @code_native debuginfo=:none syntax=:intel foo(aa,bb)
.text
mov rax, rdi
vmovaps xmm0, xmmword ptr [rsi]
vaddps xmm0, xmm0, xmmword ptr [rdx]
vmovaps xmmword ptr [rdi], xmm0
ret
I would like to verify that there is no alternative for a short, fixed-length Float32 vector that reliably compiles to operations on SIMD registers. I tried StaticArrays, but I found some surprises:
using StaticArrays
a = SVector{4, Float32}(3, 1, 4, 1)
b = SVector{4, Float32}(2, 5, 7, 1)
julia> @code_native debuginfo=:none syntax=:intel foo(a,b)
.text
mov rax, rdi
vmovups xmm0, xmmword ptr [rsi]
vaddps xmm0, xmm0, xmmword ptr [rdx]
vmovups xmmword ptr [rdi], xmm0
ret
Very nice, the same as SIMD.jl apart from unaligned moves (vmovups instead of vmovaps). Since I only care about 3D, I tried:
a = SVector{3, Float32}(3, 1, 4)
b = SVector{3, Float32}(2, 5, 7)
julia> @code_native debuginfo=:none syntax=:intel foo(a,b)
.text
mov rax, rdi
vmovss xmm0, dword ptr [rsi] # xmm0 = mem[0],zero,zero,zero
vaddss xmm0, xmm0, dword ptr [rdx]
vmovss xmm1, dword ptr [rsi + 4] # xmm1 = mem[0],zero,zero,zero
vaddss xmm1, xmm1, dword ptr [rdx + 4]
vmovss xmm2, dword ptr [rsi + 8] # xmm2 = mem[0],zero,zero,zero
vaddss xmm2, xmm2, dword ptr [rdx + 8]
vmovss dword ptr [rdi], xmm0
vmovss dword ptr [rdi + 4], xmm1
vmovss dword ptr [rdi + 8], xmm2
ret
nop
This is much worse: three scalar loads, three scalar adds, and three scalar stores instead of one packed add.
I’m not criticizing: StaticArrays is generic. I’m willing to sacrifice genericity for speed, so it is not an apples-to-apples comparison. I just want to make sure I’m not missing an obvious alternative before I implement a few 3D geometry algorithms that have to be very fast.
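For reference, the fallback I have in mind is to pad each 3D vector to four lanes and keep the fourth lane at zero, so that every operation goes through the same 4-wide path as the first example. The sketch below is only an illustration: `Vec3f` and `dot3` are hypothetical names of my own, not part of SIMD.jl, and I have not checked the generated code for every operation, so each one would still need a look with `@code_native`.

```julia
using SIMD

# Hypothetical wrapper (not part of SIMD.jl): a 3D vector stored in a
# 4-lane SIMD vector, with the unused 4th lane kept at zero.
struct Vec3f
    v::Vec{4,Float32}
end

# Build from three components; the padding lane is set to 0.
Vec3f(x, y, z) = Vec3f(Vec{4,Float32}((Float32(x), Float32(y), Float32(z), 0f0)))

# Element-wise arithmetic forwards to the 4-lane vector.
Base.:+(a::Vec3f, b::Vec3f) = Vec3f(a.v + b.v)
Base.:-(a::Vec3f, b::Vec3f) = Vec3f(a.v - b.v)
Base.:*(s::Real, a::Vec3f) = Vec3f(Float32(s) * a.v)

# Dot product: the zero padding lane contributes nothing to the sum.
dot3(a::Vec3f, b::Vec3f) = sum(a.v * b.v)

a = Vec3f(3, 1, 4)
b = Vec3f(2, 5, 7)
c = a + b
println((c.v[1], c.v[2], c.v[3]))
println(dot3(a, b))
```

The cost is 16 bytes per vector instead of 12, and any operation that could write a nonzero value into the padding lane (division, for example) would need extra care.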