Fast short Float32 vector

pauljurczak · July 7, 2020, 1:47am

SIMD.jl package offers ultimate in speed for short vectors on x64 CPU:

aa = SIMD.Vec{4,Float32}((3, 1, 4, 1))
bb = SIMD.Vec{4,Float32}((4, 7, 2, 9))

function foo(a, b) 
  a+b 
end

julia> @code_native debuginfo=:none syntax=:intel foo(aa,bb)
        .text
        mov     rax, rdi
        vmovaps xmm0, xmmword ptr [rsi]
        vaddps  xmm0, xmm0, xmmword ptr [rdx]
        vmovaps xmmword ptr [rdi], xmm0
        ret

I would like to verify that there is no alternative for a short, fixed length Float32 vector, which reliably compiles to operations on SIMD registers. I tried StaticArrays, but I found some surprises:

a = SVector{4, Float32}(3, 1, 4, 1)
b = SVector{4, Float32}(2, 5, 7, 1)

julia> @code_native debuginfo=:none syntax=:intel foo(a,b)
        .text
        mov     rax, rdi
        vmovups xmm0, xmmword ptr [rsi]
        vaddps  xmm0, xmm0, xmmword ptr [rdx]
        vmovups xmmword ptr [rdi], xmm0
        ret

Very nice, the same as SIMD.jl. Since I only care about 3D, I tried:

a = SVector{3, Float32}(3, 1, 4)
b = SVector{3, Float32}(2, 5, 7)

julia> @code_native debuginfo=:none syntax=:intel foo(a,b)
        .text
        mov     rax, rdi
        vmovss  xmm0, dword ptr [rsi]   # xmm0 = mem[0],zero,zero,zero
        vaddss  xmm0, xmm0, dword ptr [rdx]
        vmovss  xmm1, dword ptr [rsi + 4] # xmm1 = mem[0],zero,zero,zero
        vaddss  xmm1, xmm1, dword ptr [rdx + 4]
        vmovss  xmm2, dword ptr [rsi + 8] # xmm2 = mem[0],zero,zero,zero
        vaddss  xmm2, xmm2, dword ptr [rdx + 8]
        vmovss  dword ptr [rdi], xmm0
        vmovss  dword ptr [rdi + 4], xmm1
        vmovss  dword ptr [rdi + 8], xmm2
        ret
        nop

This is much worse.

I’m not criticizing. StaticArrays are generic. I’m willing to sacrifice genericity for speed, so it is not apples to apples comparision. I just want to make sure I’m not missing an obvious alternative, before I implement a few 3D geometry algorithms, which have to be very fast.

dpsanders · July 7, 2020, 3:08am

As I think you suggested in another thread, maybe you could just work with 4-dimensional SVectors, where the 4th entry is 0.

pauljurczak · July 7, 2020, 4:52am

I just run a few tests and I found performance penalty for using SVector:

a = SVector{4, Float32}(3, 1, 4, 1)
b = SVector{4, Float32}(2, 5, 7, 1)

julia> @code_native debuginfo=:none syntax=:intel min(a,b)
        .text
        mov     rax, rdi
        vmovd   xmm0, dword ptr [rdx]   # xmm0 = mem[0],zero,zero,zero
        vmovd   xmm1, dword ptr [rsi]   # xmm1 = mem[0],zero,zero,zero
        vmovd   r9d, xmm1
        vmovd   r10d, xmm0
        vucomiss        xmm1, xmm1
        setp    r8b
        vucomiss        xmm0, xmm0
        setp    dil
        cmp     r10d, r9d
        je      L55
        vmovdqa xmm2, xmm0
        vmovdqa xmm3, xmm1
        and     dil, r8b
        je      L123
L55:
        mov     edi, 1
        nop     dword ptr [rax]
L64:
        cmp     rdi, 3
        ja      L205
        vmovd   xmm2, dword ptr [rdx + 4*rdi] # xmm2 = mem[0],zero,zero,zero
        vmovd   xmm3, dword ptr [rsi + 4*rdi] # xmm3 = mem[0],zero,zero,zero
        lea     rdi, [rdi + 1]
        vmovd   r9d, xmm3
        vmovd   r10d, xmm2
        vucomiss        xmm3, xmm3
        setp    r8b
        vucomiss        xmm2, xmm2
        setp    cl
        cmp     r10d, r9d
        je      L64
        and     cl, r8b
        jne     L64
L123:
        test    r10d, r10d
        sets    r11b
        setns   dil
        cmp     r10d, r9d
        seta    r9b
        setl    cl
        and     r9b, r11b
        and     cl, dil
        or      cl, r9b
        vucomiss        xmm2, xmm3
        setnp   dil
        and     dil, cl
        vucomiss        xmm2, xmm2
        setnp   cl
        and     cl, r8b
        or      cl, dil
        xor     cl, 1
        test    cl, cl
        je      L211
L184:
        vmovss  xmm2, dword ptr [rsi + 4] # xmm2 = mem[0],zero,zero,zero
        vmovss  xmm3, dword ptr [rsi + 8] # xmm3 = mem[0],zero,zero,zero
        vmovss  xmm4, dword ptr [rsi + 12] # xmm4 = mem[0],zero,zero,zero
        vmovdqa xmm0, xmm1
        jmp     L226
L205:
        mov     cl, 1
        test    cl, cl
        jne     L184
L211:
        vmovss  xmm2, dword ptr [rdx + 4] # xmm2 = mem[0],zero,zero,zero
        vmovss  xmm3, dword ptr [rdx + 8] # xmm3 = mem[0],zero,zero,zero
        vmovss  xmm4, dword ptr [rdx + 12] # xmm4 = mem[0],zero,zero,zero
L226:
        vmovd   dword ptr [rax], xmm0
        vmovss  dword ptr [rax + 4], xmm2
        vmovss  dword ptr [rax + 8], xmm3
        vmovss  dword ptr [rax + 12], xmm4
        ret
        nop     word ptr cs:[rax + rax]

This looks really bad compared to:

aa = SIMD.Vec{4,Float32}((3, 1, 4, 1))
bb = SIMD.Vec{4,Float32}((4, 7, 2, 9))

julia> @code_native debuginfo=:none syntax=:intel min(aa,bb)
        .text
        mov     rax, rdi
        vmovaps xmm0, xmmword ptr [rsi]
        vmovaps xmm1, xmmword ptr [rdx]
        vminps  xmm2, xmm1, xmm0
        vcmpunordps     xmm0, xmm0, xmm0
        vblendvps       xmm0, xmm2, xmm1, xmm0
        vmovaps xmmword ptr [rdi], xmm0
        ret
        nop

dpsanders · July 7, 2020, 5:01am

These are not doing the same calculation. You need min.(a, b) for the SVectors, which gives something much more similar.

pauljurczak · July 7, 2020, 5:30am

Good catch! I need to catch up on my sleep, I suppose. There is still some overhead for SVector. BTW, other than putting min. inside a function, how do you work around this error?

julia> @code_native debuginfo=:none syntax=:intel min.(a,b)
ERROR: no unique matching method found for the specified argument types

dpsanders · July 7, 2020, 5:33am

I believe you have to put it inside a function.

pauljurczak · July 7, 2020, 6:29am

I figured it out:

code_native(broadcast, Tuple{typeof(min), SVector{4, Float32}, SVector{4, Float32}}, debuginfo=:none, syntax=:intel)

or simply:

@code_native debuginfo=:none syntax=:intel broadcast(min, a, b)

Keno · July 7, 2020, 8:00am

For the initial add case, the issue is that there’s no 3-element vector load CPU instruction (since that’s 96 bytes) - at least before AVX-512. Even AVX-512, it’s not clear that setting up the masks and doing things that way is necessarily faster. One small optimization we could make is to teach LLVM that there are actually 8 dereferenceable bytes before the pointer, so it would pack everything into the upper three elements of a four element vector, at least for the load. It’s a non-trivial change to LLVM though. You do also still need to do the extraction dance afterwards to get it into the correct memory location, since you can’t just overwrite those bits. In general however, I would caution against microbenchmarking here. These ABI concerns are most prominent at function boundaries and I wouldn’t be surprised if you saw better vectorization in your actual use case.

Elrod · July 7, 2020, 9:19am

FWIW, I normally find that it is. If the mask is known at compile time, it’d just be a kmov.
Some instructions (i.e. gather/scatter) consume the mask, but otherwise it is reusable.

AVX also has vmaskmov, meaning you don’t need AVX512 for this: VMASKMOV — Conditional SIMD Packed Loads and Stores
It isn’t as nice, but I’ve still found it to work well.

Using masked moves is an important part of LoopVectorization’s fast handling of cleanups, even when the mask has to be dynamically generated at runtime.

pauljurczak · July 7, 2020, 9:39am

For element wise SIMD operations, when SVector doesn’t fill the whole length of SIMD register, we could load the whole length from the memory anyway. For example, SVector{3, Float32} would load junk as the last float in <4 x float> (LLVM IR term), perform the operation and do masked store, which will disregard the junk end of SIMD register. I don’t know if LLVM IR makes it feasible.

Looking at LLVM input, I noticed that SVector already starts with much more complex code than SIMD.Vec:

julia> @code_llvm debuginfo=:none broadcast(min, a, b)

define void @julia_broadcast_20236([1 x [4 x float]]* noalias nocapture sret, [1 x [4 x float]] addrspace(11)* nocapture nonnull readonly dereferenceable(16), [1 x [4 x float]] addrspace(11)* nocapture nonnull readonly dereferenceable(16)) {
top:
  %3 = bitcast [1 x [4 x float]] addrspace(11)* %1 to <4 x float> addrspace(11)*
  %4 = load <4 x float>, <4 x float> addrspace(11)* %3, align 1
  %5 = bitcast [1 x [4 x float]] addrspace(11)* %2 to <4 x float> addrspace(11)*
  %6 = load <4 x float>, <4 x float> addrspace(11)* %5, align 1
  %7 = fcmp olt <4 x float> %6, %4
  %8 = bitcast <4 x float> %6 to <4 x i32>
  %9 = bitcast <4 x float> %4 to <4 x i32>
  %10 = icmp sgt <4 x i32> %9, <i32 -1, i32 -1, i32 -1, i32 -1>
  %11 = icmp slt <4 x i32> %8, zeroinitializer
  %12 = and <4 x i1> %10, %11
  %13 = or <4 x i1> %7, %12
  %14 = fcmp ord <4 x float> %4, zeroinitializer
  %15 = select <4 x i1> %14, <4 x float> %6, <4 x float> %4
  %16 = fcmp ord <4 x float> %6, zeroinitializer
  %17 = select <4 x i1> %16, <4 x float> %4, <4 x float> %6
  %18 = select <4 x i1> %13, <4 x float> %15, <4 x float> %17
  %19 = bitcast [1 x [4 x float]]* %0 to <4 x float>*
  store <4 x float> %18, <4 x float>* %19, align 4
  ret void
}

vs

julia> @code_llvm debuginfo=:none min(c, d)

define void @julia_min_20071([1 x <4 x float>]* noalias nocapture sret, [1 x <4 x float>] addrspace(11)* nocapture nonnull readonly dereferenceable(16), [1 x <4 x float>] addrspace(11)* nocapture nonnull readonly dereferenceable(16)) {
top:
  %3 = getelementptr inbounds [1 x <4 x float>], [1 x <4 x float>] addrspace(11)* %1, i64 0, i64 0
  %4 = getelementptr inbounds [1 x <4 x float>], [1 x <4 x float>] addrspace(11)* %2, i64 0, i64 0
  %5 = load <4 x float>, <4 x float> addrspace(11)* %3, align 16
  %6 = load <4 x float>, <4 x float> addrspace(11)* %4, align 16
  %res.i = call <4 x float> @llvm.minnum.v4f32(<4 x float> %5, <4 x float> %6)
  %7 = getelementptr inbounds [1 x <4 x float>], [1 x <4 x float>]* %0, i64 0, i64 0
  store <4 x float> %res.i, <4 x float>* %7, align 16
  ret void
}

The trickery of SVector code in lieu of just using llvm.minnum.v4f32 doesn’t pay off:

const n = 100

rng = MersenneTwister(314159)
a = [SIMD.Vec{4,Float32}((rand.(rng, (Float32, Float32, Float32, Float32)))) for _ in 1:n]
rng = MersenneTwister(314159)
b = [SVector{4, Float32}(rand.(rng, (Float32, Float32, Float32, Float32))) for _ in 1:n]

function fsimd(a)
  r = SIMD.Vec{4,Float32}((3, 1, 4, 1))
  for x in a
    r = min(r, x) 
  end
  return r
end

function fsvec(a)
  r = SVector{4, Float32}((3, 1, 4, 1))
  for x in a
    r = min.(r, x) 
  end
  return r
end

julia> @benchmark fsimd(a)
minimum time:     209.007 ns (0.00% GC)

julia> @benchmark fsvec(b)
minimum time:     259.254 ns (0.00% GC)

Topic		Replies	Views
SVector vs Vec usage: Why do I have an 8x speedup in a simple example? Performance	7	1041	August 17, 2019
Help understanding vectorization (or lack thereof) Performance	15	1221	June 8, 2018
A simple SIMD.jl loop that is slower than a vanilla `@inbounds @simd` Performance simd	8	1889	June 27, 2021
Usage of arrays of static arrays New to Julia performance , staticarrays	16	1107	February 22, 2023
Interesting post about SIMD dot product (and cosine similarity) Offtopic performance	17	866	December 2, 2024

Fast short Float32 vector

Related topics