I've come across this interesting observation in my package, CliffordNumbers.jl. For context, the multiplication done here is a geometric product (relevant code here), implemented as a grid multiply between blade coefficients (elements), which in principle can be expressed neatly as a series of permutes, vectorized multiplies, and vectorized adds. The CliffordNumber{VGA(3),T} instances are backed by an NTuple{8,T}.
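To make that structure concrete, here is a minimal scalar sketch of such a grid multiply. It is not the package's actual kernel, and it assumes the coefficients sit in plain binary (bitmask) blade order, which may not match the package's internal layout:

# Sign picked up when multiplying two basis blades encoded as bitmasks,
# valid for a positive-definite algebra like VGA(3): count the transpositions
# needed to bring the basis vectors into canonical order.
function reordering_sign(a::Unsigned, b::Unsigned)
    swaps = 0
    a >>= 1
    while !iszero(a)
        swaps += count_ones(a & b)
        a >>= 1
    end
    return iseven(swaps) ? 1 : -1
end

# Naive grid multiply: every coefficient of x meets every coefficient of y,
# and each product lands at the blade whose bitmask is the XOR of the inputs.
function naive_geometric_product(x::NTuple{8,T}, y::NTuple{8,T}) where T
    z = ntuple(_ -> zero(T), Val(8))
    for i in 0:7, j in 0:7
        k = xor(i, j) + 1                      # 1-based index of the product blade
        s = reordering_sign(UInt(i), UInt(j))  # ±1 from reordering basis vectors
        z = Base.setindex(z, z[k] + s * x[i + 1] * y[j + 1], k)
    end
    return z
end

Each output coefficient is a signed sum of eight products, which is why the whole operation maps naturally onto permutes, broadcast multiplies, and adds. Here is the setup for the benchmarks: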
using CliffordNumbers, BenchmarkTools
x = CliffordNumber{VGA(3), Int64}(0, 4, 2, 0, 0, 0, 0, 0)
y = CliffordNumber{VGA(3), Int64}(0, 0, 0, 0, 0, 6, 9, 0)
# Convert the scalar entries to Float64
xx = scalar_convert(Float64, x)
yy = scalar_convert(Float64, y)
If I benchmark each of these multiplications, I get significantly different results:
julia> @benchmark $x * $y
BenchmarkTools.Trial: 10000 samples with 998 evaluations.
 Range (min … max):  16.999 ns … 400.718 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     17.777 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.070 ns ±  15.576 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram bars omitted]
  17 ns         Histogram: log(frequency) by time      96.5 ns <
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark $xx * $yy
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  6.087 ns … 111.092 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.191 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.054 ns ±   4.577 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram bars omitted]
  6.09 ns      Histogram: log(frequency) by time      25.5 ns <
 Memory estimate: 0 bytes, allocs estimate: 0.
I thought it was very weird that a) there was such a discrepancy, and b) the Float64 case was so much faster than the Int64 case. Looking at the machine code, I found that the Int64 case fails to vectorize, even when Julia is run with the -O3 flag:
        .text
        .file   "*"
        .globl  "julia_*_3117"                  # -- Begin function julia_*_3117
        .p2align        4, 0x90
        .type   "julia_*_3117",@function
"julia_*_3117":                         # @"julia_*_3117"
# %bb.0:                                # %top
        push    rbp
        mov     rbp, rsp
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        sub     rsp, 80
        mov     r10, rdx
        mov     rdx, rsi
        mov     qword ptr [rbp - 248], rdi      # 8-byte Spill
        mov     r11, qword ptr [r10 + 48]
        mov     r9, qword ptr [r10 + 40]
        mov     r8, qword ptr [r10 + 32]
        mov     rdi, qword ptr [r10 + 16]
        mov     r12, qword ptr [r10 + 8]
        mov     rsi, qword ptr [rsi + 32]
        mov     rcx, qword ptr [rdx + 24]
        mov     rax, r11
        mov     qword ptr [rbp - 136], rsi      # 8-byte Spill
        imul    rax, rsi
        mov     rbx, r12
        imul    rbx, rcx
        add     rbx, rax
        mov     qword ptr [rbp - 56], rbx       # 8-byte Spill
        mov     rax, rdi
        imul    rax, rsi
        mov     rsi, r9
        imul    rsi, rcx
        add     rsi, rax
        mov     qword ptr [rbp - 240], rsi      # 8-byte Spill
        mov     rax, rcx
        mov     rbx, rcx
        imul    rax, rdi
        mov     r15, rdi
        mov     qword ptr [rbp - 48], rdi       # 8-byte Spill
        mov     r14, qword ptr [rdx + 40]
        mov     rsi, r8
        mov     qword ptr [rbp - 64], r8        # 8-byte Spill
        imul    rsi, r14
        add     rsi, rax
        mov     qword ptr [rbp - 120], rsi      # 8-byte Spill
        mov     rcx, qword ptr [rdx + 16]
        mov     qword ptr [rbp - 96], rcx       # 8-byte Spill
        mov     rax, r12
        imul    rax, rcx
        mov     rdi, r11
        imul    rdi, r14
        add     rdi, rax
        mov     qword ptr [rbp - 88], rdi       # 8-byte Spill
        mov     rax, r11
        imul    rax, rbx
        mov     rdi, rbx
        mov     qword ptr [rbp - 192], rbx      # 8-byte Spill
        mov     r13, qword ptr [r10]
        mov     rbx, r13
        imul    rbx, r14
        add     rbx, rax
        mov     qword ptr [rbp - 224], rbx      # 8-byte Spill
        mov     rbx, r9
        mov     rax, r9
        imul    rax, rcx
        mov     rsi, r15
        imul    rsi, r14
        add     rsi, rax
        mov     qword ptr [rbp - 232], rsi      # 8-byte Spill
        mov     rax, qword ptr [rdx + 56]
        mov     qword ptr [rbp - 104], rax      # 8-byte Spill
        mov     r15, qword ptr [r10 + 56]
        mov     r9, r15
        imul    r9, rax
        mov     rcx, qword ptr [rdx + 48]
        mov     qword ptr [rbp - 152], rcx      # 8-byte Spill
        mov     rax, r11
        imul    rax, rcx
        add     rax, r9
        mov     r9, rbx
        mov     rcx, rbx
        imul    r9, r14
        add     r9, rax
        mov     r8, qword ptr [r10 + 24]
        mov     r10, rdi
        imul    r10, r8
        add     r10, r9
        mov     rsi, qword ptr [rbp - 64]       # 8-byte Reload
        mov     r9, rsi
        mov     rbx, qword ptr [rbp - 136]      # 8-byte Reload
        imul    r9, rbx
        sub     r9, r10
        mov     rax, qword ptr [rbp - 48]       # 8-byte Reload
        mov     rdi, rax
        imul    rdi, qword ptr [rbp - 96]       # 8-byte Folded Reload
        add     r9, rdi
        mov     r10, qword ptr [rdx + 8]
        mov     rdi, r12
        imul    rdi, r10
        add     r9, rdi
        mov     rdx, qword ptr [rdx]
        mov     qword ptr [rbp - 128], rdx      # 8-byte Spill
        mov     rdi, r13
        imul    rdi, rdx
        add     r9, rdi
        mov     rdi, r13
        imul    rdi, r10
        mov     rdx, r8
        imul    rdx, r10
        mov     qword ptr [rbp - 72], rdx       # 8-byte Spill
        mov     rdx, rax
        imul    rdx, r10
        mov     qword ptr [rbp - 160], rdx      # 8-byte Spill
        mov     rax, rcx
        imul    rax, r10
        mov     qword ptr [rbp - 80], rax       # 8-byte Spill
        mov     rax, rsi
        imul    rax, r10
        mov     qword ptr [rbp - 200], rax      # 8-byte Spill
        mov     rax, r15
        imul    rax, r10
        mov     qword ptr [rbp - 216], rax      # 8-byte Spill
        imul    r10, r11
        mov     qword ptr [rbp - 208], r10      # 8-byte Spill
        mov     qword ptr [rbp - 112], r11      # 8-byte Spill
        mov     qword ptr [rbp - 184], r11      # 8-byte Spill
        imul    r11, qword ptr [rbp - 104]      # 8-byte Folded Reload
        mov     rdx, r15
        mov     r10, qword ptr [rbp - 152]      # 8-byte Reload
        imul    rdx, r10
        add     rdx, r11
        mov     r11, rcx
        imul    r11, rbx
        add     r11, rdx
        mov     rdx, r8
        mov     rsi, qword ptr [rbp - 96]       # 8-byte Reload
        imul    rdx, rsi
        add     rdx, r11
        mov     r11, qword ptr [rbp - 120]      # 8-byte Reload
        sub     r11, rdx
        add     r11, rdi
        mov     rdx, r12
        mov     rax, qword ptr [rbp - 128]      # 8-byte Reload
        imul    rdx, rax
        add     r11, rdx
        mov     qword ptr [rbp - 120], r11      # 8-byte Spill
        mov     rdx, qword ptr [rbp - 64]       # 8-byte Reload
        imul    rdx, r10
        mov     rdi, rcx
        mov     qword ptr [rbp - 176], rcx      # 8-byte Spill
        mov     r11, qword ptr [rbp - 104]      # 8-byte Reload
        imul    rdi, r11
        add     rdi, rdx
        mov     rdx, r15
        imul    rdx, r14
        add     rdi, rdx
        sub     rdi, qword ptr [rbp - 56]       # 8-byte Folded Reload
        mov     rdx, r13
        imul    rdx, rsi
        add     rdi, rdx
        add     rdi, qword ptr [rbp - 72]       # 8-byte Folded Reload
        mov     rsi, qword ptr [rbp - 48]       # 8-byte Reload
        mov     rdx, rsi
        imul    rdx, rax
        add     rdi, rdx
        mov     qword ptr [rbp - 72], rdi       # 8-byte Spill
        imul    rcx, r10
        mov     rax, qword ptr [rbp - 64]       # 8-byte Reload
        mov     qword ptr [rbp - 144], rax      # 8-byte Spill
        mov     qword ptr [rbp - 168], rax      # 8-byte Spill
        mov     qword ptr [rbp - 56], rax       # 8-byte Spill
        mov     rdx, r11
        imul    rax, r11
        add     rax, rcx
        mov     rdi, r15
        imul    rdi, rbx
        add     rax, rdi
        mov     rdi, r13
        mov     r11, qword ptr [rbp - 192]      # 8-byte Reload
        imul    rdi, r11
        add     rax, rdi
        sub     rax, qword ptr [rbp - 88]       # 8-byte Folded Reload
        add     rax, qword ptr [rbp - 160]      # 8-byte Folded Reload
        mov     rdi, r8
        mov     rbx, qword ptr [rbp - 128]      # 8-byte Reload
        imul    rdi, rbx
        add     rax, rdi
        mov     qword ptr [rbp - 64], rax       # 8-byte Spill
        mov     rdi, r8
        imul    rdi, rdx
        imul    rsi, r10
        add     rsi, rdi
        mov     rax, r8
        imul    rax, r14
        mov     qword ptr [rbp - 88], rax       # 8-byte Spill
        imul    r14, r12
        add     r14, rsi
        mov     rdx, qword ptr [rbp - 56]       # 8-byte Reload
        imul    rdx, r11
        mov     qword ptr [rbp - 56], rdx       # 8-byte Spill
        imul    r11, r15
        add     r11, r14
        mov     rcx, r13
        mov     rax, qword ptr [rbp - 136]      # 8-byte Reload
        imul    rcx, rax
        sub     rcx, r11
        mov     rdx, qword ptr [rbp - 184]      # 8-byte Reload
        mov     r11, qword ptr [rbp - 96]       # 8-byte Reload
        imul    rdx, r11
        add     rcx, rdx
        add     rcx, qword ptr [rbp - 80]       # 8-byte Folded Reload
        mov     rdx, qword ptr [rbp - 144]      # 8-byte Reload
        imul    rdx, rbx
        add     rcx, rdx
        mov     rdx, r13
        imul    rdx, r10
        mov     rsi, r12
        imul    rsi, r10
        mov     qword ptr [rbp - 80], rsi       # 8-byte Spill
        mov     rsi, r8
        imul    r8, r10
        mov     rdi, qword ptr [rbp - 48]       # 8-byte Reload
        mov     r14, qword ptr [rbp - 104]      # 8-byte Reload
        imul    rdi, r14
        add     r8, rdi
        imul    rsi, rax
        mov     qword ptr [rbp - 48], rsi       # 8-byte Spill
        imul    rax, r12
        add     rax, r8
        mov     rdi, qword ptr [rbp - 176]      # 8-byte Reload
        imul    rdi, rbx
        mov     rsi, qword ptr [rbp - 112]      # 8-byte Reload
        imul    rsi, rbx
        mov     qword ptr [rbp - 112], rsi      # 8-byte Spill
        imul    rbx, r15
        mov     r10, qword ptr [rbp - 168]      # 8-byte Reload
        imul    r10, r11
        imul    r15, r11
        add     r15, rax
        mov     rsi, qword ptr [rbp - 224]      # 8-byte Reload
        sub     rsi, r15
        add     rsi, qword ptr [rbp - 200]      # 8-byte Folded Reload
        add     rsi, rdi
        mov     rax, r14
        imul    r12, r14
        add     r12, rdx
        add     r12, qword ptr [rbp - 88]       # 8-byte Folded Reload
        sub     r12, qword ptr [rbp - 240]      # 8-byte Folded Reload
        add     r12, r10
        add     r12, qword ptr [rbp - 216]      # 8-byte Folded Reload
        add     r12, qword ptr [rbp - 112]      # 8-byte Folded Reload
        imul    r13, rax
        add     r13, qword ptr [rbp - 80]       # 8-byte Folded Reload
        add     r13, qword ptr [rbp - 48]       # 8-byte Folded Reload
        add     r13, qword ptr [rbp - 56]       # 8-byte Folded Reload
        sub     r13, qword ptr [rbp - 232]      # 8-byte Folded Reload
        add     r13, qword ptr [rbp - 208]      # 8-byte Folded Reload
        add     r13, rbx
        mov     rax, qword ptr [rbp - 248]      # 8-byte Reload
        mov     qword ptr [rax], r9
        mov     rdx, qword ptr [rbp - 120]      # 8-byte Reload
        mov     qword ptr [rax + 8], rdx
        mov     rdx, qword ptr [rbp - 72]       # 8-byte Reload
        mov     qword ptr [rax + 16], rdx
        mov     rdx, qword ptr [rbp - 64]       # 8-byte Reload
        mov     qword ptr [rax + 24], rdx
        mov     qword ptr [rax + 32], rcx
        mov     qword ptr [rax + 40], rsi
        mov     qword ptr [rax + 48], r12
        mov     qword ptr [rax + 56], r13
        add     rsp, 80
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        pop     rbp
        ret
.Lfunc_end0:
        .size   "julia_*_3117", .Lfunc_end0-"julia_*_3117"
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits
The Float64 case vectorizes as expected:
        .text
        .file   "*"
        .globl  "julia_*_3119"                  # -- Begin function julia_*_3119
        .p2align        4, 0x90
        .type   "julia_*_3119",@function
"julia_*_3119":                         # @"julia_*_3119"
# %bb.0:                                # %top
        push    rbp
        mov     rbp, rsp
        vmovupd ymm6, ymmword ptr [rdx]
        mov     rax, rdi
        vmovupd ymm0, ymmword ptr [rdx + 32]
        vbroadcastsd    ymm2, qword ptr [rsi]
        vxorpd  xmm19, xmm19, xmm19
        vpermilpd       ymm1, ymm6, 5           # ymm1 = ymm6[1,0,3,2]
        vpermpd ymm4, ymm6, 78                  # ymm4 = ymm6[2,3,0,1]
        vpermpd ymm5, ymm6, 27                  # ymm5 = ymm6[3,2,1,0]
        vbroadcastsd    ymm7, qword ptr [rsi + 32]
        vmulpd  ymm8, ymm7, ymm6
        vfmadd213pd     ymm6, ymm2, ymm19       # ymm6 = (ymm2 * ymm6) + ymm19
        vbroadcastsd    ymm9, qword ptr [rsi + 8]
        vmulpd  ymm10, ymm9, ymm1
        vbroadcastsd    ymm11, qword ptr [rsi + 16]
        vaddpd  ymm6, ymm10, ymm6
        vmulpd  ymm10, ymm11, ymm4
        vaddpd  ymm12, ymm10, ymm6
        vsubpd  ymm6, ymm6, ymm10
        vbroadcastsd    ymm10, qword ptr [rsi + 24]
        vmulpd  ymm13, ymm10, ymm5
        vsubpd  ymm12, ymm12, ymm13
        vaddpd  ymm6, ymm13, ymm6
        vblendpd        ymm6, ymm12, ymm6, 10           # ymm6 = ymm12[0],ymm6[1],ymm12[2],ymm6[3]
        vmulpd  ymm7, ymm7, ymm0
        vaddpd  ymm12, ymm6, ymm7
        vsubpd  ymm6, ymm6, ymm7
        vbroadcastsd    ymm7, qword ptr [rsi + 40]
        vpermilpd       ymm13, ymm0, 5          # ymm13 = ymm0[1,0,3,2]
        vmulpd  ymm14, ymm13, ymm7
        vsubpd  ymm12, ymm12, ymm14
        vaddpd  ymm6, ymm14, ymm6
        vblendpd        ymm6, ymm12, ymm6, 6            # ymm6 = ymm12[0],ymm6[1,2],ymm12[3]
        vbroadcastsd    ymm12, qword ptr [rsi + 48]
        vpermpd ymm14, ymm0, 78                 # ymm14 = ymm0[2,3,0,1]
        vmulpd  ymm15, ymm12, ymm14
        vsubpd  ymm16, ymm6, ymm15
        vaddpd  ymm6, ymm15, ymm6
        vbroadcastsd    ymm15, qword ptr [rsi + 56]
        vpermpd ymm17, ymm0, 27                 # ymm17 = ymm0[3,2,1,0]
        vmulpd  ymm18, ymm15, ymm17
        vsubpd  ymm3, ymm16, ymm18
        vaddpd  ymm6, ymm6, ymm18
        vblendpd        ymm3, ymm3, ymm6, 12            # ymm3 = ymm3[0,1],ymm6[2,3]
        vmovupd ymmword ptr [rdi], ymm3
        vfmadd213pd     ymm0, ymm2, ymm19       # ymm0 = (ymm2 * ymm0) + ymm19
        vmulpd  ymm2, ymm9, ymm13
        vaddpd  ymm0, ymm2, ymm0
        vmulpd  ymm2, ymm11, ymm14
        vaddpd  ymm3, ymm0, ymm2
        vsubpd  ymm0, ymm0, ymm2
        vmulpd  ymm2, ymm10, ymm17
        vsubpd  ymm3, ymm3, ymm2
        vaddpd  ymm0, ymm0, ymm2
        vblendpd        ymm0, ymm3, ymm0, 10            # ymm0 = ymm3[0],ymm0[1],ymm3[2],ymm0[3]
        vaddpd  ymm2, ymm8, ymm0
        vsubpd  ymm0, ymm0, ymm8
        vmulpd  ymm1, ymm7, ymm1
        vsubpd  ymm2, ymm2, ymm1
        vaddpd  ymm0, ymm0, ymm1
        vblendpd        ymm0, ymm2, ymm0, 6             # ymm0 = ymm2[0],ymm0[1,2],ymm2[3]
        vmulpd  ymm1, ymm12, ymm4
        vsubpd  ymm2, ymm0, ymm1
        vaddpd  ymm0, ymm0, ymm1
        vmulpd  ymm1, ymm15, ymm5
        vsubpd  ymm2, ymm2, ymm1
        vaddpd  ymm0, ymm0, ymm1
        vblendpd        ymm0, ymm2, ymm0, 12            # ymm0 = ymm2[0,1],ymm0[2,3]
        vmovupd ymmword ptr [rdi + 32], ymm0
        pop     rbp
        vzeroupper
        ret
.Lfunc_end0:
        .size   "julia_*_3119", .Lfunc_end0-"julia_*_3119"
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits
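For reference, the dumps above can be reproduced with something along these lines (the exact invocation and flags I used may have differed slightly):

code_native(stdout, *, (typeof(x), typeof(y)); syntax=:intel, debuginfo=:none)    # Int64 case
code_native(stdout, *, (typeof(xx), typeof(yy)); syntax=:intel, debuginfo=:none)  # Float64 case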
So my question is: considering the only difference between x and xx (and y and yy) is the element type, why is the assembly output so starkly different?
I suspect this may have something to do with the fact that both data types are 512 bits wide (and don't fit into a single 256-bit AVX register), but that shouldn't make a difference in principle. The reason I suspect this is that with smaller data types, like CliffordNumber{VGA(2),T} or EvenCliffordNumber{VGA(3),T} (which have 4 elements each), the T === Int64 case always vectorizes just as the T === Float64 case does.
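For example, a quick 4-element check (with hypothetical coefficient values, and assuming the VGA(2) constructor takes its four coefficients positionally like the VGA(3) one above):

a = CliffordNumber{VGA(2), Int64}(1, 2, 3, 4)
b = CliffordNumber{VGA(2), Int64}(5, 6, 7, 8)
@code_native debuginfo=:none a * b   # this Int64 product does use vector registers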
This performance difference won't cause me any serious issues at the moment; I'm just interested in why this is happening.