I've come across an interesting observation in my package, CliffordNumbers.jl. For context, the multiplication done here is a geometric product (relevant code here), implemented as a grid multiply between blade coefficients (elements); in principle it can be expressed neatly as a series of permutes, vectorized multiplies, and vectorized adds. The CliffordNumber{VGA(3),T} instances are backed by an NTuple{8,T}.
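To sketch what I mean by that pattern, here is a minimal two-coefficient analogue (complex multiplication over a plain tuple). This is only an illustration of the broadcast/permute structure, not the actual kernel in CliffordNumbers.jl, and the name toymul is made up:

# Each output coefficient is a signed sum of products, written as broadcasts of
# one operand's coefficients over a permuted, sign-flipped copy of the other
# operand's coefficient tuple: (a + bi)(c + di) = (ac - bd) + (ad + bc)i.
toymul(x::NTuple{2,T}, y::NTuple{2,T}) where {T} =
    (x[1] .* y) .+ (x[2] .* (-y[2], y[1]))

toymul((1.0, 2.0), (3.0, 4.0))  # (-5.0, 10.0)

The full 8-coefficient geometric product works analogously, just with eight such broadcast terms and the appropriate sign handling. Here is the setup for the benchmarks: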
using CliffordNumbers, BenchmarkTools

# Two multivectors with Int64 coefficients
x = CliffordNumber{VGA(3), Int64}(0, 4, 2, 0, 0, 0, 0, 0)
y = CliffordNumber{VGA(3), Int64}(0, 0, 0, 0, 0, 6, 9, 0)

# Convert the scalar entries to Float64
xx = scalar_convert(Float64, x)
yy = scalar_convert(Float64, y)
If I benchmark each of these multiplications, I get significantly different results:
julia> @benchmark $x * $y
BenchmarkTools.Trial: 10000 samples with 998 evaluations.
 Range (min … max):  16.999 ns … 400.718 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     17.777 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.070 ns ±  15.576 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  17 ns           Histogram: log(frequency) by time          96.5 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark $xx * $yy
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  6.087 ns … 111.092 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.191 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.054 ns ±   4.577 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  6.09 ns          Histogram: log(frequency) by time          25.5 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
I thought it was very weird that (a) there was such a discrepancy, and (b) the Float64 case was much faster than the Int64 case. Looking at the machine code, I found the Int64 case fails to vectorize, even when the -O3 flag is used:
.text
.file "*"
.globl "julia_*_3117" # -- Begin function julia_*_3117
.p2align 4, 0x90
.type "julia_*_3117",@function
"julia_*_3117": # @"julia_*_3117"
# %bb.0: # %top
push rbp
mov rbp, rsp
push r15
push r14
push r13
push r12
push rbx
sub rsp, 80
mov r10, rdx
mov rdx, rsi
mov qword ptr [rbp - 248], rdi # 8-byte Spill
mov r11, qword ptr [r10 + 48]
mov r9, qword ptr [r10 + 40]
mov r8, qword ptr [r10 + 32]
mov rdi, qword ptr [r10 + 16]
mov r12, qword ptr [r10 + 8]
mov rsi, qword ptr [rsi + 32]
mov rcx, qword ptr [rdx + 24]
mov rax, r11
mov qword ptr [rbp - 136], rsi # 8-byte Spill
imul rax, rsi
mov rbx, r12
imul rbx, rcx
add rbx, rax
mov qword ptr [rbp - 56], rbx # 8-byte Spill
mov rax, rdi
imul rax, rsi
mov rsi, r9
imul rsi, rcx
add rsi, rax
mov qword ptr [rbp - 240], rsi # 8-byte Spill
mov rax, rcx
mov rbx, rcx
imul rax, rdi
mov r15, rdi
mov qword ptr [rbp - 48], rdi # 8-byte Spill
mov r14, qword ptr [rdx + 40]
mov rsi, r8
mov qword ptr [rbp - 64], r8 # 8-byte Spill
imul rsi, r14
add rsi, rax
mov qword ptr [rbp - 120], rsi # 8-byte Spill
mov rcx, qword ptr [rdx + 16]
mov qword ptr [rbp - 96], rcx # 8-byte Spill
mov rax, r12
imul rax, rcx
mov rdi, r11
imul rdi, r14
add rdi, rax
mov qword ptr [rbp - 88], rdi # 8-byte Spill
mov rax, r11
imul rax, rbx
mov rdi, rbx
mov qword ptr [rbp - 192], rbx # 8-byte Spill
mov r13, qword ptr [r10]
mov rbx, r13
imul rbx, r14
add rbx, rax
mov qword ptr [rbp - 224], rbx # 8-byte Spill
mov rbx, r9
mov rax, r9
imul rax, rcx
mov rsi, r15
imul rsi, r14
add rsi, rax
mov qword ptr [rbp - 232], rsi # 8-byte Spill
mov rax, qword ptr [rdx + 56]
mov qword ptr [rbp - 104], rax # 8-byte Spill
mov r15, qword ptr [r10 + 56]
mov r9, r15
imul r9, rax
mov rcx, qword ptr [rdx + 48]
mov qword ptr [rbp - 152], rcx # 8-byte Spill
mov rax, r11
imul rax, rcx
add rax, r9
mov r9, rbx
mov rcx, rbx
imul r9, r14
add r9, rax
mov r8, qword ptr [r10 + 24]
mov r10, rdi
imul r10, r8
add r10, r9
mov rsi, qword ptr [rbp - 64] # 8-byte Reload
mov r9, rsi
mov rbx, qword ptr [rbp - 136] # 8-byte Reload
imul r9, rbx
sub r9, r10
mov rax, qword ptr [rbp - 48] # 8-byte Reload
mov rdi, rax
imul rdi, qword ptr [rbp - 96] # 8-byte Folded Reload
add r9, rdi
mov r10, qword ptr [rdx + 8]
mov rdi, r12
imul rdi, r10
add r9, rdi
mov rdx, qword ptr [rdx]
mov qword ptr [rbp - 128], rdx # 8-byte Spill
mov rdi, r13
imul rdi, rdx
add r9, rdi
mov rdi, r13
imul rdi, r10
mov rdx, r8
imul rdx, r10
mov qword ptr [rbp - 72], rdx # 8-byte Spill
mov rdx, rax
imul rdx, r10
mov qword ptr [rbp - 160], rdx # 8-byte Spill
mov rax, rcx
imul rax, r10
mov qword ptr [rbp - 80], rax # 8-byte Spill
mov rax, rsi
imul rax, r10
mov qword ptr [rbp - 200], rax # 8-byte Spill
mov rax, r15
imul rax, r10
mov qword ptr [rbp - 216], rax # 8-byte Spill
imul r10, r11
mov qword ptr [rbp - 208], r10 # 8-byte Spill
mov qword ptr [rbp - 112], r11 # 8-byte Spill
mov qword ptr [rbp - 184], r11 # 8-byte Spill
imul r11, qword ptr [rbp - 104] # 8-byte Folded Reload
mov rdx, r15
mov r10, qword ptr [rbp - 152] # 8-byte Reload
imul rdx, r10
add rdx, r11
mov r11, rcx
imul r11, rbx
add r11, rdx
mov rdx, r8
mov rsi, qword ptr [rbp - 96] # 8-byte Reload
imul rdx, rsi
add rdx, r11
mov r11, qword ptr [rbp - 120] # 8-byte Reload
sub r11, rdx
add r11, rdi
mov rdx, r12
mov rax, qword ptr [rbp - 128] # 8-byte Reload
imul rdx, rax
add r11, rdx
mov qword ptr [rbp - 120], r11 # 8-byte Spill
mov rdx, qword ptr [rbp - 64] # 8-byte Reload
imul rdx, r10
mov rdi, rcx
mov qword ptr [rbp - 176], rcx # 8-byte Spill
mov r11, qword ptr [rbp - 104] # 8-byte Reload
imul rdi, r11
add rdi, rdx
mov rdx, r15
imul rdx, r14
add rdi, rdx
sub rdi, qword ptr [rbp - 56] # 8-byte Folded Reload
mov rdx, r13
imul rdx, rsi
add rdi, rdx
add rdi, qword ptr [rbp - 72] # 8-byte Folded Reload
mov rsi, qword ptr [rbp - 48] # 8-byte Reload
mov rdx, rsi
imul rdx, rax
add rdi, rdx
mov qword ptr [rbp - 72], rdi # 8-byte Spill
imul rcx, r10
mov rax, qword ptr [rbp - 64] # 8-byte Reload
mov qword ptr [rbp - 144], rax # 8-byte Spill
mov qword ptr [rbp - 168], rax # 8-byte Spill
mov qword ptr [rbp - 56], rax # 8-byte Spill
mov rdx, r11
imul rax, r11
add rax, rcx
mov rdi, r15
imul rdi, rbx
add rax, rdi
mov rdi, r13
mov r11, qword ptr [rbp - 192] # 8-byte Reload
imul rdi, r11
add rax, rdi
sub rax, qword ptr [rbp - 88] # 8-byte Folded Reload
add rax, qword ptr [rbp - 160] # 8-byte Folded Reload
mov rdi, r8
mov rbx, qword ptr [rbp - 128] # 8-byte Reload
imul rdi, rbx
add rax, rdi
mov qword ptr [rbp - 64], rax # 8-byte Spill
mov rdi, r8
imul rdi, rdx
imul rsi, r10
add rsi, rdi
mov rax, r8
imul rax, r14
mov qword ptr [rbp - 88], rax # 8-byte Spill
imul r14, r12
add r14, rsi
mov rdx, qword ptr [rbp - 56] # 8-byte Reload
imul rdx, r11
mov qword ptr [rbp - 56], rdx # 8-byte Spill
imul r11, r15
add r11, r14
mov rcx, r13
mov rax, qword ptr [rbp - 136] # 8-byte Reload
imul rcx, rax
sub rcx, r11
mov rdx, qword ptr [rbp - 184] # 8-byte Reload
mov r11, qword ptr [rbp - 96] # 8-byte Reload
imul rdx, r11
add rcx, rdx
add rcx, qword ptr [rbp - 80] # 8-byte Folded Reload
mov rdx, qword ptr [rbp - 144] # 8-byte Reload
imul rdx, rbx
add rcx, rdx
mov rdx, r13
imul rdx, r10
mov rsi, r12
imul rsi, r10
mov qword ptr [rbp - 80], rsi # 8-byte Spill
mov rsi, r8
imul r8, r10
mov rdi, qword ptr [rbp - 48] # 8-byte Reload
mov r14, qword ptr [rbp - 104] # 8-byte Reload
imul rdi, r14
add r8, rdi
imul rsi, rax
mov qword ptr [rbp - 48], rsi # 8-byte Spill
imul rax, r12
add rax, r8
mov rdi, qword ptr [rbp - 176] # 8-byte Reload
imul rdi, rbx
mov rsi, qword ptr [rbp - 112] # 8-byte Reload
imul rsi, rbx
mov qword ptr [rbp - 112], rsi # 8-byte Spill
imul rbx, r15
mov r10, qword ptr [rbp - 168] # 8-byte Reload
imul r10, r11
imul r15, r11
add r15, rax
mov rsi, qword ptr [rbp - 224] # 8-byte Reload
sub rsi, r15
add rsi, qword ptr [rbp - 200] # 8-byte Folded Reload
add rsi, rdi
mov rax, r14
imul r12, r14
add r12, rdx
add r12, qword ptr [rbp - 88] # 8-byte Folded Reload
sub r12, qword ptr [rbp - 240] # 8-byte Folded Reload
add r12, r10
add r12, qword ptr [rbp - 216] # 8-byte Folded Reload
add r12, qword ptr [rbp - 112] # 8-byte Folded Reload
imul r13, rax
add r13, qword ptr [rbp - 80] # 8-byte Folded Reload
add r13, qword ptr [rbp - 48] # 8-byte Folded Reload
add r13, qword ptr [rbp - 56] # 8-byte Folded Reload
sub r13, qword ptr [rbp - 232] # 8-byte Folded Reload
add r13, qword ptr [rbp - 208] # 8-byte Folded Reload
add r13, rbx
mov rax, qword ptr [rbp - 248] # 8-byte Reload
mov qword ptr [rax], r9
mov rdx, qword ptr [rbp - 120] # 8-byte Reload
mov qword ptr [rax + 8], rdx
mov rdx, qword ptr [rbp - 72] # 8-byte Reload
mov qword ptr [rax + 16], rdx
mov rdx, qword ptr [rbp - 64] # 8-byte Reload
mov qword ptr [rax + 24], rdx
mov qword ptr [rax + 32], rcx
mov qword ptr [rax + 40], rsi
mov qword ptr [rax + 48], r12
mov qword ptr [rax + 56], r13
add rsp, 80
pop rbx
pop r12
pop r13
pop r14
pop r15
pop rbp
ret
.Lfunc_end0:
.size "julia_*_3117", .Lfunc_end0-"julia_*_3117"
# -- End function
.section ".note.GNU-stack","",@progbits
The Float64 case vectorizes as expected:
.text
.file "*"
.globl "julia_*_3119" # -- Begin function julia_*_3119
.p2align 4, 0x90
.type "julia_*_3119",@function
"julia_*_3119": # @"julia_*_3119"
# %bb.0: # %top
push rbp
mov rbp, rsp
vmovupd ymm6, ymmword ptr [rdx]
mov rax, rdi
vmovupd ymm0, ymmword ptr [rdx + 32]
vbroadcastsd ymm2, qword ptr [rsi]
vxorpd xmm19, xmm19, xmm19
vpermilpd ymm1, ymm6, 5 # ymm1 = ymm6[1,0,3,2]
vpermpd ymm4, ymm6, 78 # ymm4 = ymm6[2,3,0,1]
vpermpd ymm5, ymm6, 27 # ymm5 = ymm6[3,2,1,0]
vbroadcastsd ymm7, qword ptr [rsi + 32]
vmulpd ymm8, ymm7, ymm6
vfmadd213pd ymm6, ymm2, ymm19 # ymm6 = (ymm2 * ymm6) + ymm19
vbroadcastsd ymm9, qword ptr [rsi + 8]
vmulpd ymm10, ymm9, ymm1
vbroadcastsd ymm11, qword ptr [rsi + 16]
vaddpd ymm6, ymm10, ymm6
vmulpd ymm10, ymm11, ymm4
vaddpd ymm12, ymm10, ymm6
vsubpd ymm6, ymm6, ymm10
vbroadcastsd ymm10, qword ptr [rsi + 24]
vmulpd ymm13, ymm10, ymm5
vsubpd ymm12, ymm12, ymm13
vaddpd ymm6, ymm13, ymm6
vblendpd ymm6, ymm12, ymm6, 10 # ymm6 = ymm12[0],ymm6[1],ymm12[2],ymm6[3]
vmulpd ymm7, ymm7, ymm0
vaddpd ymm12, ymm6, ymm7
vsubpd ymm6, ymm6, ymm7
vbroadcastsd ymm7, qword ptr [rsi + 40]
vpermilpd ymm13, ymm0, 5 # ymm13 = ymm0[1,0,3,2]
vmulpd ymm14, ymm13, ymm7
vsubpd ymm12, ymm12, ymm14
vaddpd ymm6, ymm14, ymm6
vblendpd ymm6, ymm12, ymm6, 6 # ymm6 = ymm12[0],ymm6[1,2],ymm12[3]
vbroadcastsd ymm12, qword ptr [rsi + 48]
vpermpd ymm14, ymm0, 78 # ymm14 = ymm0[2,3,0,1]
vmulpd ymm15, ymm12, ymm14
vsubpd ymm16, ymm6, ymm15
vaddpd ymm6, ymm15, ymm6
vbroadcastsd ymm15, qword ptr [rsi + 56]
vpermpd ymm17, ymm0, 27 # ymm17 = ymm0[3,2,1,0]
vmulpd ymm18, ymm15, ymm17
vsubpd ymm3, ymm16, ymm18
vaddpd ymm6, ymm6, ymm18
vblendpd ymm3, ymm3, ymm6, 12 # ymm3 = ymm3[0,1],ymm6[2,3]
vmovupd ymmword ptr [rdi], ymm3
vfmadd213pd ymm0, ymm2, ymm19 # ymm0 = (ymm2 * ymm0) + ymm19
vmulpd ymm2, ymm9, ymm13
vaddpd ymm0, ymm2, ymm0
vmulpd ymm2, ymm11, ymm14
vaddpd ymm3, ymm0, ymm2
vsubpd ymm0, ymm0, ymm2
vmulpd ymm2, ymm10, ymm17
vsubpd ymm3, ymm3, ymm2
vaddpd ymm0, ymm0, ymm2
vblendpd ymm0, ymm3, ymm0, 10 # ymm0 = ymm3[0],ymm0[1],ymm3[2],ymm0[3]
vaddpd ymm2, ymm8, ymm0
vsubpd ymm0, ymm0, ymm8
vmulpd ymm1, ymm7, ymm1
vsubpd ymm2, ymm2, ymm1
vaddpd ymm0, ymm0, ymm1
vblendpd ymm0, ymm2, ymm0, 6 # ymm0 = ymm2[0],ymm0[1,2],ymm2[3]
vmulpd ymm1, ymm12, ymm4
vsubpd ymm2, ymm0, ymm1
vaddpd ymm0, ymm0, ymm1
vmulpd ymm1, ymm15, ymm5
vsubpd ymm2, ymm2, ymm1
vaddpd ymm0, ymm0, ymm1
vblendpd ymm0, ymm2, ymm0, 12 # ymm0 = ymm2[0,1],ymm0[2,3]
vmovupd ymmword ptr [rdi + 32], ymm0
pop rbp
vzeroupper
ret
.Lfunc_end0:
.size "julia_*_3119", .Lfunc_end0-"julia_*_3119"
# -- End function
.section ".note.GNU-stack","",@progbits
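For anyone who wants to reproduce these dumps, something like the following should work (this invocation is my assumption; the listings above may have been generated differently):

# In a script, `using InteractiveUtils` is needed for @code_native; the REPL loads it automatically.
@code_native debuginfo=:none x * y     # Int64 kernel: scalar imul chain
@code_native debuginfo=:none xx * yy   # Float64 kernel: packed ymm operations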
So my question is: considering the only difference between x and xx (and y and yy) is the element type, why is the assembly output so starkly different?
I suspect this may have something to do with the fact that both data types are 512 bits wide (and don't fit into a single 256-bit AVX register), but that shouldn't make a difference in principle. The reason I suspect this is that if I use smaller data types, like CliffordNumber{VGA(2),T} or EvenCliffordNumber{VGA(3),T} (which have 4 elements each), the T === Int64 case always vectorizes just as the T === Float64 case does.
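For example, the 4-element check looks roughly like this (I'm assuming EvenCliffordNumber takes its coefficients positionally, analogous to the CliffordNumber constructor above):

# Even subalgebra of VGA(3): 4 coefficients (scalar + three bivectors), 256 bits total.
a = EvenCliffordNumber{VGA(3), Int64}(1, 2, 3, 4)
b = EvenCliffordNumber{VGA(3), Int64}(5, 6, 7, 8)
@code_native debuginfo=:none a * b  # vectorizes even with Int64 elements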
This performance difference won't cause me any serious issues at the moment; I'm just interested in why this is happening.