A reason to use --math-mode=fast is that StaticArrays otherwise emits a lot of separate muls and adds. Is it possible for there to be a flag that just says it is okay to fuse additions and multiplications, without also opting into the less precise special-function implementations?
Maybe a --math-mode=muladd, just like how the @muladd macro applies only that subset of what @fastmath allows.
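For reference, per-call opt-in already exists via Base.muladd, which permits (but does not require) fusing the multiply and add, without any of the other fast-math relaxations. A hand-written 2x2 sketch, purely for illustration (this is not how StaticArrays generates its kernels):

# muladd(a, b, c) computes a*b + c and may lower to an fma instruction on
# hardware that has one, without requiring --math-mode=fast.
function mul2x2(A, B)
    c11 = muladd(A[1,1], B[1,1], A[1,2] * B[2,1])
    c12 = muladd(A[1,1], B[1,2], A[1,2] * B[2,2])
    c21 = muladd(A[2,1], B[1,1], A[2,2] * B[2,1])
    c22 = muladd(A[2,1], B[1,2], A[2,2] * B[2,2])
    return (c11, c12, c21, c22)   # row-major entries of A*B
end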
A motivation:
julia> using StaticArrays
julia> A = @SMatrix randn(4,4);
julia> B = @SMatrix randn(4,4);
julia> @code_native A * B
.text
; ┌ @ matrix_multiply.jl:9 within `*'
; │┌ @ matrix_multiply.jl:75 within `_mul'
; ││┌ @ matrix_multiply.jl:78 within `macro expansion'
; │││┌ @ matrix_multiply.jl:123 within `mul_unrolled'
; ││││┌ @ matrix_multiply.jl:138 within `macro expansion'
; │││││┌ @ matrix_multiply.jl:9 within `*'
vbroadcastsd (%rdx), %ymm4
vmovupd (%rsi), %ymm3
vmovupd 32(%rsi), %ymm2
vmovupd 64(%rsi), %ymm1
vmovupd 96(%rsi), %ymm0
vmulpd %ymm4, %ymm3, %ymm4
vbroadcastsd 8(%rdx), %ymm5
vmulpd %ymm5, %ymm2, %ymm5
; ││││└└
; ││││┌ @ float.jl:395 within `macro expansion'
vaddpd %ymm5, %ymm4, %ymm4
; ││││└
; ││││┌ @ matrix_multiply.jl:138 within `macro expansion'
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 16(%rdx), %ymm5
vmulpd %ymm5, %ymm1, %ymm5
; │││││└
; │││││┌ @ float.jl:395 within `+'
vaddpd %ymm5, %ymm4, %ymm4
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 24(%rdx), %ymm5
vmulpd %ymm5, %ymm0, %ymm5
; │││││└
; │││││┌ @ float.jl:395 within `+'
vaddpd %ymm5, %ymm4, %ymm4
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 32(%rdx), %ymm5
vmulpd %ymm5, %ymm3, %ymm5
vbroadcastsd 40(%rdx), %ymm6
vmulpd %ymm6, %ymm2, %ymm6
; │││││└
; │││││┌ @ float.jl:395 within `+'
vaddpd %ymm6, %ymm5, %ymm5
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 48(%rdx), %ymm6
vmulpd %ymm6, %ymm1, %ymm6
; │││││└
; │││││┌ @ float.jl:395 within `+'
vaddpd %ymm6, %ymm5, %ymm5
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 56(%rdx), %ymm6
vmulpd %ymm6, %ymm0, %ymm6
; │││││└
; │││││┌ @ float.jl:395 within `+'
vaddpd %ymm6, %ymm5, %ymm5
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 64(%rdx), %ymm6
vmulpd %ymm6, %ymm3, %ymm6
vbroadcastsd 72(%rdx), %ymm7
vmulpd %ymm7, %ymm2, %ymm7
; │││││└
; │││││┌ @ float.jl:395 within `+'
vaddpd %ymm7, %ymm6, %ymm6
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 80(%rdx), %ymm7
vmulpd %ymm7, %ymm1, %ymm7
; │││││└
; │││││┌ @ float.jl:395 within `+'
vaddpd %ymm7, %ymm6, %ymm6
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 88(%rdx), %ymm7
vmulpd %ymm7, %ymm0, %ymm7
; │││││└
; │││││┌ @ float.jl:395 within `+'
vaddpd %ymm7, %ymm6, %ymm6
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 96(%rdx), %ymm7
vmulpd %ymm7, %ymm3, %ymm3
vbroadcastsd 104(%rdx), %ymm7
vmulpd %ymm7, %ymm2, %ymm2
; │││││└
; │││││┌ @ float.jl:395 within `+'
vaddpd %ymm2, %ymm3, %ymm2
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 112(%rdx), %ymm3
vmulpd %ymm3, %ymm1, %ymm1
; │││││└
; │││││┌ @ float.jl:395 within `+'
vaddpd %ymm1, %ymm2, %ymm1
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 120(%rdx), %ymm2
vmulpd %ymm2, %ymm0, %ymm0
; │││││└
; │││││┌ @ float.jl:395 within `+'
vaddpd %ymm0, %ymm1, %ymm0
; │└└└└└
vmovupd %ymm4, (%rdi)
vmovupd %ymm5, 32(%rdi)
vmovupd %ymm6, 64(%rdi)
vmovupd %ymm0, 96(%rdi)
movq %rdi, %rax
vzeroupper
retq
nopl (%rax)
; └
versus starting Julia with --math-mode=fast:
julia> using StaticArrays
julia> A = @SMatrix randn(4,4);
julia> B = @SMatrix randn(4,4);
julia> @code_native A * B
.text
; ┌ @ matrix_multiply.jl:9 within `*'
; │┌ @ matrix_multiply.jl:75 within `_mul'
; ││┌ @ matrix_multiply.jl:78 within `macro expansion'
; │││┌ @ matrix_multiply.jl:123 within `mul_unrolled'
; ││││┌ @ matrix_multiply.jl:138 within `macro expansion'
; │││││┌ @ matrix_multiply.jl:9 within `*'
vbroadcastsd (%rdx), %ymm4
vmovupd (%rsi), %ymm3
vmovupd 32(%rsi), %ymm2
vmovupd 64(%rsi), %ymm1
vmovupd 96(%rsi), %ymm0
vmulpd %ymm3, %ymm4, %ymm4
vbroadcastsd 8(%rdx), %ymm5
; ││││└└
; ││││┌ @ float.jl:395 within `macro expansion'
vfmadd213pd %ymm4, %ymm2, %ymm5
; ││││└
; ││││┌ @ matrix_multiply.jl:138 within `macro expansion'
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 16(%rdx), %ymm6
; │││││└
; │││││┌ @ float.jl:395 within `+'
vfmadd213pd %ymm5, %ymm1, %ymm6
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 24(%rdx), %ymm4
; │││││└
; │││││┌ @ float.jl:395 within `+'
vfmadd213pd %ymm6, %ymm0, %ymm4
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 32(%rdx), %ymm5
vmulpd %ymm3, %ymm5, %ymm5
vbroadcastsd 40(%rdx), %ymm6
; │││││└
; │││││┌ @ float.jl:395 within `+'
vfmadd213pd %ymm5, %ymm2, %ymm6
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 48(%rdx), %ymm5
; │││││└
; │││││┌ @ float.jl:395 within `+'
vfmadd213pd %ymm6, %ymm1, %ymm5
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 56(%rdx), %ymm6
; │││││└
; │││││┌ @ float.jl:395 within `+'
vfmadd213pd %ymm5, %ymm0, %ymm6
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 64(%rdx), %ymm5
vmulpd %ymm3, %ymm5, %ymm5
vbroadcastsd 72(%rdx), %ymm7
; │││││└
; │││││┌ @ float.jl:395 within `+'
vfmadd213pd %ymm5, %ymm2, %ymm7
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 80(%rdx), %ymm5
; │││││└
; │││││┌ @ float.jl:395 within `+'
vfmadd213pd %ymm7, %ymm1, %ymm5
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 88(%rdx), %ymm7
; │││││└
; │││││┌ @ float.jl:395 within `+'
vfmadd213pd %ymm5, %ymm0, %ymm7
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 96(%rdx), %ymm5
vmulpd %ymm3, %ymm5, %ymm3
vbroadcastsd 104(%rdx), %ymm5
; │││││└
; │││││┌ @ float.jl:395 within `+'
vfmadd213pd %ymm3, %ymm2, %ymm5
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 112(%rdx), %ymm2
; │││││└
; │││││┌ @ float.jl:395 within `+'
vfmadd213pd %ymm5, %ymm1, %ymm2
; │││││└
; │││││┌ @ float.jl:399 within `*'
vbroadcastsd 120(%rdx), %ymm1
; │││││└
; │││││┌ @ float.jl:395 within `+'
vfmadd213pd %ymm2, %ymm0, %ymm1
; │└└└└└
vmovupd %ymm4, (%rdi)
vmovupd %ymm6, 32(%rdi)
vmovupd %ymm7, 64(%rdi)
vmovupd %ymm1, 96(%rdi)
movq %rdi, %rax
vzeroupper
retq
nopl (%rax,%rax)
; └
What does IEEE say about fma instructions? GCC 7.3 seemed willing to use vfmadd with only -O3 -march=haswell (though, oddly, it only used SSE, no AVX). That is no longer the case with GCC 8.2, however. I had thought the fused operation should be at least as accurate as the separate multiply and add.
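That intuition matches how fma works: it rounds once instead of twice, so an individual a*b + c can only get more accurate. A contrived check, with values chosen by hand so that the unfused version rounds the interesting bits away (worth re-running in a REPL):

a = 1.0 + 2.0^-52        # nextfloat(1.0)
b = 1.0 - 2.0^-52
a * b - 1.0              # product rounds to 1.0 first, so this gives 0.0
fma(a, b, -1.0)          # single rounding keeps the exact value, -2.0^-104 ≈ -4.93e-32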
In general, you should either do the relevant algebraic simplifications yourself instead of relying on the compiler to do it for you, or use much more pointed compiler hints like @simd, which allows re-association of floating-point operations in loops so that they can be vectorized.
It is interesting that @simd and @fastmath can both have about the same effect on a dot product.
julia> using BenchmarkTools
julia> x = randn(1024); y = randn(1024);
julia> function dot_product(x, y)
out = zero(promote_type(eltype(x),eltype(y)))
@inbounds for i ∈ eachindex(x,y)
out += x[i] * y[i]
end
out
end
dot_product (generic function with 1 method)
julia> @benchmark dot_product($x, $y)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.799 μs (0.00% GC)
median time: 1.801 μs (0.00% GC)
mean time: 1.805 μs (0.00% GC)
maximum time: 4.146 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> function dot_product(x, y)
out = zero(promote_type(eltype(x),eltype(y)))
@inbounds @simd for i ∈ eachindex(x,y)
out += x[i] * y[i]
end
out
end
dot_product (generic function with 1 method)
julia> @benchmark dot_product($x, $y)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 194.005 ns (0.00% GC)
median time: 194.193 ns (0.00% GC)
mean time: 196.170 ns (0.00% GC)
maximum time: 394.351 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 626
julia> function dot_product(x, y)
out = zero(promote_type(eltype(x),eltype(y)))
@fastmath @inbounds for i ∈ eachindex(x,y)
out += x[i] * y[i]
end
out
end
dot_product (generic function with 1 method)
julia> @benchmark dot_product($x, $y)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 194.174 ns (0.00% GC)
median time: 194.230 ns (0.00% GC)
mean time: 196.558 ns (0.00% GC)
maximum time: 335.214 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 626
julia> @benchmark $x' * $y
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 187.077 ns (0.00% GC)
median time: 187.396 ns (0.00% GC)
mean time: 189.133 ns (0.00% GC)
maximum time: 355.159 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 666
Aren’t these vectorizing loop transformations likely to make the answer more numerically accurate than explicitly following the serial definition, because they accrue partial sums in parallel?
Or is it folly to try to divide these transformations into the groups “won’t cause catastrophic problems” vs. “might cause catastrophic problems”?
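One way to poke at the first question empirically is to compare both loop variants against an extended-precision reference. A rough sketch, not a rigorous error analysis; serial_dot and simd_dot are just the two definitions from above, renamed so they can coexist:

function serial_dot(x, y)
    out = zero(promote_type(eltype(x), eltype(y)))
    @inbounds for i in eachindex(x, y)
        out += x[i] * y[i]
    end
    out
end

function simd_dot(x, y)
    out = zero(promote_type(eltype(x), eltype(y)))
    @inbounds @simd for i in eachindex(x, y)
        out += x[i] * y[i]
    end
    out
end

x = randn(1024); y = randn(1024)
ref = sum(big.(x) .* big.(y))                 # BigFloat reference
abs(Float64(big(serial_dot(x, y)) - ref))     # error of the serial loop
abs(Float64(big(simd_dot(x, y)) - ref))       # error of the re-associated loop

The @simd version keeps several independent partial sums, which tends to land closer to the reference on random data, but that is a tendency rather than a guarantee.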