I have been thinking that sum(abs2, v) is a more efficient way to calculate the sum of the absolute squares of the entries of a vector v than v⋅v. It is indeed a case for a real vector:
julia> VERSION
v"0.6.3-pre.0"
julia> using BenchmarkTools
julia> v = rand(10000);
julia> @btime $v⋅$v;
2.659 μs (0 allocations: 0 bytes)
julia> @btime sum(abs2, $v);
944.071 ns (0 allocations: 0 bytes)
I thought sum(abs2, v) is an even better choice than v⋅v for a complex vector v, because the former returns a real number whereas the latter returns a complex number (with a zero imaginary part). On the contrary, v⋅v turns out to be faster for a complex vector:
julia> using BenchmarkTools
julia> v = rand(Complex128, 1000); #only 1000, not 10_000
julia> @benchmark $v' * $v
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 593.893 ns (0.00% GC)
median time: 594.764 ns (0.00% GC)
mean time: 629.369 ns (0.00% GC)
maximum time: 1.130 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 178
julia> @benchmark sum(abs2, $v)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.229 μs (0.00% GC)
median time: 1.230 μs (0.00% GC)
mean time: 1.267 μs (0.00% GC)
maximum time: 4.412 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> function self_dot(x::AbstractArray{Complex{T}}) where T
out = zero(T)
@simd for xi ∈ x
out += real(xi)*real(xi)+imag(xi)*imag(xi)
end
out
end
self_dot (generic function with 1 method)
julia> @benchmark self_dot($v)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.232 μs (0.00% GC)
median time: 1.233 μs (0.00% GC)
mean time: 1.291 μs (0.00% GC)
maximum time: 9.393 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> function self_dot2(x::AbstractArray{Complex{T}}) where T
out = zero(T)
n, r = divrem(length(x),2)
@inbounds begin
@simd for i ∈ 1:n
out += real(x[2i-1])*real(x[2i-1])+imag(x[2i-1])*imag(x[2i-1])+real(x[2i])*real(x[2i])+imag(x[2i])*imag(x[2i])
end
if r == 1
out += real(x[end])*real(x[end])+imag(x[end])*imag(x[end])
end
end
out
end
self_dot2 (generic function with 1 method)
julia> @benchmark self_dot2($v)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 704.069 ns (0.00% GC)
median time: 727.104 ns (0.00% GC)
mean time: 741.079 ns (0.00% GC)
maximum time: 1.056 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 144
I’m guessing the reason why reinterpret-ing as a real vector is faster is because the compiler is free to choose the order in which elements are summed and this enables effective use of SIMD instructions. Addition is not associative for floating point numbers, so when abs2 explicitly adds elements pairwise inside the sum reduction, the compiler should not override that. (If the two numbers in a pair were of opposite sign, this could have a significant effect on the result, but here we know that they are both nonnegaitive.)