# Sum(abs2, v) slow for complex vector

I have been thinking that `sum(abs2, v)` is a more efficient way to calculate the sum of the absolute squares of the entries of a vector `v` than `v⋅v`. That is indeed the case for a real vector:

```
julia> VERSION
v"0.6.3-pre.0"

julia> using BenchmarkTools

julia> v = rand(10000);

julia> @btime $v⋅$v;
  2.659 μs (0 allocations: 0 bytes)

julia> @btime sum(abs2, $v);
  944.071 ns (0 allocations: 0 bytes)
```

I thought `sum(abs2, v)` would be an even better choice than `v⋅v` for a complex vector `v`, because the former returns a real number whereas the latter returns a complex number (with a zero imaginary part). On the contrary, `v⋅v` turns out to be faster for a complex vector:

```
julia> v = rand(Complex128, 10000);

julia> @btime $v⋅$v;
  4.124 μs (1 allocation: 32 bytes)

julia> @btime sum(abs2, $v);
  12.554 μs (0 allocations: 0 bytes)
```

Not sure exactly why this is happening. Any insights?

`@simd` seems to need some help:

```
julia> using BenchmarkTools

julia> v = rand(Complex128, 1000); # only 1000, not 10_000

julia> @benchmark $v' * $v
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     593.893 ns (0.00% GC)
  median time:      594.764 ns (0.00% GC)
  mean time:        629.369 ns (0.00% GC)
  maximum time:     1.130 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     178

julia> @benchmark sum(abs2, $v)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.229 μs (0.00% GC)
  median time:      1.230 μs (0.00% GC)
  mean time:        1.267 μs (0.00% GC)
  maximum time:     4.412 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> function self_dot(x::AbstractArray{Complex{T}}) where T
           out = zero(T)
           @simd for xi ∈ x
               out += real(xi)*real(xi) + imag(xi)*imag(xi)
           end
           out
       end
self_dot (generic function with 1 method)

julia> @benchmark self_dot($v)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.232 μs (0.00% GC)
  median time:      1.233 μs (0.00% GC)
  mean time:        1.291 μs (0.00% GC)
  maximum time:     9.393 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> function self_dot2(x::AbstractArray{Complex{T}}) where T
           out = zero(T)
           n, r = divrem(length(x), 2)
           @inbounds begin
               @simd for i ∈ 1:n
                   out += real(x[2i-1])*real(x[2i-1]) + imag(x[2i-1])*imag(x[2i-1]) +
                          real(x[2i])*real(x[2i]) + imag(x[2i])*imag(x[2i])
               end
               if r == 1
                   out += real(x[end])*real(x[end]) + imag(x[end])*imag(x[end])
               end
           end
           out
       end
self_dot2 (generic function with 1 method)

julia> @benchmark self_dot2($v)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     704.069 ns (0.00% GC)
  median time:      727.104 ns (0.00% GC)
  mean time:        741.079 ns (0.00% GC)
  maximum time:     1.056 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     144
```

Still slower, but close.


The scalar product seems to exploit the fact that it is multiplying a vector by itself:

```
julia> v = rand(Complex128, 10000);

julia> @btime $v⋅$v;
  3.598 μs (1 allocation: 32 bytes)

julia> u = rand(Complex128, 10000);

julia> @btime $v⋅$u;
  8.041 μs (1 allocation: 32 bytes)
```

so I’m guessing that the BLAS routine checks for identical pointers and avoids computing the complex part when it doesn’t need to.
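One way to probe that guess (my own experiment, not something from the BLAS source): time `v⋅v` against the dot product of `v` with a copy of itself. The copy has identical values but a different address, so any identical-pointer fast path would be skipped for it:

```julia
using BenchmarkTools
# Note: on Julia ≥ 0.7 this also needs `using LinearAlgebra` for ⋅,
# and `ComplexF64` in place of `Complex128`.

v = rand(Complex128, 10000)
w = copy(v)  # same values as v, but a different memory address

# If the BLAS dot routine special-cases identical pointers, the first
# timing should beat the second even though the arithmetic is identical.
@btime $v ⋅ $v
@btime $v ⋅ $w
```

The results are of course numerically equal; only the timings would reveal a pointer-equality shortcut.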

The fastest I’ve found is:

```
julia> Base.sum(::typeof(abs2), v::Array{Complex{T},1}) where {T<:Real} = sum(abs2, reinterpret(T, v))

julia> @btime sum(abs2, $v);
  2.283 μs (2 allocations: 80 bytes)
```
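Note that redefining `Base.sum` on a `Base` type is type piracy; a safer pattern is to put the same trick behind a helper function of your own (the name `sumabs2` below is mine, not part of any API):

```julia
# Same reinterpret trick without touching Base: view the memory of the
# complex vector as 2n reals and sum their squares.
sumabs2(v::Vector{Complex{T}}) where {T<:Real} = sum(abs2, reinterpret(T, v))

v = rand(Complex128, 10000)   # ComplexF64 on Julia ≥ 0.7
sumabs2(v) ≈ real(v ⋅ v)      # true: Σ (re² + im²) is exactly Σ abs2
```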

I’m guessing the reason why `reinterpret`-ing as a real vector is faster is that the compiler is then free to choose the order in which elements are summed, which enables effective use of SIMD instructions. Addition is not associative for floating-point numbers, so when `abs2` explicitly adds the real and imaginary parts pairwise inside the `sum` reduction, the compiler must not override that order. (If the two numbers in a pair were of opposite sign, reordering could significantly affect the result, but here we know that they are both nonnegative.)
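A minimal illustration of that non-associativity, i.e. of why the compiler may not reorder a floating-point sum on its own:

```julia
a, b, c = 1e16, -1e16, 1.0

(a + b) + c   # 1.0: the large terms cancel first, then c is added
a + (b + c)   # 0.0: c is absorbed into b (below its precision) first
```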