sum(abs2, v) slow for complex vector


#1

I have been thinking that sum(abs2, v) is a more efficient way to calculate the sum of the absolute squares of the entries of a vector v than v⋅v. It is indeed the case for a real vector:

julia> VERSION
v"0.6.3-pre.0"

julia> using BenchmarkTools

julia> v = rand(10000);

julia> @btime $v⋅$v;
  2.659 μs (0 allocations: 0 bytes)

julia> @btime sum(abs2, $v);
  944.071 ns (0 allocations: 0 bytes)

I thought sum(abs2, v) would be an even better choice than v⋅v for a complex vector v, because the former returns a real number whereas the latter returns a complex number (with zero imaginary part). On the contrary, v⋅v turns out to be faster for a complex vector:

julia> v = rand(Complex128, 10000);

julia> @btime $v⋅$v;
  4.124 μs (1 allocation: 32 bytes)

julia> @btime sum(abs2, $v);
  12.554 μs (0 allocations: 0 bytes)

Not sure exactly why this is happening. Any insights?


#2

SIMD seems to need some help:

julia> using BenchmarkTools

julia> v = rand(Complex128, 1000); #only 1000, not 10_000

julia> @benchmark $v' * $v
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     593.893 ns (0.00% GC)
  median time:      594.764 ns (0.00% GC)
  mean time:        629.369 ns (0.00% GC)
  maximum time:     1.130 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     178

julia> @benchmark sum(abs2, $v)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.229 μs (0.00% GC)
  median time:      1.230 μs (0.00% GC)
  mean time:        1.267 μs (0.00% GC)
  maximum time:     4.412 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> function self_dot(x::AbstractArray{Complex{T}}) where T
           out = zero(T)
           @simd for xi ∈ x
               out += real(xi)*real(xi)+imag(xi)*imag(xi)
           end
           out
       end
self_dot (generic function with 1 method)

julia> @benchmark self_dot($v)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.232 μs (0.00% GC)
  median time:      1.233 μs (0.00% GC)
  mean time:        1.291 μs (0.00% GC)
  maximum time:     9.393 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> function self_dot2(x::AbstractArray{Complex{T}}) where T
           out = zero(T)
           n, r = divrem(length(x),2)
           @inbounds begin
           @simd for i ∈ 1:n 
               out += real(x[2i-1])*real(x[2i-1])+imag(x[2i-1])*imag(x[2i-1])+real(x[2i])*real(x[2i])+imag(x[2i])*imag(x[2i])
           end
           if r == 1
               out += real(x[end])*real(x[end])+imag(x[end])*imag(x[end])
           end
           end
           out
       end
self_dot2 (generic function with 1 method)

julia> @benchmark self_dot2($v)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     704.069 ns (0.00% GC)
  median time:      727.104 ns (0.00% GC)
  mean time:        741.079 ns (0.00% GC)
  maximum time:     1.056 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     144

Still slower, but close.


#3

The scalar product seems to exploit the fact that it is multiplying a vector by itself:

julia> v = rand(Complex128, 10000);

julia> @btime $v⋅$v;
  3.598 μs (1 allocation: 32 bytes)

julia> u = rand(Complex128, 10000);

julia> @btime $v⋅$u;
  8.041 μs (1 allocation: 32 bytes)

so I’m guessing that the BLAS routine checks for identical pointers and avoids computing the imaginary part when it doesn’t need to.
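One quick way to check that guess empirically (this is just a sketch, not from the original posts) is to time the self-dot against a dot with an identical copy of the vector: same values, different pointer, so any aliasing shortcut in BLAS can't kick in. Note ComplexF64 is the post-0.7 spelling of Complex128.

```julia
using LinearAlgebra  # `dot` lives here on Julia ≥ 0.7

v = rand(ComplexF64, 10_000)  # ComplexF64 == Complex128 on 0.6
w = copy(v)                   # identical values, different memory

# The results must agree either way; only the timing should differ
# if BLAS special-cases aliased arguments.
@time dot(v, v)  # aliased inputs
@time dot(v, w)  # distinct buffers, same values
```

If the aliased call is consistently faster at the same result, that supports the pointer-check theory (a timing gap could also come from caching effects, so take one-off `@time` numbers with a grain of salt).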

The fastest I’ve found is:

julia> Base.sum(::typeof(abs2), v::Array{Complex{T},1}) where {T<:Real} = sum(abs2, reinterpret(T, v))

julia> @btime sum(abs2, $v);
  2.283 μs (2 allocations: 80 bytes)
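A caveat on that snippet: overloading Base.sum for types you don't own is type piracy and silently changes behavior for every other caller of sum(abs2, ::Vector{<:Complex}). A non-pirating sketch of the same trick, using a helper name (sumabs2, invented here) instead of an overload:

```julia
# Same reinterpret trick, but in a function we own rather than a
# Base.sum overload (the name `sumabs2` is hypothetical).
sumabs2(v::Vector{Complex{T}}) where {T<:Real} = sum(abs2, reinterpret(T, v))
```

This keeps the speedup local to code that opts in, without altering sum for anyone else.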

#4

I’m guessing the reason why reinterpret-ing as a real vector is faster is that the compiler is free to choose the order in which elements are summed, and this enables effective use of SIMD instructions. Addition is not associative for floating-point numbers, so when abs2 explicitly adds the squared real and imaginary parts pairwise inside the sum reduction, the compiler is not allowed to reorder those additions. (If the two numbers in a pair were of opposite sign, reordering could have a significant effect on the result, but here we know that they are both nonnegative.)