sum(abs2, v) slow for complex vector


#1

I have been thinking that sum(abs2, v) is a more efficient way to calculate the sum of the absolute squares of the entries of a vector v than v⋅v. It is indeed the case for a real vector:

julia> VERSION
v"0.6.3-pre.0"

julia> using BenchmarkTools

julia> v = rand(10000);

julia> @btime $v⋅$v;
  2.659 μs (0 allocations: 0 bytes)

julia> @btime sum(abs2, $v);
  944.071 ns (0 allocations: 0 bytes)

I thought sum(abs2, v) would be an even better choice than v⋅v for a complex vector v, because the former returns a real number whereas the latter returns a complex number (with zero imaginary part). On the contrary, v⋅v turns out to be faster for a complex vector:

julia> v = rand(Complex128, 10000);

julia> @btime $v⋅$v;
  4.124 μs (1 allocation: 32 bytes)

julia> @btime sum(abs2, $v);
  12.554 μs (0 allocations: 0 bytes)

Not sure exactly why this is happening. Any insights?


#2

SIMD seems to need some help:

julia> using BenchmarkTools

julia> v = rand(Complex128, 1000); #only 1000, not 10_000

julia> @benchmark $v' * $v
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     593.893 ns (0.00% GC)
  median time:      594.764 ns (0.00% GC)
  mean time:        629.369 ns (0.00% GC)
  maximum time:     1.130 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     178

julia> @benchmark sum(abs2, $v)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.229 μs (0.00% GC)
  median time:      1.230 μs (0.00% GC)
  mean time:        1.267 μs (0.00% GC)
  maximum time:     4.412 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> function self_dot(x::AbstractArray{Complex{T}}) where T
           out = zero(T)
           @simd for xi ∈ x
               out += real(xi)*real(xi)+imag(xi)*imag(xi)
           end
           out
       end
self_dot (generic function with 1 method)

julia> @benchmark self_dot($v)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.232 μs (0.00% GC)
  median time:      1.233 μs (0.00% GC)
  mean time:        1.291 μs (0.00% GC)
  maximum time:     9.393 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> function self_dot2(x::AbstractArray{Complex{T}}) where T
           out = zero(T)
           n, r = divrem(length(x),2)
           @inbounds begin
           @simd for i ∈ 1:n 
               out += real(x[2i-1])*real(x[2i-1])+imag(x[2i-1])*imag(x[2i-1])+real(x[2i])*real(x[2i])+imag(x[2i])*imag(x[2i])
           end
           if r == 1
               out += real(x[end])*real(x[end])+imag(x[end])*imag(x[end])
           end
           end
           out
       end
self_dot2 (generic function with 1 method)

julia> @benchmark self_dot2($v)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     704.069 ns (0.00% GC)
  median time:      727.104 ns (0.00% GC)
  mean time:        741.079 ns (0.00% GC)
  maximum time:     1.056 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     144

Still slower, but close.


#3

The scalar product seems to exploit the fact that it is multiplying a vector by itself:

julia> v = rand(Complex128, 10000);

julia> @btime $v⋅$v;
  3.598 μs (1 allocation: 32 bytes)

julia> u = rand(Complex128, 10000);

julia> @btime $v⋅$u;
  8.041 μs (1 allocation: 32 bytes)

so I’m guessing that the BLAS routine checks for identical pointers and avoids computing the imaginary part when it doesn’t need to.
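One quick way to check that guess empirically (this is just a sketch, not from the original posts) is to time the self-dot against a dot with an identical copy of the vector: same values, different pointer, so any aliasing shortcut in BLAS can't kick in. Note ComplexF64 is the post-0.7 spelling of Complex128.

```julia
using LinearAlgebra  # `dot` lives here on Julia ≥ 0.7

v = rand(ComplexF64, 10_000)  # ComplexF64 == Complex128 on 0.6
w = copy(v)                   # identical values, different memory

# The results must agree either way; only the timing should differ
# if BLAS special-cases aliased arguments.
@time dot(v, v)  # aliased inputs
@time dot(v, w)  # distinct buffers, same values
```

If the aliased call is consistently faster at the same result, that supports the pointer-check theory (a timing gap could also come from caching effects, so take one-off `@time` numbers with a grain of salt).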

The fastest I’ve found is:

julia> Base.sum(::typeof(abs2), v::Array{Complex{T},1}) where {T<:Real} = sum(abs2, reinterpret(T, v))

julia> @btime sum(abs2, $v);
  2.283 μs (2 allocations: 80 bytes)
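A caveat on that snippet: overloading Base.sum for types you don't own is type piracy and silently changes behavior for every other caller of sum(abs2, ::Vector{<:Complex}). A non-pirating sketch of the same trick, using a helper name (sumabs2, invented here) instead of an overload:

```julia
# Same reinterpret trick, but in a function we own rather than a
# Base.sum overload (the name `sumabs2` is hypothetical).
sumabs2(v::Vector{Complex{T}}) where {T<:Real} = sum(abs2, reinterpret(T, v))
```

This keeps the speedup local to code that opts in, without altering sum for anyone else.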

#4

I’m guessing the reason why reinterpret-ing as a real vector is faster is that the compiler is free to choose the order in which elements are summed, and this enables effective use of SIMD instructions. Addition is not associative for floating-point numbers, so when abs2 explicitly adds the squared real and imaginary parts pairwise inside the sum reduction, the compiler is not allowed to reorder those additions. (If the two numbers in a pair were of opposite sign, reordering could have a significant effect on the result, but here we know that they are both nonnegative.)