Optimize function with complex square root

mleprovost · February 26, 2021, 12:41am

Hello,

I am writing a routine to store a matrix that contains the values of f(x, z) = \frac{1}{\sqrt{1-x^2}} \times \text{real}\left(\frac{\sqrt{z-1}\sqrt{z+1}}{z-x}\right) for \sim 1000 real values x \in [-1, 1] and \sim 50 complex numbers z.

This computation is a critical part of my code, do you have ideas of how I can improve it?

We cannot simplify \sqrt{z-1}\sqrt{z+1} since z is complex. Otherwise, the branch cut of the square root is no longer on the real axis between -1 and 1.

using LinearAlgebra
using BenchmarkTools

function test(x::Array{Float64,1}, z::Array{ComplexF64,1})
   A = zeros(size(x,1), size(z, 1))   
   @inbounds for (j, zj) in enumerate(z) 
        bj = √(zj-1.0)*√(zj+1.0)
        for (i, xi) in enumerate(x)
            ci = (1.0 - xi^2)^(-0.5)
            A[i,j] = ci*real(bj/(zj-xi))
        end
    end
    return A
end

x = 2*rand(1000) .- 1
z = randn(50) + im*randn(50)

@btime test($x, $z)
989.748 μs (2 allocations: 390.70 KiB)

stevengj · February 26, 2021, 1:21am

Although I can see several possible ways to further optimize this code, some context would be helpful.

What is this for? Is it performance-critical because the matrix is huge or because you are computing it over and over for different x and/or z?

mleprovost · February 26, 2021, 1:48am

The computation arises in a potential flow problem in fluid mechanics. I am interested in the induced effect of a collection of singularities at the locations {z_j} on an infinitely-thin plate materizalized by the interval [-1,1]. Here f(x_i,z_j) computes the induce effect of a point vortex at location z_j at a point x_i along the plate. This computation is used in the forecast step of an ensemble Kalman filter. Therefore, I need to compute this matrix for about 300 (number of ensemble members) different sets of {z_j} values. The function test performs this computation for a distributed set of {x_i} values for only one set of {z_j} values. The set of {x_i} values is always the same. I hope this helps

PeterSimon · February 26, 2021, 4:06am

LoopVectorization produces a huge speedup on my machine. Here is the contents of the file testit.jl:

using LinearAlgebra
using BenchmarkTools
using LoopVectorization

function test(x::Array{Float64,1}, z::Array{ComplexF64,1})
   A = zeros(size(x,1), size(z, 1))   
   @inbounds for (j, zj) in enumerate(z) 
        bj = √(zj-1.0)*√(zj+1.0)
        for (i, xi) in enumerate(x)
            ci = (1.0 - xi^2)^(-0.5)
            A[i,j] = ci*real(bj/(zj-xi))
        end
    end
    return A
end

function test2(x, z)
    a = zeros(length(x),length(z))
    b = [√(t-1)*√(t+1) for t in z]
    c = [1/sqrt(1 - t^2) for t in x]
    @avx for i in axes(a,1), j in axes(a,2)
        a[i,j] = c[i] * real(b[j]/(z[j]-x[i])) 
    end
    a
end

x = 2*rand(1000) .- 1
z = randn(50) + im*randn(50)

versioninfo()
a = @btime test($x,$z)
a2 = @btime test2($x,$z)
@show norm(a-a2)

This produces the following output:

julia> include("testit.jl")
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = runemacs.exe
  JULIA_NUM_THREADS = 8

  947.700 μs (2 allocations: 390.70 KiB)
  66.500 μs (4 allocations: 399.52 KiB)
norm(a - a2) = 9.676938524225372e-14

moeddel · February 26, 2021, 9:01am

I can confirm this. You even get a little better if you do calculations in single precicion.

using BenchmarkTools
using LoopVectorization

function test(x::Vector{T}, z::Vector{Complex{T}}) where T<:Real
    A = zeros(T,length(x), length(z))
    b = map(z->√(z - one(T))*√(z + one(T)),z)
	c = map(x->(one(T) - x^2)^(-0.5),x)
    @avx for j in eachindex(z)
        for i in eachindex(x)
            A[i,j] = c[i]*real(b[j]/(z[j]-x[i]))
        end
     end
     return A
end

T = Float32
x = 2*rand(T,1000) .- 1
z = randn(Complex{T},50)

It boosts from 70 μs to 62 μs on my machine, which is not the fastest. That is if you do not need the numerical accuracy of double precision.

DNF · February 26, 2021, 9:19am

Tullio.jl can help you further, if you have a few cores:

function test2(x, z)
    b = sqrt.(z .- 1) .* sqrt.(z .+ 1)
    c = inv.(sqrt.(1 .- x.^2))
    @tullio A[i, j] := c[i] * real(b[j]/(z[j]-x[i]))
    return A
end

rafael.guerra · February 26, 2021, 12:40pm

For whatever it is worth, it is interesting that this function can be written without loops with adjoints. The @fastmath macro speeds up significantely but not as much as Tullio does:

function test4(x, z)
    b = @. √(z-1)*√(z+1)
    c = @. 1.0/sqrt(1 - x^2)
    @fastmath a = @. c*real(b'/(z' - x)) # using adjoints
    return a
end

stevengj · February 26, 2021, 1:37pm

In that case, you could pre-compute the 1/\sqrt{1-x_i^2} values, and pre-allocate the A matrix.

As others have pointed out, @fastmath helps a lot — the reason is that complex-number division normally requires a bunch of costly checks to eliminate spurious overflows and similar conditions that can arise for extreme inputs (e.g. x[j] = 1e300), which you can safely eliminate if you know that you always have moderate-sized inputs and never divide by zero.

One can also use threads, but probably you are better off using threads at a higher level (e.g. to process different ensemble instances in parallel), so I’m only looking at the single-threaded performance here:

using BenchmarkTools

function test!(A, x, c, z)
    @fastmath @inbounds for j in eachindex(z)
        zj = z[j]
        bj = √(zj-1)*√(zj+1)
        for i in eachindex(x)
            A[i,j] = c[i] * real(bj / (zj - x[i]))
        end
    end
    return A
end

x = 2*rand(1000) .- 1
z = randn(50) + im*randn(50)
A = zeros(size(x,1), size(z, 1))
c = @. inv(sqrt(1 - x^2))

which gives:

julia> @btime test($x, $z);    # your original code
  1.335 ms (2 allocations: 390.70 KiB)

julia> @btime test!($A, $x, $c, $z);    # new code
  34.047 μs (0 allocations: 0 bytes)

for about 40× speedup.

(I kept the explicit loops in order to continue to compute bj = -√(zj-1)*√(zj+1) outside the inner loop without storing it in an additional array.)

I tried @avx from LoopVectorization.jl and it unfortunately gives me an error… it works if I write out real(bj / (zj - x[i])) explicitly in terms of real and imaginary parts, but in that case I get negligible speedup from @avx.

mleprovost · February 26, 2021, 10:34pm

Thank you very much for your help. This is definitly much more efficient.
Is eachindex preferable to enumerate for the performance?
Is there something that prevent me from using @fastmath in my future scripts everty time I am not dealing with very large entries?

moeddel · February 27, 2021, 8:47am

If you want to be sure that @fastmath does what you want, please check out this page, where it is documented, which kind of optimizations @fastmath makes and what kind of assumptions.

Henrique_Becker · February 27, 2021, 1:57pm

eachindex is more generic, enumerate does not give indexes, it gives 1 with the first element, 2 with the second, and so on; if your Array has more than one dimension, or custom indexes (start at 0 for example) then your code will break. eachindex will always give valid indexes, and for N-dimensional arrays, it will give the indexes in the most efficient order for linear access. If you want something that both give you valid indexes and the value then look at pairs documentation and use it instead of enumerate.

I talk a little more about this in my answer of mine.

DNF · February 27, 2021, 3:39pm

I wouldn’t say that. enumerate is generic, works also for non-standard indexed arrays, and (without testing) possibly for non-indexed collections (I should check this).

It just doesn’t do what many people think it does. It doesn’t return indices. It’s unfortunate that enumerate is so widely known, and pairs isn’t, because I suspect it’s very often misused.

I wonder if a linter would be able to catch this. If output from enumerate is used for indexing, it might suggest pairs instead.

Henrique_Becker · February 27, 2021, 5:24pm

Fair enough. I was saying that eachindex is more generic for the purpose of having both index and element, but maybe it would be best to say that eachindex is the correct choice for that purpose and enumerate is designed for another purpose (easily adding an iteration counter on a loop).

Topic		Replies	Views
How to improve performance of sum() Performance question , fast-math , tullio , loopvectorization , sum	19	4867	July 19, 2021
How to optimize computation within vectorized list operation and large array? Performance	16	472	October 21, 2022
Why are calls to this function slow, compared to Python+Numpy? General Usage fast-math	14	2115	August 13, 2017
Optimizing sums of products (dot products) Performance simd , linearalgebra , sum	17	629	September 24, 2024
Performance of a key step in Strang Splitting method Performance pde	10	680	March 17, 2021

Optimize function with complex square root

Related topics