Poor performance of the cis() function

111 · April 27, 2017, 12:18pm

I wrote a program with Julia, but the performance is disappointing.

The program is constructed with some functions, and global variables are prevented, arrays are well preallocated before used, the fft operation is accelerated by setting FFTW.set_num_threads(Sys.CPU_CORES).

No red words are shown when profiling it with @code_warntype. When I use the @profile macro to find out the bottleneck of the code, I find all the most-time-consuming lines share a common word: cis!

Below are these lines:

for j in eachindex(u)
    prop_e[j] = cis( -ele * (xplusy[j] * dt/2.0) ) * prop_x[j] # 2209
end
...

for j2 in 1:ngrid, j1 in 1:ngrid
    uo[j1+Δn, j2+Δn] = u[j1, j2] * splitter[j1, j2] * cis(A[i]*xplusy[j1, j2]) ## 2462
end
...
for j2 in 1:ngrid2, j1 in 1:ngrid2
    uo[j1, j2] *= cis( -dt * (pp[j1] + pp[j2]) )  # 38049
end

the number at the end of each line is the corresponding result of Profile.print() for this line. Note that the total number is 50454.
When I implement the program with Fortran, these lines never much time, but never so much. I wonder if there is anything wrong with my Julia version.

giordano · April 27, 2017, 2:07pm

cis is faster than exp(im * ...), but not particularly optimized, it simply calls Complex(cos(x), sin(x)). This function should be a bit faster:

const A = Ref{Cdouble}()
const B = Ref{Cdouble}()
function my_cis(x::Real)
    ccall((:sincos, Base.Math.libm), Void, (Cdouble, Ptr{Cdouble}, Ptr{Cdouble}), x, A, B)
    return Complex(B[], A[])
end

Some benchmarks:

julia> using BenchmarkTools

julia> g = collect(linspace(-100, 100, 1000000));

julia> @benchmark cis.($g)
BenchmarkTools.Trial: 
  memory estimate:  15.26 MiB
  allocs estimate:  2
  --------------
  minimum time:     27.409 ms (0.00% GC)
  median time:      30.013 ms (0.59% GC)
  mean time:        30.225 ms (2.78% GC)
  maximum time:     87.457 ms (64.57% GC)
  --------------
  samples:          166
  evals/sample:     1

julia> @benchmark my_cis.($g)
BenchmarkTools.Trial: 
  memory estimate:  15.26 MiB
  allocs estimate:  2
  --------------
  minimum time:     19.548 ms (0.00% GC)
  median time:      21.421 ms (0.00% GC)
  mean time:        21.835 ms (3.11% GC)
  maximum time:     79.516 ms (70.58% GC)
  --------------
  samples:          229
  evals/sample:     1

BTW, FFTW.set_num_threads(Sys.CPU_CORES) might not be a good idea, you can run into hyper-threading penalties.

If you want more specific advices, you probably need to provide a more complete example (not the whole program, but just a minimal working code reproducing your issue).

yuyichao · April 27, 2017, 3:08pm

See also https://github.com/JuliaLang/julia/pull/21589 or the original fastmath only sincos implementation https://github.com/nacs-lab/yyc-data/blob/f717b873d8dece11703b90b8c1c03552368cf359/lib/NaCsCalc/src/utils.jl#L115-L141

giordano · April 27, 2017, 4:53pm

Nice. Just to understand: what’s the problem with directly ccalling sincos? Ref is already used in modf, for example.

yuyichao · April 27, 2017, 5:05pm

It’ll allocate memory. As mentioned in the PR, it’ll not be necessary once we can stack allocate those.

yuyichao · April 27, 2017, 5:06pm

And the use of Ref in modf is not thread safe.

111 · April 28, 2017, 2:25am

for i in 1:nt
   step_evolution_real!(u, p_fft!, prop_x, prop_p, ele[i], xplusy, prop_e)

   uo .= 0.0
   for j2 in eachindex(p), j1 in eachindex(p)
      uo[j1+Δn, j2+Δn] = u[j1, j2] * splitter[j1, j2] * cis(A[i]*xplusy[j1, j2])
   end
   for j in eachindex(u)
      u[j] *= 1 - splitter[j]
   end
   p_fft2! * uo
   for j in eachindex(pp)
     pp[j] = ( p2[j]^2 /2.0 ) * (nt-i) + p2[j]*sum(@view A[i:nt]) + um(@viewA²[i:nt])/2.0
   end
   for (j2, pp2) in enumerate(pp), (j1, pp1) in enumerate(pp)
      uo[j1, j2] *= cis( -dt * (pp1 + pp2) )
   end
   for j in eachindex(up)
      up[j] += uo[j]
   end
end

where the function step_evolution_real! is

function step_evolution_real!(u::Array{Complex128, 2},
p_fft!::Base.DFT.FFTW.cFFTWPlan{Complex128, -1, true, 2},
prop_x::Array{Complex128, 2}, prop_p::Array{Complex128, 2},
ele::Float64, xplusy::Array{Float64, 2},
prop_e::Array{Complex128, 2})
for j in eachindex(u)
   prop_e[j] = cis( -ele * (xplusy[j] * dt/2.0) ) * prop_x[j]
end
for j in eachindex(u)
   u[j] *= prop_e[j]
end
p_fft! * u
for j in eachindex(u)
   u[j] *= prop_p[j]
end
p_fft! \ u
for j in eachindex(u)
   u[j] *= prop_e[j]
end
nothing
end

Ralph_Smith · April 28, 2017, 1:44pm

As other posts show, cis() is translated to a scalar function call. I suggest rewriting the expensive loops to use vector math library calls (via VML.jl, Yeppp.jl, etc.). The extra trouble with temporary arrays is often worthwhile in cases like this.

Many Fortran compilers automatically use SIMD vector math libraries for cases like this. There is ongoing work to add such facility to LLVM, so eventually it should work for Julia too.

DNF · April 28, 2017, 2:54pm

Take a look at this thread, where there is a similar discussion, and with examples of using a vector library (AppleAccelerate in that case) to speed up cis.

111 · May 2, 2017, 1:57pm

The VML.jl and Yeppp.jl are both sometimes helpful, but not in my condition. I need the exponential function for complex numbers. As I know these packages do not support complex numbers now.

kristoffer.carlsson · May 2, 2017, 2:33pm

But exp(z) = e^real(z) * (cos(imag(z)) + i * sin(imag(z)))

111 · May 3, 2017, 10:36am

I don’t think of rewritting the cis function so complexly as a good solution, It seems to be even slower.

DNF · May 3, 2017, 1:34pm

Do you need it for complex numbers, or for imaginary numbers?

It may not help you, but AppleAccelerate.jl has both an optimized cis and cis! function which will work for imaginary numbers (by than I mean that you input imag(z)).

111 · May 3, 2017, 2:06pm

I need the cis(x::Float64) function. AppleAccelerate.jl is a good package, but I am a windows user.

Topic		Replies	Views
@btime of cis Function slower than writing it again Numerics performance , benchmarktools	2	396	August 25, 2022
`cispi` is not accurate General Usage	6	595	July 6, 2022
Simple speed comparison with Matlab New to Julia question	33	1261	May 6, 2025
Slow complex exponential evaluation Numerics	8	2399	April 11, 2017
Cosine seems slow Performance	14	1805	November 27, 2019

Poor performance of the cis() function

Related topics