Poor performance of the cis() function

performance

#1

I wrote a program with Julia, but the performance is disappointing.

The program is constructed with some functions, and global variables are prevented, arrays are well preallocated before used, the fft operation is accelerated by setting FFTW.set_num_threads(Sys.CPU_CORES).

No red words are shown when profiling it with @code_warntype. When I use the @profile macro to find out the bottleneck of the code, I find all the most-time-consuming lines share a common word: cis!

Below are these lines:

for j in eachindex(u)
    prop_e[j] = cis( -ele * (xplusy[j] * dt/2.0) ) * prop_x[j] # 2209
end
...

for j2 in 1:ngrid, j1 in 1:ngrid
    uo[j1+Δn, j2+Δn] = u[j1, j2] * splitter[j1, j2] * cis(A[i]*xplusy[j1, j2]) ## 2462
end
...
for j2 in 1:ngrid2, j1 in 1:ngrid2
    uo[j1, j2] *= cis( -dt * (pp[j1] + pp[j2]) )  # 38049
end

the number at the end of each line is the corresponding result of Profile.print() for this line. Note that the total number is 50454.
When I implement the program with Fortran, these lines never much time, but never so much. I wonder if there is anything wrong with my Julia version.


#2

cis is faster than exp(im * ...), but not particularly optimized, it simply calls Complex(cos(x), sin(x)). This function should be a bit faster:

const A = Ref{Cdouble}()
const B = Ref{Cdouble}()
function my_cis(x::Real)
    ccall((:sincos, Base.Math.libm), Void, (Cdouble, Ptr{Cdouble}, Ptr{Cdouble}), x, A, B)
    return Complex(B[], A[])
end

Some benchmarks:

julia> using BenchmarkTools

julia> g = collect(linspace(-100, 100, 1000000));

julia> @benchmark cis.($g)
BenchmarkTools.Trial: 
  memory estimate:  15.26 MiB
  allocs estimate:  2
  --------------
  minimum time:     27.409 ms (0.00% GC)
  median time:      30.013 ms (0.59% GC)
  mean time:        30.225 ms (2.78% GC)
  maximum time:     87.457 ms (64.57% GC)
  --------------
  samples:          166
  evals/sample:     1

julia> @benchmark my_cis.($g)
BenchmarkTools.Trial: 
  memory estimate:  15.26 MiB
  allocs estimate:  2
  --------------
  minimum time:     19.548 ms (0.00% GC)
  median time:      21.421 ms (0.00% GC)
  mean time:        21.835 ms (3.11% GC)
  maximum time:     79.516 ms (70.58% GC)
  --------------
  samples:          229
  evals/sample:     1

BTW, FFTW.set_num_threads(Sys.CPU_CORES) might not be a good idea, you can run into hyper-threading penalties.

If you want more specific advices, you probably need to provide a more complete example (not the whole program, but just a minimal working code reproducing your issue).


#3

See also https://github.com/JuliaLang/julia/pull/21589 or the original fastmath only sincos implementation https://github.com/nacs-lab/yyc-data/blob/f717b873d8dece11703b90b8c1c03552368cf359/lib/NaCsCalc/src/utils.jl#L115-L141


#4

Nice. Just to understand: what’s the problem with directly ccalling sincos? Ref is already used in modf, for example.


#5

It’ll allocate memory. As mentioned in the PR, it’ll not be necessary once we can stack allocate those.


#6

And the use of Ref in modf is not thread safe.


#7
for i in 1:nt
   step_evolution_real!(u, p_fft!, prop_x, prop_p, ele[i], xplusy, prop_e)

   uo .= 0.0
   for j2 in eachindex(p), j1 in eachindex(p)
      uo[j1+Δn, j2+Δn] = u[j1, j2] * splitter[j1, j2] * cis(A[i]*xplusy[j1, j2])
   end
   for j in eachindex(u)
      u[j] *= 1 - splitter[j]
   end
   p_fft2! * uo
   for j in eachindex(pp)
     pp[j] = ( p2[j]^2 /2.0 ) * (nt-i) + p2[j]*sum(@view A[i:nt]) + um(@viewA²[i:nt])/2.0
   end
   for (j2, pp2) in enumerate(pp), (j1, pp1) in enumerate(pp)
      uo[j1, j2] *= cis( -dt * (pp1 + pp2) )
   end
   for j in eachindex(up)
      up[j] += uo[j]
   end
end

where the function step_evolution_real! is

function step_evolution_real!(u::Array{Complex128, 2},
p_fft!::Base.DFT.FFTW.cFFTWPlan{Complex128, -1, true, 2},
prop_x::Array{Complex128, 2}, prop_p::Array{Complex128, 2},
ele::Float64, xplusy::Array{Float64, 2},
prop_e::Array{Complex128, 2})
for j in eachindex(u)
   prop_e[j] = cis( -ele * (xplusy[j] * dt/2.0) ) * prop_x[j]
end
for j in eachindex(u)
   u[j] *= prop_e[j]
end
p_fft! * u
for j in eachindex(u)
   u[j] *= prop_p[j]
end
p_fft! \ u
for j in eachindex(u)
   u[j] *= prop_e[j]
end
nothing
end

#8

As other posts show, cis() is translated to a scalar function call. I suggest rewriting the expensive loops to use vector math library calls (via VML.jl, Yeppp.jl, etc.). The extra trouble with temporary arrays is often worthwhile in cases like this.

Many Fortran compilers automatically use SIMD vector math libraries for cases like this. There is ongoing work to add such facility to LLVM, so eventually it should work for Julia too.


#9

Take a look at this thread, where there is a similar discussion, and with examples of using a vector library (AppleAccelerate in that case) to speed up cis.


#10

The VML.jl and Yeppp.jl are both sometimes helpful, but not in my condition. I need the exponential function for complex numbers. As I know these packages do not support complex numbers now.


#11

But exp(z) = e^real(z) * (cos(imag(z)) + i * sin(imag(z)))


#12

I don’t think of rewritting the cis function so complexly as a good solution, It seems to be even slower.


#13

Do you need it for complex numbers, or for imaginary numbers?

It may not help you, but AppleAccelerate.jl has both an optimized cis and cis! function which will work for imaginary numbers (by than I mean that you input imag(z)).


#14

I need the cis(x::Float64) function. AppleAccelerate.jl is a good package, but I am a windows user.