I wrote a program with Julia, but the performance is disappointing.

The program is constructed with some functions, and global variables are prevented, arrays are well preallocated before used, the fft operation is accelerated by setting FFTW.set_num_threads(Sys.CPU_CORES).

No red words are shown when profiling it with @code_warntype. When I use the @profile macro to find out the bottleneck of the code, I find all the most-time-consuming lines share a common word: cis!

Below are these lines:

for j in eachindex(u)
prop_e[j] = cis( -ele * (xplusy[j] * dt/2.0) ) * prop_x[j] # 2209
end
...
for j2 in 1:ngrid, j1 in 1:ngrid
uo[j1+Δn, j2+Δn] = u[j1, j2] * splitter[j1, j2] * cis(A[i]*xplusy[j1, j2]) ## 2462
end
...
for j2 in 1:ngrid2, j1 in 1:ngrid2
uo[j1, j2] *= cis( -dt * (pp[j1] + pp[j2]) ) # 38049
end

the number at the end of each line is the corresponding result of Profile.print() for this line. Note that the total number is 50454.
When I implement the program with Fortran, these lines never much time, but never so much. I wonder if there is anything wrong with my Julia version.

cis is faster than exp(im * ...), but not particularly optimized, it simply calls Complex(cos(x), sin(x)). This function should be a bit faster:

const A = Ref{Cdouble}()
const B = Ref{Cdouble}()
function my_cis(x::Real)
ccall((:sincos, Base.Math.libm), Void, (Cdouble, Ptr{Cdouble}, Ptr{Cdouble}), x, A, B)
return Complex(B[], A[])
end

Some benchmarks:

julia> using BenchmarkTools
julia> g = collect(linspace(-100, 100, 1000000));
julia> @benchmark cis.($g)
BenchmarkTools.Trial:
memory estimate: 15.26 MiB
allocs estimate: 2
--------------
minimum time: 27.409 ms (0.00% GC)
median time: 30.013 ms (0.59% GC)
mean time: 30.225 ms (2.78% GC)
maximum time: 87.457 ms (64.57% GC)
--------------
samples: 166
evals/sample: 1
julia> @benchmark my_cis.($g)
BenchmarkTools.Trial:
memory estimate: 15.26 MiB
allocs estimate: 2
--------------
minimum time: 19.548 ms (0.00% GC)
median time: 21.421 ms (0.00% GC)
mean time: 21.835 ms (3.11% GC)
maximum time: 79.516 ms (70.58% GC)
--------------
samples: 229
evals/sample: 1

BTW, FFTW.set_num_threads(Sys.CPU_CORES) might not be a good idea, you can run into hyper-threading penalties.

If you want more specific advices, you probably need to provide a more complete example (not the whole program, but just a minimal working code reproducing your issue).

for i in 1:nt
step_evolution_real!(u, p_fft!, prop_x, prop_p, ele[i], xplusy, prop_e)
uo .= 0.0
for j2 in eachindex(p), j1 in eachindex(p)
uo[j1+Δn, j2+Δn] = u[j1, j2] * splitter[j1, j2] * cis(A[i]*xplusy[j1, j2])
end
for j in eachindex(u)
u[j] *= 1 - splitter[j]
end
p_fft2! * uo
for j in eachindex(pp)
pp[j] = ( p2[j]^2 /2.0 ) * (nt-i) + p2[j]*sum(@view A[i:nt]) + um(@viewA²[i:nt])/2.0
end
for (j2, pp2) in enumerate(pp), (j1, pp1) in enumerate(pp)
uo[j1, j2] *= cis( -dt * (pp1 + pp2) )
end
for j in eachindex(up)
up[j] += uo[j]
end
end

where the function step_evolution_real! is

function step_evolution_real!(u::Array{Complex128, 2},
p_fft!::Base.DFT.FFTW.cFFTWPlan{Complex128, -1, true, 2},
prop_x::Array{Complex128, 2}, prop_p::Array{Complex128, 2},
ele::Float64, xplusy::Array{Float64, 2},
prop_e::Array{Complex128, 2})
for j in eachindex(u)
prop_e[j] = cis( -ele * (xplusy[j] * dt/2.0) ) * prop_x[j]
end
for j in eachindex(u)
u[j] *= prop_e[j]
end
p_fft! * u
for j in eachindex(u)
u[j] *= prop_p[j]
end
p_fft! \ u
for j in eachindex(u)
u[j] *= prop_e[j]
end
nothing
end

As other posts show, cis() is translated to a scalar function call. I suggest rewriting the expensive loops to use vector math library calls (via VML.jl, Yeppp.jl, etc.). The extra trouble with temporary arrays is often worthwhile in cases like this.

Many Fortran compilers automatically use SIMD vector math libraries for cases like this. There is ongoing work to add such facility to LLVM, so eventually it should work for Julia too.

Take a look at this thread, where there is a similar discussion, and with examples of using a vector library (AppleAccelerate in that case) to speed up cis.

The VML.jl and Yeppp.jl are both sometimes helpful, but not in my condition. I need the exponential function for complex numbers. As I know these packages do not support complex numbers now.

Do you need it for complex numbers, or for imaginary numbers?

It may not help you, but AppleAccelerate.jl has both an optimized cis and cis! function which will work for imaginary numbers (by than I mean that you input imag(z)).