Benchmarking a simple PDE algorithm in Julia, Python, Matlab, C++, and Fortran


#1

As a warm-up and preliminary test for writing a distributed-memory fluids code in Julia, I did some benchmarking of Julia on a simple numerical integration algorithm for the 1d Kuramoto-Sivashinky equation.

Briefly, the results show that in execution speed, Julia beats Python and Matlab significantly and is competitive with C++ and Fortran. In line count versus speed, it hits the sweet spot: almost as compact as Python (~20 lines), almost as fast as Fortran (~60 lines).

The 1d Kuramoto-Sivashinky (KS) equation is

u_t + u_{xx} + u_{xxxx}+ u u_x = 0

where x is space and t is time and subscripts indicate differentiation. The algorithm assumes a finite periodic domain, discretizes space with a Fourier decomposition and time with 2nd-order Crank-Nicolson, Adams-Bashforth finite-difference time-stepping. The implementations in all languages use the same FFTW library so the benchmark measures language-specific overhead for things like bounds-checking and allocation of temporary arrays. The equation and algorithm are a sort of super-reduced toy model of Navier-Stokes and its algorithms.

I’m eager for feedback from more knowledgable Julia people, and I could use some help implementing the algorithm in Numba-Python, Dedalus-Python, and DifferentialEquations.jl & ApproxFun.jl. I could also use help figuring out why the Fortran code seg faults for N_x \geq 8192.


#2

For some reason this works
better (runs faster) than the unrolled loops produced by the Julia compiler.

Yes, Nu .= u is still slower than you’d expect, though copy!(Nu,u) does about as well as the unrolled loop. This is because broadcast still has a cost, which hopefully gets improved in the future.

A DiffEq Crank-Nicolson Adams-Bashforth is a ways off and there are other things which have higher priority right now. But I’ll revisit this later. Thanks for making this.


#3

One further way you can improve the benchmark is that, now that it’s all unrolled, add Threads.@threads to each of the loops (make sure you export JULIA_NUM_THREADS=n before opening Julia: https://docs.julialang.org/en/stable/manual/parallel-computing/#Setup-1). Then both the loops and the FFTs will be multithreaded. It’s not cheating because MATLAB implicitly multithreads its vectorized computations, so this is just putting Julia in a fair place.

Also, make sure Julia is using -O3.


#4

Thanks for the tips. I didn’t know about “julia -O3”! I’ll try that and copy!(Nu,u) immediately.

All the computations are single-core, including matlab which I throttled down. I’ll highlight that in the notebook.


#5

Actually, copy! doesn’t seem to be making much of a difference anymore. Not sure why.

Edit

This is what I’m getting on v0.6:

julia> A = rand(1000,1000); B = rand(1000,1000);

julia> using BenchmarkTools

julia> f(A,B) = A.=B
f (generic function with 1 method)

julia> g(A,B) = copy!(A,B)
g (generic function with 1 method)

julia> @benchmark f($A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     453.456 μs (0.00% GC)
  median time:      511.418 μs (0.00% GC)
  mean time:        556.719 μs (0.00% GC)
  maximum time:     1.094 ms (0.00% GC)
  --------------
  samples:          8888
  evals/sample:     1

julia> @benchmark g($A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     452.285 μs (0.00% GC)
  median time:      488.292 μs (0.00% GC)
  mean time:        518.677 μs (0.00% GC)
  maximum time:     1.471 ms (0.00% GC)
  --------------
  samples:          9542
  evals/sample:     1

julia> h(A,B) = @inbounds for i in eachindex(A); A[i]=B[i]; end
h (generic function with 1 method)

julia> @benchmark h($A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     822.894 μs (0.00% GC)
  median time:      889.348 μs (0.00% GC)
  mean time:        931.553 μs (0.00% GC)
  maximum time:     1.848 ms (0.00% GC)
  --------------
  samples:          5323
  evals/sample:     1

julia> k(A,B) = @inbounds for i in 1:length(A); A[i]=B[i]; end
k (generic function with 1 method)

julia> @benchmark k($A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     823.480 μs (0.00% GC)
  median time:      891.397 μs (0.00% GC)
  mean time:        938.289 μs (0.00% GC)
  maximum time:     1.653 ms (0.00% GC)
  --------------
  samples:          5286
  evals/sample:     1

So either 1-liner should be faster than the loop unless you multithread it. But avoiding the one with loop fusion will be faster.


#6

Thanks for the optimization suggestions. Running julia as “julia -O3” shaved 5% or so off the Julia timings and got Julia-inplace below C++ and Julia-unrolled below Fortran. I’ve updated the notebook to reflect.


#7

Great benchmark study! (And not just because Julia does so well – it’s very thorough and beautifully explained and plotted.)


#8

Try ulimit -s unlimited in the shell before running the Fortran code. In my experience Fortran compilers like to put temporary/local arrays on the stack, but the default stack size is quite limited (8 MiB) on most systems nowadays. But I admit I haven’t looked at your code.


#9

That is a nice benchmark indeed.
I tried to open the link of the Julia ‘unrolled’ version. The hyperlink in the notebook points here https://render.githubusercontent.com/view/codes/ksintegrateUnrolled.jl

which is a file that does not exist.

I later realized, that simply ‘clicking’ the hyperlink opens the right version of the code on Github. So I found the code.

However, I am very much used to ‘Open in new tab’ or download a file through right click, both of which failed for me on this link.
Any idea why this is not working? Is this a general issue with Jupyter notebooks (I have no experience with Jupyter).

Apologies for the off-topic question.


#10

Tried it, but it didn’t work. After compiling and running with ulimit -s unlimited and Nx=8192, the stack trace in gdb is

gibson@sophist$ gdb ksbenchmark-f90 
...
Program received signal SIGSEGV, Segmentation fault.
0x00002aaaaadc9153 in dfftw_execute_ () from /usr/lib64/libfftw3.so.3
(gdb) where
#0  0x00002aaaaadc9153 in dfftw_execute_ () from /usr/lib64/libfftw3.so.3
#1  0x00000000004010b0 in ksintegrate (nt=3200, 
    u=<error reading variable: value requires 131072 bytes, which is more than max-value-size>) at ksbenchmark.f90:118
#2  0x00000000004018bb in MAIN__ () at ksbenchmark.f90:55
(gdb) 

Clearly a memory error. I guess an Nx=8192 complex double-precision array totals 8192 * 2 * 8 = 131072 bytes. I suppose I’ll try to understand max-value-size in Fortran.

I did revise the Fortran code to use heap-allocated arrays but got the same error --which is hard to understand. I’ll double-check that and post the revised code.


#11

Thank you! Links in these documents are not direct, but typically route a file specified in the URL through a “viewer” webservice that reads the file, translates it to nicely formatted HTML, and displays that.

For example, the URL http://nbviewer.jupyter.org/url/www.channelflow.org/iam961/iam961-hw3.ipynb reads the file http://www.channelflow.org/iam961/iam961-hw3.ipynb and routes it through the nbviewer ipynb renderer at nbviewer.jupyter.org. I believe the Markdown links work similarly.


#12

I’m curious to see a plot of cpu time/sloc.


#13

Good idea, will add to notebook. Define speed as 1/cputime, then speed/linecount = 1/(cputime*linecount). Plot normalized so C++ = 1.


#14

RIght, cpu time/sloc was actually a bit silly :smiley:

I’m more and more convinced that Julia doesn’t necessarily win the pure speed race, but usually outperforms other languages when taking into account the time you spend writing the code (very roughly measured with the number of lines you have to have to write).


#15

I think if you write Julia like it’s C (that is, using pointer, etc.) then it can tie with C and Fortran.

No one writes Julia code like that, and for good reason, but it’s not fair to say Julia doesn’t “win”, since it’s more a statement about coding style than the language itself.


#16

Well of course Julia doesn’t necessarily win. There’s a way to rewrite the C or Fortran algorithms here to get it as fast as the Julia version. I don’t think anyone would argue that. But I think it’s safe to say that Julia’s optimizations may actually beat a lot of scientist’s ability to “hand optimize” C/Fortran code. And what you get in return is something that is much more concise which then makes it easier to add all the bells and whistles. You can probably dig into the Fortran code here and pull out another 10%, but in the Julia version you can add Threads.@threads and maybe add PI adaptive timestepping in 20 lines and end up with something orders of magnitude faster, still ending up with more concise code. That’s why Julia does well.

And actually, if Julia gets an @noalias macro for creating local scopes which guarantee arrays aren’t aliasing (or something like volatile), Julia could probably match the last few compiler optimizations which require the absence of aliasing. At that point it probably comes down to which compiler is used more than anything else, and gcc vs clang/LLVM seems to be a toss up: usually very similar, but in some problems one or the other “wins”, but you can never guess which one (and when asking people to recompile whatever reference C code via clang and checking the timing difference actually accounts for most benchmark variation from what I’ve seen, i.e. the difference between C and Julia usually is mostly measuring a difference between gcc and clang on that algorithm).


#17

It would be interesting to have benchmarks for ETDRK4, which is designed for stiff PDEs such as KS. The special functions that are required (at least with periodic boundary conditions) could be extracted from here:

and an ApproxFun-style code that could be optimized for non-adaptivity is here:

Related references:

S. M. Cox and P. C. Matthews. Exponential time differencing for stiff systems. J. Comp. Phys., 176:430–455, 2002.
A.-K. Kassam and L. N. Trefethen. Fourth-order time-stepping for stiff PDEs. SIAM J. Sci. Comput., 26:1214–1233, 2005.
H. Montanelli and N. Bootland. Solving periodic semilinear stiff PDEs in 1D, 2D, and 3D with exponential integrators. arXiv:1604.08900, 2016.


#18

To do exponential integrators well we really need to up our game with expmv methods. The following two libraries exist:

The first actually uses a dense normest2 so it’s not the true Higham algorithm yet and that slows it down. The latter uses a Krylov-based method, but doesn’t have all of the phi function calculations yet which are necessary for the high order methods like ETDRK4.

Though with L coming from a spectral discretization it’s probably small and dense so it might not need these tools as much? But the lack of such tools (i.e. implement a way to choose between dense and these different expmvs) is what is slowing down the generic DiffEq implementation of exponential integrators.


#19

I’m guessing no: the point of spectral algorithms is that the matrix is dense, but the matvecs are O(n log n) because of FFTs. So what it really needs is a matrix-free interface where only A*x is used.


#20

While the FFT is matrix-free, the exponential integrator is not matrix-free since exp(A*t) is a dense operator which is unknown but necessary. That’s why expmv(A,dt,v) = exp(A*dt)*v is important for any sufficiently large problem to be solved with an exponential integrator.

When n is small, exp(A*dt) can be cached. Most FEM/FDM discretizations are not in this domain for sufficiently difficult problems, but spectral discretizations may be in this domain? If they are, that simplifies things a bit.

One question I have though. Does \ for an FFT call the inverse FFT? If so, that would be nice because then the algorithm can be more easily written on a generic A than what’s done in this benchmark. Otherwise we’d need a good way to get inv(A) from the user for the specific case of FFTs.