Benchmarking a simple PDE algorithm in Julia, Python, Matlab, C++, and Fortran

As a warm-up and preliminary test for writing a distributed-memory fluids code in Julia, I did some benchmarking of Julia on a simple numerical integration algorithm for the 1d Kuramoto-Sivashinsky equation.

https://github.com/johnfgibson/julia-pde-benchmark/

Briefly, the results show that in execution speed, Julia beats Python and Matlab significantly and is competitive with C++ and Fortran. In line count versus speed, it hits the sweet spot: almost as compact as Python (~20 lines), almost as fast as Fortran (~60 lines).

The 1d Kuramoto-Sivashinsky (KS) equation is

u_t + u_{xx} + u_{xxxx} + u u_x = 0

where x is space, t is time, and subscripts indicate differentiation. The algorithm assumes a finite periodic domain, discretizes space with a Fourier decomposition, and discretizes time with 2nd-order Crank-Nicolson/Adams-Bashforth (CNAB2) finite-difference time-stepping. The implementations in all languages use the same FFTW library, so the benchmark measures language-specific overhead for things like bounds-checking and allocation of temporary arrays. The equation and algorithm are a sort of super-reduced toy model of Navier-Stokes and its algorithms.
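For readers who just want the gist, here is a minimal sketch of the scheme in Julia. It is not the repository's code verbatim: the function name and variables are my own, and it assumes the KS equation rewritten as u_t = -u_{xx} - u_{xxxx} - (1/2)(u^2)_x, with Crank-Nicolson on the linear terms and Adams-Bashforth on the nonlinear term, all in Fourier space.

using FFTW   # fft/ifft (these were part of Base in the Julia 0.6 era)

# Minimal CNAB2 sketch for the KS equation on a periodic domain [0, Lx).
function ks_cnab2(u, Lx, dt, Nt)
    Nx = length(u)
    kx = vcat(0:Nx÷2-1, 0, -Nx÷2+1:-1)   # integer wavenumbers (Nyquist zeroed)
    alpha = 2π .* kx ./ Lx               # real wavenumbers
    L = alpha.^2 .- alpha.^4             # Fourier symbol of -∂x² - ∂x⁴
    G = -0.5im .* alpha                  # so that G .* fft(u.^2) = fft of -u*u_x

    B    = 1 .+ (dt/2) .* L              # explicit Crank-Nicolson factor
    Ainv = 1 ./ (1 .- (dt/2) .* L)       # implicit Crank-Nicolson factor, inverted

    v   = fft(u)                         # solution in Fourier space
    Nn  = G .* fft(u .* u)               # nonlinear term at step n
    Nn1 = copy(Nn)                       # nonlinear term at step n-1

    for n in 1:Nt
        Nn1, Nn = Nn, G .* fft(real(ifft(v)).^2)
        v = Ainv .* (B .* v .+ (3dt/2) .* Nn .- (dt/2) .* Nn1)
    end
    real(ifft(v))
end

Initializing Nn1 = Nn makes the first step effectively a lower-order startup step, which is the usual way to bootstrap Adams-Bashforth.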

I'm eager for feedback from more knowledgeable Julia people, and I could use some help implementing the algorithm in Numba-Python, Dedalus-Python, and DifferentialEquations.jl & ApproxFun.jl. I could also use help figuring out why the Fortran code segfaults for Nx ≥ 8192.

21 Likes

For some reason this works better (runs faster) than the unrolled loops produced by the Julia compiler.

Yes, Nu .= u is still slower than you'd expect, though copy!(Nu,u) does about as well as the unrolled loop. This is because broadcast still has a cost, which will hopefully be improved in the future.

A DiffEq Crank-Nicolson Adams-Bashforth is a ways off, and there are other things which have higher priority right now. But I'll revisit this later. Thanks for making this.

3 Likes

One further way you can improve the benchmark: now that it's all unrolled, add Threads.@threads to each of the loops (make sure you export JULIA_NUM_THREADS=n before starting Julia: https://docs.julialang.org/en/stable/manual/parallel-computing/#Setup-1). Then both the loops and the FFTs will be multithreaded. It's not cheating, because MATLAB implicitly multithreads its vectorized computations, so this just puts Julia on a fair footing.

Also, make sure Julia is using -O3.
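To make that concrete, here is a hypothetical sketch; the function and array names are illustrative, not taken from the benchmark code:

# Start Julia with the desired thread count and optimization level, e.g.
#   JULIA_NUM_THREADS=4 julia -O3

# A hypothetical unrolled loop from the time-stepping kernel,
# parallelized with Threads.@threads.
function nonlinear!(Nu, u, ux)
    Threads.@threads for i in eachindex(Nu)
        @inbounds Nu[i] = u[i] * ux[i]
    end
    return Nu
end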

2 Likes

Thanks for the tips. I didn't know about "julia -O3"! I'll try that and copy!(Nu,u) immediately.

All the computations are single-core, including Matlab, which I throttled down to a single core. I'll highlight that in the notebook.

2 Likes

Actually, copy! doesn't seem to be making much of a difference anymore. Not sure why.

Edit

This is what I'm getting on v0.6:

julia> A = rand(1000,1000); B = rand(1000,1000);

julia> using BenchmarkTools

julia> f(A,B) = A.=B
f (generic function with 1 method)

julia> g(A,B) = copy!(A,B)
g (generic function with 1 method)

julia> @benchmark f($A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     453.456 μs (0.00% GC)
  median time:      511.418 μs (0.00% GC)
  mean time:        556.719 μs (0.00% GC)
  maximum time:     1.094 ms (0.00% GC)
  --------------
  samples:          8888
  evals/sample:     1

julia> @benchmark g($A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     452.285 μs (0.00% GC)
  median time:      488.292 μs (0.00% GC)
  mean time:        518.677 μs (0.00% GC)
  maximum time:     1.471 ms (0.00% GC)
  --------------
  samples:          9542
  evals/sample:     1

julia> h(A,B) = @inbounds for i in eachindex(A); A[i]=B[i]; end
h (generic function with 1 method)

julia> @benchmark h($A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     822.894 μs (0.00% GC)
  median time:      889.348 μs (0.00% GC)
  mean time:        931.553 μs (0.00% GC)
  maximum time:     1.848 ms (0.00% GC)
  --------------
  samples:          5323
  evals/sample:     1

julia> k(A,B) = @inbounds for i in 1:length(A); A[i]=B[i]; end
k (generic function with 1 method)

julia> @benchmark k($A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     823.480 μs (0.00% GC)
  median time:      891.397 μs (0.00% GC)
  mean time:        938.289 μs (0.00% GC)
  maximum time:     1.653 ms (0.00% GC)
  --------------
  samples:          5286
  evals/sample:     1

So either one-liner should be faster than the loop unless you multithread it. And of the two, copy!, which avoids the broadcast/loop-fusion machinery, is slightly faster.

Thanks for the optimization suggestions. Running Julia as "julia -O3" shaved 5% or so off the Julia timings and got Julia-inplace below C++ and Julia-unrolled below Fortran. I've updated the notebook to reflect this.

3 Likes

Great benchmark study! (And not just because Julia does so well; it's very thorough and beautifully explained and plotted.)

6 Likes

Try ulimit -s unlimited in the shell before running the Fortran code. In my experience, Fortran compilers like to put temporary/local arrays on the stack, but the default stack size is quite limited (8 MiB) on most systems nowadays. But I admit I haven't looked at your code.

That is a nice benchmark indeed.
I tried to open the link to the Julia 'unrolled' version. The hyperlink in the notebook points here:

https://render.githubusercontent.com/view/codes/ksintegrateUnrolled.jl

which is a file that does not exist.

I later realized that simply 'clicking' the hyperlink opens the right version of the code on GitHub. So I found the code.

However, I am very much used to 'Open in new tab' or downloading a file through right-click, both of which failed for me on this link.
Any idea why this is not working? Is this a general issue with Jupyter notebooks (I have no experience with Jupyter)?

Apologies for the off-topic question.

Tried it, but it didn't work. After compiling and running with ulimit -s unlimited and Nx=8192, the stack trace in gdb is

gibson@sophist$ gdb ksbenchmark-f90 
...
Program received signal SIGSEGV, Segmentation fault.
0x00002aaaaadc9153 in dfftw_execute_ () from /usr/lib64/libfftw3.so.3
(gdb) where
#0  0x00002aaaaadc9153 in dfftw_execute_ () from /usr/lib64/libfftw3.so.3
#1  0x00000000004010b0 in ksintegrate (nt=3200, 
    u=<error reading variable: value requires 131072 bytes, which is more than max-value-size>) at ksbenchmark.f90:118
#2  0x00000000004018bb in MAIN__ () at ksbenchmark.f90:55
(gdb) 

Clearly a memory error. I guess an Nx=8192 complex double-precision array totals 8192 * 2 * 8 = 131072 bytes. I suppose I'll try to understand gdb's max-value-size setting.

I did revise the Fortran code to use heap-allocated arrays but got the same error, which is hard to understand. I'll double-check that and post the revised code.

Thank you! Links in these documents are not direct, but typically route a file specified in the URL through a "viewer" webservice that reads the file, translates it to nicely formatted HTML, and displays that.

For example, the URL http://nbviewer.jupyter.org/url/www.channelflow.org/iam961/iam961-hw3.ipynb reads the file iam961-hw3.ipynb from channelflow.org and routes it through the nbviewer ipynb renderer at nbviewer.jupyter.org. I believe the Markdown links work similarly.

I'm curious to see a plot of cpu time/sloc.

Good idea, I'll add it to the notebook. Define speed as 1/cputime; then speed/linecount = 1/(cputime * linecount). I'll plot it normalized so that C++ = 1.
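Something like this sketch, where the numbers are placeholders for illustration, not the measured results:

# Placeholder numbers for illustration only -- not the actual benchmark data.
cputime   = Dict("Python" => 10.0, "Julia" => 1.1, "C++" => 1.0)
linecount = Dict("Python" => 20,   "Julia" => 30,  "C++" => 100)

# speed/linecount = 1/(cputime*linecount), normalized so that C++ = 1
score(lang) = 1 / (cputime[lang] * linecount[lang])
normalized = Dict(lang => score(lang) / score("C++") for lang in keys(cputime))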

3 Likes

Right, cpu time/sloc was actually a bit silly :smiley:

I'm more and more convinced that Julia doesn't necessarily win the pure speed race, but usually outperforms other languages when taking into account the time you spend writing the code (very roughly measured by the number of lines you have to write).

4 Likes

I think if you write Julia like it's C (that is, using pointer, etc.) then it can tie with C and Fortran.

No one writes Julia code like that, and for good reason, but it's not fair to say Julia doesn't "win", since it's more a statement about coding style than the language itself.

1 Like

Well of course Julia doesn't necessarily win. There's a way to rewrite the C or Fortran algorithms here to get them as fast as the Julia version; I don't think anyone would argue that. But I think it's safe to say that Julia's optimizations may actually beat many scientists' ability to "hand optimize" C/Fortran code. And what you get in return is something much more concise, which then makes it easier to add all the bells and whistles. You can probably dig into the Fortran code here and pull out another 10%, but in the Julia version you can add Threads.@threads, and maybe add PI-adaptive timestepping in 20 lines, and end up with something orders of magnitude faster while still being more concise. That's why Julia does well.

And actually, if Julia gets an @noalias macro for creating local scopes which guarantee arrays aren't aliasing (or something like restrict), Julia could probably match the last few compiler optimizations which require the absence of aliasing. At that point it probably comes down to which compiler is used more than anything else, and gcc vs clang/LLVM seems to be a toss-up: usually very similar, but on some problems one or the other "wins", and you can never guess which one. From what I've seen, asking people to recompile whatever reference C code via clang and checking the timing difference actually accounts for most benchmark variation; i.e., the difference between C and Julia is usually mostly measuring a difference between gcc and clang on that algorithm.

7 Likes

It would be interesting to have benchmarks for ETDRK4, which is designed for stiff PDEs such as KS. The special functions that are required (at least with periodic boundary conditions) could be extracted from here:

https://github.com/JuliaApproximation/ApproxFun.jl/blob/master/src/Extras/specialfunctions.jl#L648

and an ApproxFun-style code that could be optimized for non-adaptivity is here:

https://github.com/JuliaApproximation/SpectralTimeStepping.jl/blob/master/src/timeevolution.jl#L53
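For anyone who wants to try it, here is a self-contained Julia sketch of ETDRK4 for the KS equation, following the contour-integral evaluation of the coefficients from the Kassam & Trefethen paper cited below; the names and domain setup are my own, not taken from either linked package:

using FFTW, Statistics

# ETDRK4 sketch for u_t = -u_xx - u_xxxx - u*u_x on a periodic domain [0, Lx).
# The scalar coefficients Q, f1, f2, f3 are evaluated stably by averaging over
# M points on a unit circle around each h*L[i] (Kassam & Trefethen 2005).
function ks_etdrk4(u, Lx, h, Nt)
    N = length(u)
    k = vcat(0:N÷2-1, 0, -N÷2+1:-1) .* (2π/Lx)   # wavenumbers
    L = k.^2 .- k.^4                             # Fourier symbol of -∂x² - ∂x⁴
    E, E2 = exp.(h .* L), exp.(h .* L ./ 2)

    M = 16                                       # contour points
    r = exp.(1im*π .* ((1:M) .- 0.5) ./ M)       # points on the upper half circle
    LR = h .* L .+ transpose(r)                  # N×M shifted evaluation points
    Q  = h .* real.(vec(mean((exp.(LR ./ 2) .- 1) ./ LR, dims=2)))
    f1 = h .* real.(vec(mean((-4 .- LR .+ exp.(LR) .* (4 .- 3LR .+ LR.^2)) ./ LR.^3, dims=2)))
    f2 = h .* real.(vec(mean((2 .+ LR .+ exp.(LR) .* (-2 .+ LR)) ./ LR.^3, dims=2)))
    f3 = h .* real.(vec(mean((-4 .- 3LR .- LR.^2 .+ exp.(LR) .* (4 .- LR)) ./ LR.^3, dims=2)))

    g = -0.5im .* k                              # nonlinear term: -u u_x
    nl(v) = g .* fft(real(ifft(v)).^2)           # N(v) in Fourier space

    v = fft(u)
    for n in 1:Nt
        Nv = nl(v)
        a  = E2 .* v .+ Q .* Nv;  Na = nl(a)
        b  = E2 .* v .+ Q .* Na;  Nb = nl(b)
        c  = E2 .* a .+ Q .* (2 .* Nb .- Nv);  Nc = nl(c)
        v  = E .* v .+ Nv .* f1 .+ 2 .* (Na .+ Nb) .* f2 .+ Nc .* f3
    end
    real(ifft(v))
end

Since L is diagonal in Fourier space, the coefficient arrays are computed once up front, and each step then costs four FFT pairs.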

Related references:

S. M. Cox and P. C. Matthews. Exponential time differencing for stiff systems. J. Comp. Phys., 176:430ā€“455, 2002.
A.-K. Kassam and L. N. Trefethen. Fourth-order time-stepping for stiff PDEs. SIAM J. Sci. Comput., 26:1214ā€“1233, 2005.
H. Montanelli and N. Bootland. Solving periodic semilinear stiff PDEs in 1D, 2D, and 3D with exponential integrators. arXiv:1604.08900, 2016.

3 Likes

To do exponential integrators well we really need to up our game with expmv methods. The following two libraries exist:

https://github.com/marcusps/ExpmV.jl

https://github.com/acroy/Expokit.jl

The first actually uses a dense normest2, so it's not the true Higham algorithm yet, and that slows it down. The latter uses a Krylov-based method, but doesn't have all of the phi-function calculations yet, which are necessary for high-order methods like ETDRK4.

Though with L coming from a spectral discretization, it's probably small and dense, so it might not need these tools as much? But the lack of such tools (i.e. a way to choose between dense exp and these different expmvs) is what is slowing down the generic DiffEq implementation of exponential integrators.

I'm guessing no: the point of spectral algorithms is that the matrix is dense, but the matvecs are O(n log n) because of FFTs. So what it really needs is a matrix-free interface where only A*x is used.
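As a minimal sketch of what such an interface could look like (the type and its field are hypothetical, not an existing package API):

using FFTW

# A hypothetical matrix-free operator that is diagonal in Fourier space,
# so A*v costs one FFT pair instead of a dense O(n^2) matvec.
struct SpectralOperator
    sym::Vector{Float64}   # Fourier symbol, e.g. k.^2 .- k.^4 for KS
end

Base.:*(A::SpectralOperator, v::AbstractVector{<:Real}) =
    real(ifft(A.sym .* fft(v)))

Base.size(A::SpectralOperator) = (length(A.sym), length(A.sym))

A Krylov-based expmv needs only this action plus size and element-type information.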

While the FFT is matrix-free, the exponential integrator is not matrix-free, since exp(A*t) is a dense operator which is unknown but necessary. That's why expmv(A,dt,v) = exp(A*dt)*v is important for any sufficiently large problem to be solved with an exponential integrator.

When n is small, exp(A*dt) can be cached. Most FEM/FDM discretizations are not in this domain for sufficiently difficult problems, but spectral discretizations may be in this domain? If they are, that simplifies things a bit.

One question I have, though: does \ for an FFT call the inverse FFT? If so, that would be nice, because then the algorithm could be written more easily in terms of a generic A than what's done in this benchmark. Otherwise we'd need a good way to get inv(A) from the user for the specific case of FFTs.