As a warm-up and preliminary test for writing a distributed-memory fluids code in Julia, I did some benchmarking of Julia on a simple numerical integration algorithm for the 1d Kuramoto-Sivashinsky equation.
Briefly, the results show that in execution speed, Julia beats Python and Matlab significantly and is competitive with C++ and Fortran. In line count versus speed, it hits the sweet spot: almost as compact as Python (~20 lines), almost as fast as Fortran (~60 lines).
The 1d Kuramoto-Sivashinsky (KS) equation is
u_t + u_{xx} + u_{xxxx} + u u_x = 0
where x is space, t is time, and subscripts indicate differentiation. The algorithm assumes a finite periodic domain, discretizes space with a Fourier decomposition and time with 2nd-order Crank-Nicolson/Adams-Bashforth finite-difference time-stepping. The implementations in all languages use the same FFTW library, so the benchmark measures language-specific overhead for things like bounds-checking and allocation of temporary arrays. The equation and algorithm are a sort of super-reduced toy model of Navier-Stokes and its algorithms.
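For concreteness, here is a minimal sketch of that algorithm in Julia, close in spirit to the benchmarked code but not identical to it; the domain length, resolution, timestep, and initial condition below are illustrative assumptions.

using FFTW

function ksintegrate(u, Lx, dt, Nt)
    Nx = length(u)
    kx = vcat(0:Nx÷2-1, 0, -Nx÷2+1:-1)    # integer wavenumbers, Nyquist mode zeroed
    alpha = 2pi .* kx ./ Lx                # real wavenumbers
    D = im .* alpha                        # spectral d/dx
    L = alpha.^2 .- alpha.^4               # KS linear operator, diagonal in Fourier space

    A_inv = 1 ./ (1 .- (dt/2) .* L)        # inverse of Crank-Nicolson LHS (I - dt/2 L)
    B = 1 .+ (dt/2) .* L                   # Crank-Nicolson RHS operator (I + dt/2 L)

    v = fft(u)                             # u in Fourier space
    Nn = -0.5 .* D .* fft(u.^2)            # nonlinear term -u u_x written as -(u^2/2)_x
    Nn1 = Nn                               # previous-step nonlinearity; first step is effectively Euler

    for n = 1:Nt
        Nn1 = Nn
        u = real(ifft(v))
        Nn = -0.5 .* D .* fft(u.^2)
        @. v = A_inv * (B*v + 1.5dt*Nn - 0.5dt*Nn1)   # CNAB2 update
    end
    real(ifft(v))
end

Lx, Nx = 32pi, 512
x = Lx .* (0:Nx-1) ./ Nx
u = @. cos(x) * (1 + sin(x))               # illustrative initial condition
u = ksintegrate(u, Lx, 1/16, 1600)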
I'm eager for feedback from more knowledgeable Julia people, and I could use some help implementing the algorithm in Numba-Python, Dedalus-Python, and DifferentialEquations.jl & ApproxFun.jl. I could also use help figuring out why the Fortran code seg faults for N_x \geq 8192.
For some reason this works better (runs faster) than the unrolled loops produced by the Julia compiler.
Yes, Nu .= u is still slower than you'd expect, though copy!(Nu,u) does about as well as the unrolled loop. This is because broadcast still has a cost, which hopefully gets improved in the future.
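A minimal way to check this on your own machine with BenchmarkTools.jl (note that copy! was renamed copyto! in Julia 0.7, so the snippet uses the newer name):

using BenchmarkTools

function manual_copy!(dst, src)            # the hand-unrolled loop
    @inbounds for i in eachindex(dst, src)
        dst[i] = src[i]
    end
    dst
end

u = rand(Complex{Float64}, 8192)
Nu = similar(u)

@btime $Nu .= $u                           # broadcast assignment
@btime copyto!($Nu, $u)                    # copy! in Julia 0.6, copyto! from 0.7 on
@btime manual_copy!($Nu, $u)               # explicit loop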
A DiffEq Crank-Nicolson Adams-Bashforth is a ways off and there are other things which have higher priority right now. But I'll revisit this later. Thanks for making this.
One further way you can improve the benchmark, now that it's all unrolled, is to add Threads.@threads to each of the loops (make sure you export JULIA_NUM_THREADS=n before opening Julia: https://docs.julialang.org/en/stable/manual/parallel-computing/#Setup-1). Then both the loops and the FFTs will be multithreaded. It's not cheating, because MATLAB implicitly multithreads its vectorized computations, so this just puts Julia in a fair place.
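For instance, a hypothetical sketch of threading one stage of the computation (the function name and the FFTW thread setting are illustrative, not from the benchmark code):

using FFTW

FFTW.set_num_threads(Threads.nthreads())   # let FFTW's transforms use the same thread count

function nonlinear_square!(Nu, u)           # e.g. the u.^2 stage of the nonlinear term
    Threads.@threads for i in eachindex(Nu, u)
        @inbounds Nu[i] = u[i] * u[i]
    end
    Nu
end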
Thanks for the optimization suggestions. Running Julia as "julia -O3" shaved 5% or so off the Julia timings and got Julia-inplace below C++ and Julia-unrolled below Fortran. I've updated the notebook to reflect this.
Try ulimit -s unlimited in the shell before running the Fortran code. In my experience Fortran compilers like to put temporary/local arrays on the stack, but the default stack size is quite limited (8 MiB) on most systems nowadays. But I admit I haven't looked at your code.
I later realized that simply clicking the hyperlink opens the right version of the code on GitHub. So I found the code.
However, I am very much used to "Open in new tab" or downloading a file through right click, both of which failed for me on this link.
Any idea why this is not working? Is this a general issue with Jupyter notebooks (I have no experience with Jupyter).
Tried it, but it didn't work. After compiling and running with ulimit -s unlimited and Nx=8192, the stack trace in gdb is
gibson@sophist$ gdb ksbenchmark-f90
...
Program received signal SIGSEGV, Segmentation fault.
0x00002aaaaadc9153 in dfftw_execute_ () from /usr/lib64/libfftw3.so.3
(gdb) where
#0 0x00002aaaaadc9153 in dfftw_execute_ () from /usr/lib64/libfftw3.so.3
#1 0x00000000004010b0 in ksintegrate (nt=3200,
u=<error reading variable: value requires 131072 bytes, which is more than max-value-size>) at ksbenchmark.f90:118
#2 0x00000000004018bb in MAIN__ () at ksbenchmark.f90:55
(gdb)
Clearly a memory error. I guess an Nx=8192 complex double-precision array totals 8192 * 2 * 8 = 131072 bytes. I suppose I'll try to understand gdb's max-value-size limit (the message above comes from gdb declining to read a value that large, not from Fortran itself).
I did revise the Fortran code to use heap-allocated arrays but got the same error, which is hard to understand. I'll double-check that and post the revised code.
Thank you! Links in these documents are not direct, but typically route a file specified in the URL through a "viewer" webservice that reads the file, translates it to nicely formatted HTML, and displays that.
I'm more and more convinced that Julia doesn't necessarily win the pure speed race, but usually outperforms other languages when taking into account the time you spend writing the code (very roughly measured by the number of lines you have to write).
I think if you write Julia like it's C (that is, using pointers, etc.) then it can tie with C and Fortran.
No one writes Julia code like that, and for good reason, but it's not fair to say Julia doesn't "win", since that's more a statement about coding style than about the language itself.
Well, of course Julia doesn't necessarily win. There's a way to rewrite the C or Fortran algorithms here to make them as fast as the Julia version; I don't think anyone would argue that. But I think it's safe to say that Julia's optimizations may actually beat many scientists' ability to hand-optimize C/Fortran code. And what you get in return is something much more concise, which then makes it easier to add all the bells and whistles. You can probably dig into the Fortran code here and pull out another 10%, but in the Julia version you can add Threads.@threads and maybe add PI adaptive timestepping in 20 lines, and end up with something orders of magnitude faster while still having more concise code. That's why Julia does well.
And actually, if Julia gets an @noalias macro for creating local scopes which guarantee arrays aren't aliasing (or something like C's restrict), Julia could probably match the last few compiler optimizations which require the absence of aliasing. At that point it probably comes down to which compiler is used more than anything else, and gcc vs clang/LLVM seems to be a toss-up: usually very similar, but on some problems one or the other "wins", and you can never guess which one. In fact, from what I've seen, asking people to recompile the reference C code with clang and checking the timing difference accounts for most benchmark variation, i.e. the difference between C and Julia is usually mostly measuring a difference between gcc and clang on that algorithm.
It would be interesting to have benchmarks for ETDRK4, which is designed for stiff PDEs such as KS. The special functions that are required (at least with periodic boundary conditions) could be extracted from here (a sketch of the coefficient evaluation follows these references):
S. M. Cox and P. C. Matthews. Exponential time differencing for stiff systems. J. Comput. Phys., 176:430–455, 2002.
A.-K. Kassam and L. N. Trefethen. Fourth-order time-stepping for stiff PDEs. SIAM J. Sci. Comput., 26:1214–1233, 2005.
H. Montanelli and N. Bootland. Solving periodic semilinear stiff PDEs in 1D, 2D, and 3D with exponential integrators. arXiv:1604.08900, 2016.
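For reference, the contour-integral trick from the Kassam-Trefethen paper for evaluating those special functions stably translates to Julia along these lines; the KS operator, grid size, and timestep below are illustrative assumptions, and this is a sketch rather than tested code:

using Statistics: mean

N, Lx, h = 128, 32pi, 0.25                 # assumed resolution, domain, timestep
k = vcat(0:N÷2-1, 0, -N÷2+1:-1)
alpha = 2pi .* k ./ Lx
L = alpha.^2 .- alpha.^4                   # KS linear operator, diagonal in Fourier space

E  = exp.(h .* L)                          # exact exponentials for the ETDRK4 stages
E2 = exp.(h .* L ./ 2)

M  = 16                                    # quadrature points on the contour
r  = exp.(im .* pi .* ((1:M) .- 0.5) ./ M) # upper half of the unit circle
LR = h .* L .+ transpose(r)                # N x M matrix: contour shifted to each h*L[i]
Q  = h .* real.(vec(mean((exp.(LR ./ 2) .- 1) ./ LR, dims=2)))
f1 = h .* real.(vec(mean((-4 .- LR .+ exp.(LR) .* (4 .- 3 .* LR .+ LR.^2)) ./ LR.^3, dims=2)))
f2 = h .* real.(vec(mean((2 .+ LR .+ exp.(LR) .* (-2 .+ LR)) ./ LR.^3, dims=2)))
f3 = h .* real.(vec(mean((-4 .- 3 .* LR .- LR.^2 .+ exp.(LR) .* (4 .- LR)) ./ LR.^3, dims=2)))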
The first actually uses a dense normest2, so it's not the true Higham algorithm yet, and that slows it down. The second uses a Krylov-based method, but doesn't have all of the phi-function calculations yet, which are necessary for high-order methods like ETDRK4.
Though with L coming from a spectral discretization it's probably small and dense, so it might not need these tools as much? But the lack of such tools (i.e. a way to choose between dense exp and these different expmv implementations) is what is slowing down the generic DiffEq implementation of exponential integrators.
I'm guessing no: the point of spectral algorithms is that the matrix is dense, but the matvecs are O(n log n) because of FFTs. So what it really needs is a matrix-free interface where only A*x is used.
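For example, here is a hypothetical sketch of the KS linear operator applied matrix-free, so the dense N x N matrix is never materialized and an integrator only ever sees v -> applyL(v, Lx):

using FFTW

function applyL(v, Lx)                     # apply -d^2/dx^2 - d^4/dx^4 in O(N log N)
    N = length(v)
    k = vcat(0:N÷2-1, 0, -N÷2+1:-1)
    alpha = 2pi .* k ./ Lx
    real(ifft((alpha.^2 .- alpha.^4) .* fft(v)))   # diagonal multiply in Fourier space
end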
While the FFT is matrix-free, the exponential integrator is not matrix-free, since exp(A*t) is a dense operator which is unknown but necessary. That's why expmv(A,dt,v) = exp(A*dt)*v is important for any sufficiently large problem to be solved with an exponential integrator.
When n is small, exp(A*dt) can be cached. Most FEM/FDM discretizations are not in this domain for sufficiently difficult problems, but spectral discretizations may be in this domain? If they are, that simplifies things a bit.
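In that small-n regime, caching just means computing the dense exponential once up front; a minimal sketch, with a random matrix standing in for the discretized operator:

using LinearAlgebra

function propagate(A, u0, dt, nsteps)
    E = exp(dt * A)        # dense matrix exponential, computed once: O(n^3)
    u = copy(u0)
    for _ in 1:nsteps
        u = E * u          # each step is then just an O(n^2) matvec
    end
    u
end

n, dt = 256, 0.01
u = propagate(randn(n, n), randn(n), dt, 1000)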
One question I have, though: does \ for an FFT plan call the inverse FFT? If so, that would be nice, because then the algorithm can be written more easily for a generic A than what's done in this benchmark. Otherwise we'd need a good way to get inv(A) from the user for the specific case of FFTs.
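For what it's worth, AbstractFFTs does define \ for plans as inv(p) * x, so a quick round-trip check like this should hold:

using FFTW

x = rand(Complex{Float64}, 64)
p = plan_fft(x)
y = p * x              # forward transform via the plan
x2 = p \ y             # backslash applies the inverse transform
@assert x2 ≈ x         # round-trip recovers x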