Benchmarking a simple PDE algorithm in Julia, Python, Matlab, C++, and Fortran

Yes:

julia> x = randn(4)
4-element Array{Float64,1}:
  0.971569
  2.11743 
 -0.602666
 -0.773089

julia> p = plan_fft(x)
FFTW forward plan for 4-element array of Complex{Float64}
(dft-direct-4 "n1fv_4_avx")

julia> p\(p*x)
4-element Array{Complex{Float64},1}:
  0.971569+0.0im
   2.11743+0.0im
 -0.602666+0.0im
 -0.773089+0.0im

(not sure what that buys you though)

About caching exp(A dt): I would imagine the dense matrix doesn’t even fit into memory for sufficiently interesting problems, but I’ll let the experts answer.

When the boundary conditions are periodic and a Fourier spectral method is used, the stiff linear operator L usually operates diagonally on the coefficients. It doesn’t get any better than that, and it’s suitable for this kind of benchmarking.
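To make the “L operates diagonally” point concrete, here is a minimal numpy sketch (parameter values are illustrative, not taken from the benchmark): for the Kuramoto-Sivashinsky linear term L u = -u_xx - u_xxxx on a periodic domain, the Fourier coefficients transform as (Lu)ˆ_k = (k² - k⁴) uˆ_k, so applying L, or even the exact exponential exp(L dt), is just an elementwise multiply — no Nx×Nx dense matrix is ever formed.

```python
import numpy as np

Lx, Nx, dt = 32*np.pi, 128, 1/16             # illustrative parameters
x = Lx/Nx * np.arange(Nx)
k = 2*np.pi * np.fft.rfftfreq(Nx, d=Lx/Nx)   # wavenumbers of the real FFT
Lhat = k**2 - k**4                           # diagonal symbol of -d_xx - d_xxxx

u = np.cos(x/16)                             # a single Fourier mode, k = 1/16
uhat = np.fft.rfft(u)

Lu  = np.fft.irfft(Lhat * uhat)              # apply L: elementwise multiply
udt = np.fft.irfft(np.exp(Lhat*dt) * uhat)   # exact linear step exp(L dt) u
```

For the single mode above, both operations reduce to scaling the mode by (k² - k⁴) and exp((k² - k⁴) dt) respectively, which is why this setup is hard to beat for benchmarking.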

I’m not sure what the right way forward is for aperiodic boundary conditions, but this gets opinionated quite quickly.

That’s what the issue is. Also, if dt changes, then it’s expensive to redo the expm. It’s a lot like iterative vs. non-iterative methods for solving linear equations.

Grr… no! :rage: Spectral methods have banded matrices (at least as constructed in ApproxFun / [Olver & Townsend 2012] / [Slevinsky & Olver 2017]). Don’t use collocation!!

2 Likes

The Fortran segfault should be fixed by the PR that I just opened. I also filed two other issues about the Fortran code.

1 Like

I made an (unregistered and undocumented) package to perform FFTW’s in-place real FFTs (rfft!). Only a little modification to your code is needed; see it in ksintegrateInplacerfft.jl
I’m getting the following times on my machine:

julia> ksbenchmark(2^17,ksintegrateInplace)
elapsed time: 21.772487676 seconds
elapsed time: 19.861860891 seconds
elapsed time: 22.03936812 seconds
elapsed time: 21.817528044 seconds
elapsed time: 20.547672813 seconds
avgtime = 21.066607466999997
21.066607466999997

julia> ksbenchmark(2^17,ksintegrateInplacerfft)
elapsed time: 12.375200435 seconds
elapsed time: 12.987480447 seconds
elapsed time: 11.781444451 seconds
elapsed time: 12.473019557 seconds
elapsed time: 12.420461824 seconds
avgtime = 12.41560156975
12.41560156975

julia> ksbenchmark(2^17,ksintegrateUnrolled)
elapsed time: 20.483904677 seconds
elapsed time: 18.830359237 seconds
elapsed time: 20.358235657 seconds
elapsed time: 18.956146378 seconds
elapsed time: 20.313768637 seconds
avgtime = 19.61462747725
19.61462747725

One can also use the in-place rfft in the other languages (Fortran and C++ at least), but I doubt it could be done with so little modification to the source code.


PS: Oddly, when I use my module on the unrolled case it becomes orders of magnitude slower. My module is probably still missing some optimization to make my PaddedArray work like a normal array in indexed for loops.


PS2: If you want, I can make a PR to your repository.

2 Likes

Thanks! I merged and updated the new Fortran data into the notebook plots. With actual large-Nx data, the Julia unrolled code is in a dead heat with Fortran. I also fixed up your two other issues. BTW, your icon looks an awful lot like a modified Larks’ Tongues in Aspic album cover. King Crimson fan?

That’d be great, please do. I have the benchmark algorithm with real-to-complex FFTs implemented in C++ and Fortran already. Both are pretty simple modifications. Python can do it with Numpy, though I haven’t coded it yet. Not sure about Matlab. Showing the source modifications and timings for this in all languages will be another good comparison.
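For reference, the Numpy version of the real-to-complex transform looks like this (a sketch of the API only, not the benchmark code): a real length-Nx field needs only Nx/2 + 1 complex coefficients, which is where the roughly-2x savings comes from.

```python
import numpy as np

Nx = 8
u = np.random.randn(Nx)        # real-valued field

uhat = np.fft.rfft(u)          # only Nx//2 + 1 complex coefficients stored
print(uhat.shape)              # (5,)

v = np.fft.irfft(uhat, n=Nx)   # complex-to-real inverse recovers u
print(np.allclose(v, u))       # True
```

Matlab’s fft always returns the full complex spectrum, so the same trick may not apply there as directly.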

Thanks, I will look this up. My fluids code (www.channelflow.org) implements a number of higher-order methods in a modular way; probably the best is 4th-order semi-implicit backwards differentiation. I’ve been meaning to learn more about exponential integrators. For this benchmark I did CNAB2 to avoid coding more complex methods from scratch in N languages, especially the painful languages. But it could be a good illustration of how Julia makes complex algorithms much easier to code.
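For readers unfamiliar with CNAB2: it treats the stiff linear term of u_t = L u + N(u) with Crank-Nicolson and the nonlinear term with second-order Adams-Bashforth, and when L is diagonal in Fourier space the implicit solve is just an elementwise divide. A minimal sketch on a scalar test problem (not the notebook’s code; the bootstrap choice is an assumption):

```python
import numpy as np

def cnab2(Lhat, N, uhat, dt, nsteps):
    """CNAB2 sketch for u_t = L*u + N(u): Crank-Nicolson on the diagonal
    linear term Lhat (elementwise), 2nd-order Adams-Bashforth on N.
    The first step reduces to forward Euler for N (a common bootstrap)."""
    A = 1 - dt/2*Lhat            # implicit CN operator (diagonal)
    B = 1 + dt/2*Lhat            # explicit CN operator
    Nprev = N(uhat)
    for _ in range(nsteps):
        Ncur = N(uhat)
        uhat = (B*uhat + dt*(1.5*Ncur - 0.5*Nprev)) / A
        Nprev = Ncur
    return uhat

# Scalar sanity check: u' = -u, u(0) = 1, integrated to t = 1, ~ exp(-1)
u1 = cnab2(-1.0, lambda u: 0.0*u, 1.0, 0.01, 100)
```

The same step applies unchanged to arrays of Fourier coefficients, since all the operator arithmetic is elementwise.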

This is the best PDE benchmark in Julia that I’ve seen, and the best I can find on Discourse. But it was done a year ago, and since we now have Julia 1.0, things may be different now, like the relative performance of the naive, in-place, and unrolled Julia code. Can I ask for your impressions about the current state of affairs?

I’m still new to the language and a bit confused, and I also don’t have as much time in September as I’d like, so maybe I’m asking too many stupid questions while figuring out what to do and where to start. If this is one of them, I apologize.

The performance for those kinds of things will be roughly the same. The difference will just be some extra optimizations from a newer LLVM version, but that’s about it.

Where we’re going next is trying to get package-based solutions to match the hand-optimized versions. @John_Gibson’s work is great because it gives us a baseline to keep testing against. This is what it’s looking like so far: we have a WIP PR to add the linear specialization handling to hopefully finish this up.

Basically we have all of the methods in DifferentialEquations.jl and ApproxFun.jl by now, but they were written for fully nonlinear equations. Once 501 merges, the DiffEq version of the algorithm will be very similar to the hand-optimized one when presented with linear operators (almost the same, except one iteration of Newton’s method instead of a linear solve, so… the same), and then we’ll just need to make sure everything unnecessary compiles away and we should be good. Fingers crossed.
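The “one Newton iteration instead of a linear solve… the same” remark can be checked directly: for a linear residual the Jacobian is constant, so Newton converges in one step. A toy sketch (not the DiffEq internals):

```python
import numpy as np

# For a linear residual f(u) = A u - b, the Jacobian is just A, so a single
# Newton step from ANY starting guess lands exactly on the solution:
#   u1 = u0 - J^{-1} f(u0) = u0 - A^{-1}(A u0 - b) = A^{-1} b
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
u0 = np.zeros(2)                           # arbitrary starting guess

u1 = u0 - np.linalg.solve(A, A @ u0 - b)   # one Newton iteration
print(np.allclose(A @ u1, b))              # True: it is already the solve
```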

The reason this is interesting is that it shows that with pure Julia code you can build the pieces of a PDE solver in different libraries and tie them together into something fully efficient. Architecturally we knew that would be true, but it’s good to have a full integration test like this to show it in action. I think that matching this benchmark with packages will be a good sign that shared-memory PDE computing is in a good spot. We’ll need to flesh out more of the packages for spatial discretizations, but in general it’s pretty good.

Where we need to look next is parallelism. We are getting some larger examples running on the GPU with the CUDANative setup, I know @John_Gibson is talking about distributed FFTs, and we want to see if we can make our lazy PDE operators do parallel computations automatically. Julia v1.0’s broadcast overloads along with the abstract array interfaces are a great way to make full codes generalize like this, though there are a few hiccups we need to work out.

8 Likes

Thank you for the answer. It’s hard to express how much I respect your achievements and commitment to the community. I wish you all the best with this project.

1 Like

I was thinking about putting these benchmarks in my presentation about Julia, but since the code doesn’t work in Julia 1.x, that looks like a bad idea. People don’t believe benchmarks of code that doesn’t run, and installing version 0.6 (?) on my computer alongside 1.x is an unnecessary complication for me. Is there a chance this code will be updated soon, or should I look for something else? Updating this code is beyond me at the moment.

I already posted some of the problems on the WhyJulia GitHub account. The main one is that the FFT functions changed place and I don’t know where they are now.

I’ll update that notebook for Julia 1.0. Thanks for the prompt.

Thanks to you, for all your work.

I allowed myself to grab these benchmarks via the dirtiest hack I’ve used so far: taking screenshots and editing them in GIMP. I hope this is not a problem; I need to show benchmarks for one PDE in a presentation, and these are the clearest and most informative that I know of.

Good luck in your work.

1 Like

Recently I checked some pre-2018 Julia benchmarks, and it looks like so many things have changed in the internals of 1.x that the old conclusions are no longer correct. For example, the method that computed the result fastest was different than before.

In that context, I wonder whether there is a chance that someone could update these benchmarks to 1.x? I still don’t have enough knowledge to do it myself, and most probably won’t in the foreseeable future. And I use this benchmark quite often to show Julia’s speed.

Why sell Julia when you’re unable to do what you’re selling?

Oh, I am behind on updating this notebook. I’ll get to it within a week. Thanks for the ping.

1 Like

90% of my numerical computation is calculating numerical integrals, not PDEs. But the audience has many PDE people; that’s the reason.

1 Like

I just merged an update for this notebook to julia-1.2.0. The PDE codes didn’t need any changes at all. There were a few small tweaks needed for changed Plots and DelimitedFiles syntax.

8 Likes

So, if I might ask:

What is now the reason for the performance difference between Julia and Fortran, as of Julia 1.2?

Is it just a difference between compilers, or some language-level optimisations (is there a @noalias macro in Julia now?)?

1 Like