# Implementing parallel fluids code in Julia (parallel FFTW, MPI.jl, ...)

#1

I’m considering implementing a parallel fluid simulation code in Julia: either reimplementing my Fourier-Chebyshev http://www.channelflow.org code, or starting with isotropic turbulence in a triply-periodic cube as a simpler test case.

But it’s not clear to me how to do it with Julia’s current parallel infrastructure. MPI.jl appears to be limited to julia-0.4 (https://github.com/JuliaParallel/MPI.jl), and this 2014 post says Julia does not yet support distributed-memory FFTW calls (https://groups.google.com/forum/#!topic/julia-users/ww-57NucHKA).

I would actually like to leave MPI behind and do parallelism at a higher level with distributed arrays and native Julia parallel constructs. But distributed-memory FFTW is a must, including FFTW’s distributed-memory transpose functions.

Anyone out there with relevant Julia expertise and experience in this kind of distributed-memory scientific computation? Tips and pointers would be much appreciated. @stevengj?

#2

Look at ClusterManagers.jl

From what I have read, I think you can use this to make Julia’s internal parallelism use the fast (InfiniBand) interconnects of an HPC cluster, getting you the right speed while also giving you an easier programming model. Otherwise Julia falls back to TCP?

That said, I forget where I read that… and haven’t given it a try. Hopefully, someone does know this and will clarify. I am interested in starting to do something like this as well.

#3

MPI.jl works fine with 0.5. It probably needs some updates before it works with 0.6, though. The comment about 0.4 only covers the MPIManager, and it was added when 0.3 was the latest released version of Julia.

You are not the first to show interest in distributed-memory FFTW, but I’m not aware of anybody who has wrapped the functionality yet. It might be possible to get it to work with a combination of MPI.jl and DistributedArrays.jl, but it will require some work.

#4

Are the Julia constructs compared somewhere against MPI? How does this work together with e.g. Infiniband? Or is MPI still used in the backend somehow?

#5

+1 on the original question. I’d love to see an experience report from a Julia port of a “standard” scientific computing application using e.g. MPI+OpenMP with FFTs and distributed linear algebra, run on supercomputers. For all the talk about Julia being “Designed for parallelism and distributed computation”, this is strangely lacking.

#6

The Cluster Managers with custom transports section in the Julia documentation has a little information on creating custom transport layers for Julia’s native distributed-memory parallelism. There’s an example in the Julia distribution of an implementation using the ZMQ library. I’ve searched the internet thoroughly for an InfiniBand implementation but haven’t been able to find one. I looked into doing it myself, and it looks like a lot of work. However, I know very little about networking, so maybe it wouldn’t be so bad for an expert.

On a personal note, I’ve found Julia’s native parallel programming constructs to work great for very coarse-grained, embarrassingly parallel calculations. I’d love to see some examples of pure Julia implementations of algorithms with more complex communication patterns. Coming from an MPI programming background, the Julia constructs seem quite awkward for this, but it could well be that I’m just missing something; I should really get a deeper understanding of Channels and RemoteChannels.
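For anyone else coming from MPI, the basic producer–consumer pattern with Julia’s Channels looks like this. This is a minimal single-process sketch (the function name `produce_squares` is mine, not from any package); `RemoteChannel` follows the same `put!`/`take!` pattern across worker processes.

```julia
# One task produces values into a buffered Channel; the consumer simply
# iterates the channel until the producer closes it.
function produce_squares(n)
    ch = Channel{Int}(n)          # buffered channel with capacity n
    @async begin
        for i in 1:n
            put!(ch, i^2)         # send a value
        end
        close(ch)                 # signal completion to consumers
    end
    return ch
end

results = collect(produce_squares(4))   # [1, 4, 9, 16]
```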

#7

@ChrisRackauckas and @andreasnoack, thanks for the clarifications and advice.

@antoine-levitt: Agreed. I have high hopes for Julia as a game-changer in scientific computing, and I think that a real distributed-memory scientific HPC code would be an important demonstration of Julia’s potential to replace the ageing and painful C/Fortran/MPI framework. I recently gave a “Why Julia?” talk to a group of fluid dynamicists, and this was the main response I got: show me benchmarks of a Julia HPC code against C and Fortran.

I think my work is cut out for me: try to get distributed-memory FFTW in Julia working and coordinated with DistributedArrays.jl and MPI.jl. Once that’s in place, write a parallel fluid simulation code. Piece of cake!

#9

@stevengj : What are the prospects for getting distributed-memory FFTW in Julia working with DistributedArrays? Is there a fundamental structural clash between C/MPI FFTW and DistributedArrays.jl that makes it difficult and unlikely, or is it clear how to do it and just not done yet?

Sorry to poke you individually, but you’re the expert on this. I’m willing to do the work if I can, but I will need some guidance.

#10

At first glance, I don’t think there should be any particular problem, since DistributedArrays.jl lets you specify the distribution. So, you would specify the same distribution as FFTW, adjusted for row-major vs. column-major (i.e. divided along the last dimension in Julia, corresponding to the first dimension in FFTW), create a plan, and execute it on the local part of the array in each process.
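To make the distribution-matching concrete: FFTW’s MPI interface by default divides the slowest (row-major first) dimension into contiguous blocks of size ⌈n/P⌉, which in Julia’s column-major layout corresponds to the *last* dimension. A sketch of that arithmetic (the helper `slab_range` is hypothetical, not part of FFTW or DistributedArrays.jl) shows the index range each rank would own, i.e. the distribution a matching DArray would need to specify:

```julia
# FFTW-style default slab decomposition: blocks of cld(n, nprocs) indices,
# assigned to ranks in order; trailing ranks may own fewer or zero indices.
function slab_range(n::Int, nprocs::Int, rank::Int)   # rank is 0-based, MPI style
    block = cld(n, nprocs)                # FFTW's default block size
    lo = rank * block + 1
    hi = min(n, (rank + 1) * block)
    lo > n ? (1:0) : (lo:hi)              # ranks past the data own nothing
end
```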

A slight annoyance is building the FFTW dependency, because Julia already ships with the serial FFTW (although this is likely to be split into a separate package in the future). You would need to make sure to build the FFTW MPI library using the same version of FFTW as is shipped with Julia, since the MPI FFTW calls the serial FFTW.

#11

I’m curious about your priorities here. Unless you have enormous domains in view, it would seem that the best target is a cluster with at least a few nodes having enough memory to hold an entire scalar field, so all transforms can be done locally, ideally with GPU or Phi acceleration. (For the record, I use FFTW almost every day, and am eternally grateful for its existence, but it’s not the only player in town.) You have to deal with data distribution for field products, time stepping, and so on anyway, so why start with transforms?

#12

The classic case of channel flow has a fluid in a 3d rectangular box with rigid walls at y = ±1. The boundary conditions are no-slip, i.e. zero velocity at the walls, and periodic in the wall-parallel directions x, z. The appropriate spatial discretization is then a Fourier × Chebyshev × Fourier spectral expansion in x, y, z. At each time step you have to Fourier transform in each of the three directions (a real cosine transform in y for the Chebyshev expansion) in order to compute the nonlinear term u · ∇u in the Navier-Stokes equation. To do this you transpose the data in distributed memory so that it is x-major, do Ny × Nz local transforms in x, then transpose so the data is y-major, do Nx × Nz local y transforms, and so on. Distributed-memory FFTW has all this worked out: you call an appropriate 3d transform function and it does all the distributed-memory transposes and local transforms for you.
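The transform-transpose-transform pattern can be sketched serially: a 3d DFT computed as 1d transforms along the leading dimension, with `permutedims` standing in for the distributed transpose between stages. This is only an illustration of the data movement (the naive O(n²) `dft_dim1` below stands in for the local FFTW calls a real code would make):

```julia
# Naive 1-d DFT of every pencil along dimension 1.
function dft_dim1(A::Array{ComplexF64,3})
    n = size(A, 1)
    F = [exp(-2im * pi * (j - 1) * (k - 1) / n) for j in 1:n, k in 1:n]
    B = similar(A)
    for k in axes(A, 3), j in axes(A, 2)
        B[:, j, k] = F * A[:, j, k]       # local 1-d transform of one pencil
    end
    return B
end

# 3-d DFT as: transform in x, "transpose" so y leads, transform in y, etc.
function dft3(A::Array{ComplexF64,3})
    B = dft_dim1(A)                       # transform in x
    B = permutedims(B, (2, 3, 1))         # serial analogue of the distributed transpose
    B = dft_dim1(B)                       # transform in y
    B = permutedims(B, (2, 3, 1))
    B = dft_dim1(B)                       # transform in z
    permutedims(B, (2, 3, 1))             # back to (x, y, z) ordering
end
```

In the distributed version each `permutedims` becomes an all-to-all block exchange, which is exactly the piece FFTW’s MPI transpose routines provide.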

P.S. How do you turn on LaTeX markup in Discourse?

#13

For even very modest-sized spatial domains and discretizations, say 128 × 129 × 128, distributed memory and parallel processing are a win in execution speed, even though that amounts to only a few hundred megabytes of data. But simulations of thousands × hundreds × thousands of grid points are commonplace, and for these distributed memory is a necessity.

#14

The transforms are the most complicated part of the algorithm from a data-management perspective, because of all the distributed-memory data transposes necessary for the multi-d FFTs. If I can get the distributed-memory FFTs and transposes under control, the rest of the algorithm is totally manageable, even though it’s more complex mathematically. For an Nx, Ny, Nz grid, the rest of the algorithm is just Nx * Nz independent and trivially parallelizable 1d boundary value problems in y.
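The “Nx × Nz independent 1d boundary value problems in y” can be sketched as follows, here with a second-order finite-difference Helmholtz solve (u″ − λu = f, u(±1) = 0) standing in for the Chebyshev tau or collocation solver a real spectral code would use; the function name and layout are mine, for illustration only. Each (kx, kz) mode gets its own tridiagonal system, so the loop is trivially parallelizable (e.g. with `Threads.@threads` or one mode range per worker):

```julia
using LinearAlgebra

# Solve u'' - lambda*u = f with u(-1) = u(1) = 0 independently for each
# (kx, kz) mode, on Ny interior points of a uniform grid in y.
function solve_helmholtz_modes(f::Array{Float64,3}, lambda::Matrix{Float64})
    Nx, Ny, Nz = size(f)
    y = range(-1.0, 1.0; length = Ny + 2)        # grid including boundary points
    h = step(y)
    u = similar(f)
    for kz in 1:Nz, kx in 1:Nx                   # independent 1-d problems
        d  = fill(-2.0 / h^2 - lambda[kx, kz], Ny)   # diagonal entries
        od = fill(1.0 / h^2, Ny - 1)                 # off-diagonal entries
        A  = Tridiagonal(od, d, copy(od))
        u[kx, :, kz] = A \ f[kx, :, kz]          # one tridiagonal solve per mode
    end
    return u
end
```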

#15

Hi @John_Gibson ,

I’m also considering implementing a parallel fluid simulation code in Julia, more specifically the isotropic case. Have you had any success getting distributed-memory FFTW working in Julia? I believe the only missing feature I need is the distributed transpose (ideally in-place); the rest could be done with the serial FFTW.
I’m planning on tackling the distributed transpose problem first. That is, if no one has implemented it yet…
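The exchange the distributed transpose has to do can be sketched serially: each “rank” p starts with a z-slab of the global array and ends with an x-slab, and the all-to-all step just moves one block per (p, q) pair. This toy version (the name `slab_transpose` is mine; it assumes P divides N) copies the blocks with a loop where real code would use MPI_Alltoall or FFTW’s transpose plans:

```julia
# slabs[q] holds A[:, :, zrange(q)] for an N×N×N global array split over
# P = length(slabs) ranks; the result out[p] holds A[xrange(p), :, :].
function slab_transpose(slabs::Vector{Array{Float64,3}}, N::Int)
    P = length(slabs)
    nb = div(N, P)                        # slab thickness (assumes P divides N)
    out = [Array{Float64}(undef, nb, N, N) for _ in 1:P]
    for p in 1:P, q in 1:P
        # block that rank q "sends" to rank p: x-range of p, z-range of q
        out[p][:, :, (q-1)*nb+1 : q*nb] = slabs[q][(p-1)*nb+1 : p*nb, :, :]
    end
    return out
end
```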

#16

I would suggest doing a serial version first, both as a proof of concept and a reference for benchmarking any future parallel versions.

#17

I have not started coding yet. I spoke to a number of people about this problem at JuliaCon last week. Andreas Noack and Viral Shah both indicated that DistributedArrays should be able to do the transposes via a permutation function.

If you do the periodic-cube isotropic problem, I’ll do the channel-geometry problem.

#18

You might be right about that. There’s plenty of work to do just to get the mathematical part of the algorithm working, besides the parallelism. There’s no question that Julia is ready for the serial algorithm, and it would be a good process for me to learn to use the language properly.

Maybe I’ll approach it from both ends, developing the math part serially while experimenting separately with parallel transforms and transposes.

#19

Please keep us updated if and when you do that. I’m particularly curious how feasible it is to build a working complex code entirely out of reusable parts (e.g. discretize your problem using ApproxFun, solve it using DiffEq/NLsolve…). Those packages look flexible and open enough that it might be possible.

#20

I’ll definitely keep you & this thread posted.

At this point I don’t think using ApproxFun and DiffEq is the right approach. The time-stepping and spatial discretization methods for these kinds of algorithms are just too tightly intertwined. (see pages 29-35 of http://channelflow.org/dokuwiki/lib/exe/fetch.php?media=docs:chflowguide.pdf)

I would be interested in trying to combine ApproxFun and DiffEq to solve a simpler test case of a spectral PDE algorithm, say the 1d Kuramoto-Sivashinsky equation on a periodic domain, or on a bounded domain with Dirichlet or Neumann BCs. If it works there, it’d be worth thinking about for Navier-Stokes. I have already implemented Kuramoto-Sivashinsky in a periodic domain in Julia (in order to benchmark it against the same algorithm in Matlab, Python, and C++; Julia wins!). I’d be happy to share that if anyone wants to take a shot at an ApproxFun+DiffEq implementation.